Opened 6 years ago
Closed 6 years ago
#1485 closed help (fixed)
8.4 Run Crash Problem similar to ticket #1443
| Reported by: | scottyiu | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | memory, cce, hadgem3 |
| Cc: | | Platform: | ARCHER |
| UM Version: | 8.4 | | |
Description
I have a similar problem to ticket #1443:
When running my 8.4 model, I get the messages:
[NID 04975] 2015-02-16 08:52:35 Apid 12964289: initiated application termination
[NID 04975] 2015-02-16 08:52:36 Apid 12964289: OOM killer terminated this process.
xkzeq: Run failed
I have already typed
chmod -R g+rX /home/n02/n02/scottyiu
chmod -R g+rX /work/n02/n02/scottyiu
I would like to ask if there is a similar solution for my 8.4 model?
Thank you very much.
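As a minimal sketch (not part of the tooling discussed in this ticket), the OOM-killer messages quoted above could be picked out of a job's output with a short Python script. The function name `find_oom_kills` and the regular expression are assumptions based only on the two log lines shown here; other systems may format these messages differently.

```python
import re

# Pattern modelled on the log lines quoted in this ticket (an assumption,
# not an official format description).
OOM_LINE = re.compile(
    r"\[NID (?P<nid>\d+)\] (?P<when>\S+ \S+) Apid (?P<apid>\d+): "
    r"OOM killer terminated this process"
)


def find_oom_kills(log_text):
    """Return (node id, timestamp, application id) for each OOM kill found."""
    return [
        (m.group("nid"), m.group("when"), m.group("apid"))
        for m in OOM_LINE.finditer(log_text)
    ]


sample = """\
[NID 04975] 2015-02-16 08:52:35 Apid 12964289: initiated application termination
[NID 04975] 2015-02-16 08:52:36 Apid 12964289: OOM killer terminated this process.
xkzeq: Run failed
"""

for nid, when, apid in find_oom_kills(sample):
    print(f"node {nid} killed application {apid} at {when}")
```

In practice the text would be read from the job's .leave file rather than from an inline string.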
Attachments (2)
Change History (29)
comment:1 Changed 6 years ago by scottyiu
comment:2 Changed 6 years ago by grenville
Scott
I think our fix in #1443 will work for you. It simply involves rebuilding the model with the Cray 8.3.3 compiler.
We are working on a system which will make this simple, but for now, it may be easy for you to make the changes needed - are you familiar with the FCM config files?
Grenville
comment:3 Changed 6 years ago by scottyiu
Dear Grenville
Thanks for the reply. I am not very familiar with the FCM config files but am happy to do it manually. Could you please provide me with the instructions needed to carry out the fix?
Thank you very much.
Best regards,
Scott
comment:4 Changed 6 years ago by grenville
Scott
Please look at xjfve for the changes you need.
You only need to make 2 changes: navigate to model selection → compilation and run options → UM user override files
enter the gcom_path variable as /work/n02/n02/wmcginty/gcom4.5
navigate to model selection → input/output control.. → user hand edits
and add
/home/willie/hand_edits/remove_loadcomp.ed
Then rebuild the model.
Grenville
comment:5 Changed 6 years ago by scottyiu
Dear Grenville
Thank you very much for the reply. I will update if this works once the rdf nerc disc is back up on the 21st of March.
Scott
comment:6 Changed 6 years ago by grenville
Scott
OK, but don't feel that you cannot run on ARCHER - /work is still available, and running jobs which don't consume vast amounts of it should be fine.
Grenville
comment:7 Changed 6 years ago by scottyiu
Dear Grenville
Thanks for the reply. I have run the job; it failed with this error message in the .leave file:
cray-mpich/7.0.3(30):ERROR:150: Module 'cray-mpich/7.0.3' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.0.3(30):ERROR:102: Tcl command execution failed: conflict cray-mpich
The job id is: xlchs
Could you please help me with this?
Thanks
Scott
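For illustration only, a small Python sketch of how the module conflict reported above could be extracted from a .leave file. The name `find_conflicts` and the regular expression are assumptions derived from the ERROR:150 line quoted in this comment, not from any documented format.

```python
import re

# Matches the ERROR:150 line as it appears in this ticket (assumed format).
CONFLICT = re.compile(
    r"ERROR:150: Module '(?P<wanted>[^']+)' conflicts with the "
    r"currently loaded module\(s\) '(?P<loaded>[^']+)'"
)


def find_conflicts(log_text):
    """Return (requested module, already loaded module) pairs."""
    return [
        (m.group("wanted"), m.group("loaded"))
        for m in CONFLICT.finditer(log_text)
    ]


sample = (
    "cray-mpich/7.0.3(30):ERROR:150: Module 'cray-mpich/7.0.3' conflicts with "
    "the currently loaded module(s) 'cray-mpich/6.3.1'\n"
    "cray-mpich/7.0.3(30):ERROR:102: Tcl command execution failed: "
    "conflict cray-mpich\n"
)

for wanted, loaded in find_conflicts(sample):
    print(f"requested {wanted}, but {loaded} is already loaded")
```

The pair it reports shows that an older cray-mpich was already in the environment when the job tried to load the newer one.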
comment:8 Changed 6 years ago by grenville
Scott
I'm not sure what's happening - I copied the PBS header information from /home/n02/n02/scottyiu/umui_runs/xlchs-086131815/umuisubmit_compile and submitted a job based on it - here's what it says about the modules loaded:
* grenvill Job: 2813215.sdb started: 09/04/15 15:16:50 host: esPP002 *
/cm/local/apps/pbspro/var/spool/mom_priv/jobs/2813215.sdb.SC[40]: .[369]: .: line 76: HISTSIZE: is read only
Currently Loaded Modulefiles:
1) modules/3.2.6.7
2) eswrap/1.1.0-1.010400.915.0
3) switch/1.0-1.0501.47124.1.93.ari
4) craype-network-aries
5) craype/2.2.0
6) cce/8.3.3
7) cray-libsci/13.0.1
8) udreg/2.3.2-1.0501.7914.1.13.ari
9) ugni/5.0-1.0501.8253.10.22.ari
10) pmi/5.0.5-1.0000.10300.134.8.ari
11) dmapp/7.0.1-1.0501.8315.8.4.ari
12) gni-headers/3.0-1.0501.8317.12.1.ari
13) xpmem/0.1-2.0501.48424.3.3.ari
14) job/1.5.5-0.1_2.0501.48066.2.43.ari
15) csa/3.0.0-1_2.0501.47112.1.91.ari
16) dvs/2.4_0.9.0-1.0501.1672.2.122.ari
17) alps/5.1.1-2.0501.8507.1.1.ari
18) rca/1.0.0-2.0501.48090.7.46.ari
19) atp/1.7.5
20) PrgEnv-cray/5.1.29
21) pbs/12.2.401.141761
22) craype-ivybridge
23) cray-mpich/7.0.3
24) packages-archer
25) budgets/1.1
26) checkScript/1.1
27) checkQueue/1.0
28) checkDisk/1.0
29) bolt/0.6
30) serialJobs/1.0
31) python/2.7.6
which looks OK - your compile.leave file says cce8.2.1 is still loaded.
Please cd to /home/n02/n02/scottyiu/umui_runs/xlchs-086131815 and type
./umuisubmit_compile
and confirm that you have the same module list as I got (above).
Grenville
comment:9 Changed 6 years ago by scottyiu
Dear Grenville
Sorry for the late reply, I was on vacation for the last two weeks.
I have logged in to ARCHER via ssh, changed directory to /home/n02/n02/scottyiu/umui_runs/xlchs-086131815 and typed ./umuisubmit_compile. This is the output message:
cray-mpich/7.1.1(30):ERROR:150: Module 'cray-mpich/7.1.1' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.1.1(30):ERROR:102: Tcl command execution failed: conflict cray-mpich
Currently Loaded Modulefiles:
1) modules/3.2.10.2
2) eswrap/1.1.0-1.010400.915.0
3) switch/1.0-1.0501.47124.1.93.ari
4) craype-network-aries
5) craype/2.2.1
6) cce/8.2.1
7) cray-libsci/12.2.0
8) udreg/2.3.2-1.0501.7914.1.13.ari
9) ugni/5.0-1.0501.8253.10.22.ari
10) pmi/5.0.6-1.0000.10439.140.2.ari
11) dmapp/7.0.1-1.0501.8315.8.4.ari
12) gni-headers/3.0-1.0501.8317.12.1.ari
13) xpmem/0.1-2.0501.48424.3.3.ari
14) job/1.5.5-0.1_2.0501.48066.2.43.ari
15) csa/3.0.0-1_2.0501.47112.1.91.ari
16) dvs/2.4_0.9.0-1.0501.1672.2.122.ari
17) alps/5.1.1-2.0501.8507.1.1.ari
18) rca/1.0.0-2.0501.48090.7.46.ari
19) atp/1.7.5
20) PrgEnv-cray/5.1.29
21) pbs/12.2.401.141761
22) craype-ivybridge
23) cray-mpich/6.3.1
24) packages-archer
25) budgets/1.1
26) checkScript/1.1
27) checkQueue/1.0
28) checkDisk/1.0
29) bolt/0.6
30) serialJobs/1.0
31) python/2.7.6
32) tkdiff/4.2
33) nano/2.2.6
34) imagemagick/6.8.8-2
35) leave_time/1.0.0
36) quickstart/1.0
37) epcc-tools/1.4
38) stat/2.1.0.1
39) cray-netcdf/4.3.1
It seems the modules are different in our versions.
Thank you very much.
Scott
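The comparison of the two module listings can be done mechanically. The following is a hedged Python sketch: `parse_modules` and `diff_modules` are hypothetical helper names, the entry pattern is inferred from the `module list` output quoted in comments 8 and 9, and the sample listings are abbreviated excerpts of those two outputs.

```python
import re

# One "N) name/version" entry, as printed in the listings in this ticket.
ENTRY = re.compile(r"^\s*\d+\)\s+(\S+)\s*$")


def parse_modules(listing):
    """Map module name to version ('' when no version is shown)."""
    modules = {}
    for line in listing.splitlines():
        m = ENTRY.match(line)
        if m:
            name, _, version = m.group(1).partition("/")
            modules[name] = version
    return modules


def diff_modules(expected, actual):
    """Return {name: (expected version, actual version)} for every mismatch."""
    names = sorted(set(expected) | set(actual))
    return {
        n: (expected.get(n), actual.get(n))
        for n in names
        if expected.get(n) != actual.get(n)
    }


# Abbreviated excerpts from the listings in comments 8 and 9.
working = """Currently Loaded Modulefiles:
 5) craype/2.2.0
 6) cce/8.3.3
 7) cray-libsci/13.0.1
 23) cray-mpich/7.0.3
 24) packages-archer
"""
failing = """Currently Loaded Modulefiles:
 5) craype/2.2.1
 6) cce/8.2.1
 7) cray-libsci/12.2.0
 23) cray-mpich/6.3.1
 24) packages-archer
 39) cray-netcdf/4.3.1
"""

differences = diff_modules(parse_modules(working), parse_modules(failing))
for name, (want, got) in differences.items():
    print(f"{name}: expected {want}, found {got}")
```

A value of `None` on either side means the module appears in only one of the two listings.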
comment:10 Changed 6 years ago by grenville
Scott
This says you are still loading cce/8.2.1.
We have refined this a little.
Please navigate to selection→compilation and run ..→UM user override files, add
~willie/overrides/use_cce_mygcom4.5_v2
to the list of user machine overrides (and remove the gcom_path variable previously set to /work/n02/n02/wmcginty/gcom4.5)
Please make the model perform a full build - go to FCM extract directories and output levels and check "Force full build".
Check the comp.leave file to see that cce8.3.3 was loaded.
Grenville
comment:11 Changed 6 years ago by grenville
Scott
Please don't do this yet - Cray have changed the default modules again, so I need to test this again first.
Grenville
comment:12 Changed 6 years ago by scottyiu
Dear Grenville
Thanks for the replies. The "Force full build" option is greyed out. Is there something I have to do to enable it?
Thank you.
Scott
comment:13 Changed 6 years ago by scottyiu
Dear Grenville
I figured out the last comment. Please just inform me when you have finished testing with the new Cray default modules.
Thank you!
Best regards,
Scott
comment:14 Changed 6 years ago by grenville
Scott
I have tested an 8.4 job built with cce8.3.7 and it worked OK; for reference, the job is xfjvz.
In order to build with cce8.3.7, you'll need to rebuild the model with the three changes as shown below:
1) include in the hand edits section -
/home/grenville/umui_jobs/hand_edits/remove_loadcomp.ed
2) add to the User paths overrides table in Compilation and Run Options → UM Override files -
%gcom_path /work/n02/n02/hum/gcom/cce8.3.7/gcom5.1 Y
In getting this working, we discovered a possible error in cnislt_distribute_poles_c96_1c.F90 - this can be avoided by
3) include
MPICH_NO_BUFFER_ALIAS_CHECK = 1
in the Scripts Inserts and Modifications section of the UMUI.
Please try this out on a small run first.
Please don't include overrides or hand edits mentioned in the earlier invalid solution to this problem.
Grenville
comment:15 Changed 6 years ago by scottyiu
Dear Grenville
Thank you for testing the job for me. I will now try submitting the job for a short run (2 days) and a 1 year run and will get back to you when the job finishes.
Thank you!
Best regards,
Scott
comment:16 Changed 6 years ago by scottyiu
Dear Grenville
I have already placed the three changes in and forced a full extract + full build. However, the .comp.leave file is still saying:
cray-mpich/7.1.1(30):ERROR:150: Module 'cray-mpich/7.1.1' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.1.1(30):ERROR:102: Tcl command execution failed: conflict cray-mpich
/home/n02/n02/scottyiu/umui_runs/xlchx-126163749/umuisubmit_compile[40]: .[369]: .: line 76: HISTSIZE: is read only
Currently Loaded Modulefiles:
1) modules/3.2.10.2
2) eswrap/1.1.0-1.010400.915.0
3) switch/1.0-1.0501.47124.1.93.ari
4) craype-network-aries
5) craype/2.2.1
6) cce/8.2.1
7) cray-libsci/12.2.0
8) udreg/2.3.2-1.0501.7914.1.13.ari
9) ugni/5.0-1.0501.8253.10.22.ari
10) pmi/5.0.6-1.0000.10439.140.2.ari
11) dmapp/7.0.1-1.0501.8315.8.4.ari
12) gni-headers/3.0-1.0501.8317.12.1.ari
13) xpmem/0.1-2.0501.48424.3.3.ari
14) job/1.5.5-0.1_2.0501.48066.2.43.ari
15) csa/3.0.0-1_2.0501.47112.1.91.ari
16) dvs/2.4_0.9.0-1.0501.1672.2.122.ari
17) alps/5.1.1-2.0501.8507.1.1.ari
18) rca/1.0.0-2.0501.48090.7.46.ari
19) atp/1.7.5
20) PrgEnv-cray/5.1.29
21) pbs/12.2.401.141761
22) craype-ivybridge
23) cray-mpich/6.3.1
24) packages-archer
25) budgets/1.1
26) checkScript/1.1
27) checkQueue/1.0
28) checkDisk/1.0
29) bolt/0.6
30) serialJobs/1.0
31) python/2.7.6
32) tkdiff/4.2
33) nano/2.2.6
34) imagemagick/6.8.8-2
35) leave_time/1.0.0
36) quickstart/1.0
37) epcc-tools/1.4
38) stat/2.1.0.1
The job id is xlchx
Thank you.
Scott
comment:17 Changed 6 years ago by grenville
Scott
Please remove
loadcomp $TARGET_MC
from your .profile and try again.
Grenville
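As a rough sketch, a short Python function can show which active lines of a .profile call loadcomp before they are removed by hand. The name `find_loadcomp_lines` and the sample .profile contents are hypothetical, and the check only catches lines that begin with the `loadcomp` command.

```python
def find_loadcomp_lines(profile_text):
    """Return (line number, line) for active lines that begin with loadcomp."""
    hits = []
    for number, line in enumerate(profile_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and shell comments; they do not load anything.
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.split()[0] == "loadcomp":
            hits.append((number, stripped))
    return hits


# Hypothetical .profile contents, for illustration only.
sample_profile = """\
# user settings
export PATH=$HOME/bin:$PATH
loadcomp $TARGET_MC
# loadcomp $TARGET_MC
"""

print(find_loadcomp_lines(sample_profile))
```

Commented-out lines are ignored, so only the line that would actually switch the compiler environment at login is reported.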
comment:18 Changed 6 years ago by scottyiu
Dear Grenville
Thank you for the reply. I will submit the job again and get back to you once it has run.
Scott
comment:19 Changed 6 years ago by scottyiu
Dear Grenville
I have submitted the job. It still cannot get past the compile stage (there is only a .comp.leave file, no .leave file). I have attached the .comp.leave file.
It says at the end of the .comp.leave file:
MODULE locate_hdps_mod
^
ftn-855 crayftn: ERROR LOCATE_HDPS_MOD, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 9, Column = 8
The compiler has detected errors in module "LOCATE_HDPS_MOD". No module information file will be created for this module.
id = i_lkup_u( my_floor(temp) )
^
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 124, Column = 40
A subscript must be a scalar integer expression.
id = i_lkup_p( 1 + my_floor(temp) )
^
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 142, Column = 42
A subscript must be a scalar integer expression.
id = j_lkup_v( my_floor(temp) )
^
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 164, Column = 40
A subscript must be a scalar integer expression.
id = j_lkup_p( 1 + my_floor(temp) )
^
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 182, Column = 42
A subscript must be a scalar integer expression.
Cray Fortran : Version 8.3.7 (u83058f83196i83174p83310a83009e83011z83310)
Cray Fortran : (x8318r83015w83011t8311b83037)
Cray Fortran : Tue May 12, 2015 13:30:49
Cray Fortran : Compile time: 0.0320 seconds
Cray Fortran : 202 source lines
Cray Fortran : 5 errors, 0 warnings, 0 other messages, 0 ansi
Cray Fortran : "explain ftn-message number" gives more information about each message.
fcm_internal compile failed (256)
# Time taken: 0 s=> ftn -o locate_hdps_mod.o -I/home/n02/n02/scottyiu/um/xlchy/umatmos/inc -I/home/n02/n02/scottyiu/um/xlchy/baserepos/JULES/inc -I/home/n02/n02/scottyiu/um/xlchy/baserepos/UMATMOS/inc -e m -h noomp -s real64 -s integer64 -hflex_mp=intolerant -I /work/n02/n02/hum/gcom/cce8.3.7/gcom5.1/archer_cce_mpp/inc -h omp -c /home/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90
gmake: *** [locate_hdps_mod.o] Error 1
# Time taken: 3356 s=> gmake -f /home/n02/n02/scottyiu/um/xlchy/umatmos/Makefile -j 1 all
gmake -f /home/n02/n02/scottyiu/um/xlchy/umatmos/Makefile -j 1 all failed (2) at /fs2/n02/n02/hum/software/fcm-2015.03.0/bin/../lib/FCM1/Build.pm line 611
cd /home2/n02/n02/scottyiu
Build failed on Tue May 12 13:30:49 2015.
->Make: 3356 seconds
->TOTAL: 3763 seconds
UMATMOS build failed
It seems that there are still some problems with gcom? I have attached the .comp.leave file for your reference.
Thank you very much.
Best regards,
Scott
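Because a .comp.leave file can be very large, it may help to reduce it to a list of compiler errors. Below is a minimal Python sketch; `summarize_errors` is a hypothetical name, the regular expression is inferred only from the crayftn messages quoted in this comment, and the file path in the sample is shortened for the example.

```python
import re
from collections import Counter

# Assumed message layout, taken from the crayftn errors quoted in this ticket.
ERROR = re.compile(
    r"(?P<code>ftn-\d+) crayftn: ERROR (?P<unit>\w+), "
    r"File = (?P<file>[^,\s]+), Line = (?P<line>\d+), Column = (?P<col>\d+)"
)


def summarize_errors(log_text):
    """Return (message code, program unit, line, column) for each error."""
    return [
        (m.group("code"), m.group("unit"), int(m.group("line")), int(m.group("col")))
        for m in ERROR.finditer(log_text)
    ]


# Two of the five errors from the log, with the path shortened.
sample = (
    "MODULE locate_hdps_mod ^ ftn-855 crayftn: ERROR LOCATE_HDPS_MOD, "
    "File = ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, "
    "Line = 9, Column = 8 "
    "id = i_lkup_u( my_floor(temp) ) ^ ftn-319 crayftn: ERROR LOCATE_HDPS, "
    "File = ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, "
    "Line = 124, Column = 40 "
    "A subscript must be a scalar integer expression."
)

errors = summarize_errors(sample)
for code, unit, line, col in errors:
    print(f"{code} in {unit} at line {line}, column {col}")
print(Counter(code for code, *_ in errors))
```

Grouping by message code makes it easier to see that the failures here come from one source file and one kind of error (a non-integer array subscript), rather than from gcom.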
Changed 6 years ago by scottyiu
The beginning of the .comp.leave file referenced in message 19 of #1485
comment:20 Changed 6 years ago by scottyiu
Dear Grenville
I cannot attach the whole .comp.leave file as it is too big, however, I have attached the beginning and end of the .comp.leave file.
Thank you very much.
Best regards,
Scott
comment:21 Changed 6 years ago by scottyiu
- priority changed from normal to high
comment:22 Changed 6 years ago by grenville
Scott
Using the latest compiler exposed some poor programming in the UM.
I copied your job xlchy as xjfvy.
I fixed the code problem in my working copy of the vn8.4_ncas branch - the model built and ran OK.
Please test my change by using my working copy of the vn8.4_ncas branch in your job (see FCM Configuration → FCM Options for Atmosphere and ….). Note that I switched off the ncas branch in the User modifications table and included /home/grenville/um_vn8.4/vn8.4_ncas/src in the User working copy location.
Grenville
comment:23 Changed 6 years ago by scottyiu
Dear Grenville
Thank you for your reply. I will try running the job and get back to you.
Best regards,
Scott
comment:24 Changed 6 years ago by grenville
Scott
You need to rebuild the model with my working copy of vn8.4_ncas.
Grenville
comment:25 Changed 6 years ago by scottyiu
Dear Grenville
I guess that means a full rebuild and compile? In that case, I have already checked that option in the run.
Thank you very much.
Best regards,
Scott
comment:26 Changed 6 years ago by scottyiu
Dear Grenville
I have now run 1 year and 3 months and there was no problem. I think it is fixed!
Thank you very (very) much!
Best regards,
Scott
comment:27 Changed 6 years ago by scottyiu
- Resolution set to fixed
- Status changed from new to closed
I would like to add that I have already added more nodes (similar to ticket #1443) but the error is still there. Originally 12 x 12 processors, now 24 x 24.