Opened 5 years ago

Closed 5 years ago

#1485 closed help (fixed)

8.4 Run Crash Problem similar to ticket #1443

Reported by: scottyiu Owned by: um_support
Component: UM Model Keywords: memory, cce, hadgem3
Cc: Platform: ARCHER
UM Version: 8.4

Description

I have a similar problem to ticket #1443:

When running my 8.4 model, I get the messages:

[NID 04975] 2015-02-16 08:52:35 Apid 12964289: initiated application termination
[NID 04975] 2015-02-16 08:52:36 Apid 12964289: OOM killer terminated this process.
xkzeq: Run failed

I have already typed
chmod -R g+rX /home/n02/n02/scottyiu
chmod -R g+rX /work/n02/n02/scottyiu

I would like to ask if there is a similar solution for my 8.4 model?

Thank you very much.
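
For anyone checking whether a run failed in the same way, a minimal sketch follows. It assumes the job output has been saved to a text file; the function name `find_oom_kills` is hypothetical, and the search string is the message text quoted above.

```shell
# Hypothetical helper: list OOM-killer terminations recorded in a job output file.
# Usage: find_oom_kills path/to/job.leave
find_oom_kills() {
    local leave_file="$1"
    # Match the message text quoted in this ticket and print it with its line number.
    grep -n 'OOM killer terminated this process' "$leave_file"
}
```

The function prints nothing and returns a non-zero status when no such message is present, so it can also be used in an `if` test.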

Attachments (2)

xlchy.beginning.comp.leave (3.6 KB) - added by scottyiu 5 years ago.
The beginning of the .comp.leave file referenced in message 19 of #1485
xlchy.end.comp.leave (8.7 KB) - added by scottyiu 5 years ago.
The end of the .comp.leave file referenced in message 19 of #1485

Change History (29)

comment:1 Changed 5 years ago by scottyiu

I would like to add that I have already added more nodes (similar to ticket #1443) but the error is still there. Originally 12 x 12 processors, now 24 x 24.

comment:2 Changed 5 years ago by grenville

Scott

I think our fix in #1443 will work for you. It simply involves rebuilding the model with the Cray 8.3.3 compiler.

We are working on a system which will make this simple, but for now, it may be easy for you to make the changes needed - are you familiar with the FCM config files?

Grenville

comment:3 Changed 5 years ago by scottyiu

Dear Grenville

Thanks for the reply. I am not very familiar with the FCM config files but am happy to do it manually. Could you please provide me with the instructions needed to carry out the fix?

Thank you very much.

Best regards,
Scott

comment:4 Changed 5 years ago by grenville

Scott

Please look at xjfve for the changes you need.

You only need to make two changes. First, navigate to model selection → compilation and run options → UM user override files

enter the gcom_path variable as /work/n02/n02/wmcginty/gcom4.5

navigate to model selection → input/output control.. → user hand edits

and add

/home/willie/hand_edits/remove_loadcomp.ed

Then rebuild the model.

Grenville

comment:5 Changed 5 years ago by scottyiu

Dear Grenville

Thank you very much for the reply. I will update if this works once the rdf nerc disc is back up on the 21st of March.

Scott

comment:6 Changed 5 years ago by grenville

Scott

OK, but don't feel that you cannot run on ARCHER: /work is still available, and jobs that don't consume vast amounts of it should be fine.

Grenville

comment:7 Changed 5 years ago by scottyiu

Dear Grenville

Thanks for the reply. I have run the job, and it failed with this error message in the .leave file:

cray-mpich/7.0.3(30):ERROR:150: Module 'cray-mpich/7.0.3' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.0.3(30):ERROR:102: Tcl command execution failed: conflict cray-mpich

The job id is: xlchs

Could you please help me with this?

Thanks
Scott

comment:8 Changed 5 years ago by grenville

Scott

I'm not sure what's happening. I copied the PBS header information from /home/n02/n02/scottyiu/umui_runs/xlchs-086131815/umuisubmit_compile and submitted a job based on it. Here's what it says about the modules loaded:


* grenvill Job: 2813215.sdb started: 09/04/15 15:16:50 host: esPP002 *


/cm/local/apps/pbspro/var/spool/mom_priv/jobs/2813215.sdb.SC[40]: .[369]: .: line 76: HISTSIZE: is read only
Currently Loaded Modulefiles:

1) modules/3.2.6.7
2) eswrap/1.1.0-1.010400.915.0
3) switch/1.0-1.0501.47124.1.93.ari
4) craype-network-aries
5) craype/2.2.0
6) cce/8.3.3
7) cray-libsci/13.0.1
8) udreg/2.3.2-1.0501.7914.1.13.ari
9) ugni/5.0-1.0501.8253.10.22.ari

10) pmi/5.0.5-1.0000.10300.134.8.ari
11) dmapp/7.0.1-1.0501.8315.8.4.ari
12) gni-headers/3.0-1.0501.8317.12.1.ari
13) xpmem/0.1-2.0501.48424.3.3.ari
14) job/1.5.5-0.1_2.0501.48066.2.43.ari
15) csa/3.0.0-1_2.0501.47112.1.91.ari
16) dvs/2.4_0.9.0-1.0501.1672.2.122.ari
17) alps/5.1.1-2.0501.8507.1.1.ari
18) rca/1.0.0-2.0501.48090.7.46.ari
19) atp/1.7.5
20) PrgEnv-cray/5.1.29
21) pbs/12.2.401.141761
22) craype-ivybridge
23) cray-mpich/7.0.3
24) packages-archer
25) budgets/1.1
26) checkScript/1.1
27) checkQueue/1.0
28) checkDisk/1.0
29) bolt/0.6
30) serialJobs/1.0
31) python/2.7.6

which looks OK - your compile.leave file says cce8.2.1 is still loaded.

Please cd to /home/n02/n02/scottyiu/umui_runs/xlchs-086131815 and type

./umuisubmit_compile

and confirm that you have the same module list as I got (above).

Grenville

comment:9 Changed 5 years ago by scottyiu

Dear Grenville

Sorry for the late reply, I was on vacation for the last two weeks.

I logged in to ARCHER via ssh, changed directory to /home/n02/n02/scottyiu/umui_runs/xlchs-086131815 and ran ./umuisubmit_compile. This is the output:

cray-mpich/7.1.1(30):ERROR:150: Module 'cray-mpich/7.1.1' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.1.1(30):ERROR:102: Tcl command execution failed: conflict cray-mpich

Currently Loaded Modulefiles:
1) modules/3.2.10.2
2) eswrap/1.1.0-1.010400.915.0
3) switch/1.0-1.0501.47124.1.93.ari
4) craype-network-aries
5) craype/2.2.1
6) cce/8.2.1
7) cray-libsci/12.2.0
8) udreg/2.3.2-1.0501.7914.1.13.ari
9) ugni/5.0-1.0501.8253.10.22.ari
10) pmi/5.0.6-1.0000.10439.140.2.ari
11) dmapp/7.0.1-1.0501.8315.8.4.ari
12) gni-headers/3.0-1.0501.8317.12.1.ari
13) xpmem/0.1-2.0501.48424.3.3.ari
14) job/1.5.5-0.1_2.0501.48066.2.43.ari
15) csa/3.0.0-1_2.0501.47112.1.91.ari
16) dvs/2.4_0.9.0-1.0501.1672.2.122.ari
17) alps/5.1.1-2.0501.8507.1.1.ari
18) rca/1.0.0-2.0501.48090.7.46.ari
19) atp/1.7.5
20) PrgEnv-cray/5.1.29
21) pbs/12.2.401.141761
22) craype-ivybridge
23) cray-mpich/6.3.1
24) packages-archer
25) budgets/1.1
26) checkScript/1.1
27) checkQueue/1.0
28) checkDisk/1.0
29) bolt/0.6
30) serialJobs/1.0
31) python/2.7.6
32) tkdiff/4.2
33) nano/2.2.6
34) imagemagick/6.8.8-2
35) leave_time/1.0.0
36) quickstart/1.0
37) epcc-tools/1.4
38) stat/2.1.0.1
39) cray-netcdf/4.3.1

It seems the modules are different in our versions.

Thank you very much.
Scott
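
Comparing two long module listings by eye is error-prone. A minimal sketch of an automated comparison follows, assuming each listing has been saved to a text file in the format shown above. The function names are hypothetical, and the parsing assumes one numbered entry per line.

```shell
# Hypothetical helper: reduce saved "module list" output to sorted name/version entries.
normalise_modules() {
    # Keep only lines such as "6) cce/8.3.3" and strip the leading index.
    sed -n 's/^ *[0-9][0-9]*) *//p' "$1" | sort
}

# Print only the entries that differ between two saved listings.
# Lines starting with "<" come from the first file, ">" from the second.
diff_modules() {
    local a b
    a=$(mktemp) && b=$(mktemp) || return 1
    normalise_modules "$1" > "$a"
    normalise_modules "$2" > "$b"
    diff "$a" "$b" | grep '^[<>]'
    rm -f "$a" "$b"
}
```

A listing can usually be captured with `module list 2> mine.txt`, since on many installations the command writes to standard error (an assumption worth checking locally). Running `diff_modules reference.txt mine.txt` on the two listings in this ticket would report, among other entries, cce/8.3.3 against cce/8.2.1.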

comment:10 Changed 5 years ago by grenville

Scott

This says you are still loading cce/8.2.1.

We have refined this a little.

Please navigate to selection→compilation and run ..→UM user override files, add

~willie/overrides/use_cce_mygcom4.5_v2

to the list of user machine overrides (and remove the gcom_path variable that was set to /work/n02/n02/wmcginty/gcom4.5)

Please make the model perform a full build - go to FCM extract directories and output levels and check "Force full build".

Check the comp.leave file to see that cce8.3.3 was loaded.

Grenville

comment:11 Changed 5 years ago by grenville

Scott

Please don't do this yet - Cray have changed the default modules again, so I need to test this again first.

Grenville

comment:12 Changed 5 years ago by scottyiu

Dear Grenville

Thanks for the replies. The "Force full build" option is greyed out. Is there something I have to do to enable it?

Thank you.
Scott

comment:13 Changed 5 years ago by scottyiu

Dear Grenville

I have figured out the issue in my last comment. Please let me know when you have finished testing with the new Cray default modules.

Thank you!

Best regards,
Scott

comment:14 Changed 5 years ago by grenville

Scott

I have tested an 8.4 job built with cce8.3.7 and it worked OK. For reference, the job is xfjvz.

In order to build with cce8.3.7, you'll need to rebuild the model with the three changes as shown below:

1) include in the hand edits section -

/home/grenville/umui_jobs/hand_edits/remove_loadcomp.ed

2) add to the User paths overrides table in Compilation and Run Options → UM Override files -

%gcom_path /work/n02/n02/hum/gcom/cce8.3.7/gcom5.1 Y

In getting this working, we discovered a possible error in cnislt_distribute_poles_c96_1c.F90 - this can be avoided by

3) include

MPICH_NO_BUFFER_ALIAS_CHECK = 1

in the Scripts Inserts and Modifications section of the UMUI.

Please try this out on a small run first.

Please don't include overrides or hand edits mentioned in the earlier invalid solution to this problem.

Grenville
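
As an illustration of step 3 only: the variable name and value are taken from the comment above, while the `export` form is an assumption about how an entry in the Script Inserts and Modifications section ends up in the job script.

```shell
# Assumed shell equivalent of the Script Inserts entry from step 3.
# The variable name and value come from the ticket; the export form is illustrative.
export MPICH_NO_BUFFER_ALIAS_CHECK=1
```

Steps 1 and 2 are UMUI table entries and have no direct shell equivalent here.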

comment:15 Changed 5 years ago by scottyiu

Dear Grenville

Thank you for testing the job for me. I will now try submitting the job for a short run (2 days) and a 1 year run and will get back to you when the job finishes.

Thank you!

Best regards,
Scott

comment:16 Changed 5 years ago by scottyiu

Dear Grenville

I have already placed the three changes in and forced a full extract + full build. However, the .comp.leave file is still saying:

cray-mpich/7.1.1(30):ERROR:150: Module 'cray-mpich/7.1.1' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.1.1(30):ERROR:102: Tcl command execution failed: conflict cray-mpich

/home/n02/n02/scottyiu/umui_runs/xlchx-126163749/umuisubmit_compile[40]: .[369]: .: line 76: HISTSIZE: is read only
Currently Loaded Modulefiles:

1) modules/3.2.10.2
2) eswrap/1.1.0-1.010400.915.0
3) switch/1.0-1.0501.47124.1.93.ari
4) craype-network-aries
5) craype/2.2.1
6) cce/8.2.1
7) cray-libsci/12.2.0
8) udreg/2.3.2-1.0501.7914.1.13.ari
9) ugni/5.0-1.0501.8253.10.22.ari

10) pmi/5.0.6-1.0000.10439.140.2.ari
11) dmapp/7.0.1-1.0501.8315.8.4.ari
12) gni-headers/3.0-1.0501.8317.12.1.ari
13) xpmem/0.1-2.0501.48424.3.3.ari
14) job/1.5.5-0.1_2.0501.48066.2.43.ari
15) csa/3.0.0-1_2.0501.47112.1.91.ari
16) dvs/2.4_0.9.0-1.0501.1672.2.122.ari
17) alps/5.1.1-2.0501.8507.1.1.ari
18) rca/1.0.0-2.0501.48090.7.46.ari
19) atp/1.7.5
20) PrgEnv-cray/5.1.29
21) pbs/12.2.401.141761
22) craype-ivybridge
23) cray-mpich/6.3.1
24) packages-archer
25) budgets/1.1
26) checkScript/1.1
27) checkQueue/1.0
28) checkDisk/1.0
29) bolt/0.6
30) serialJobs/1.0
31) python/2.7.6
32) tkdiff/4.2
33) nano/2.2.6
34) imagemagick/6.8.8-2
35) leave_time/1.0.0
36) quickstart/1.0
37) epcc-tools/1.4
38) stat/2.1.0.1

The job id is xlchx

Thank you.

Scott

comment:17 Changed 5 years ago by grenville

Scott

Please remove

loadcomp $TARGET_MC

from your .profile and try again.

Grenville
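
One cautious way to make this change is sketched below. It comments the line out rather than deleting it, and keeps a backup copy so the change is easy to undo. The function name `disable_loadcomp` is hypothetical, and the pattern assumes the line appears in .profile exactly as quoted above, optionally indented.

```shell
# Hypothetical helper: comment out the "loadcomp $TARGET_MC" line in a profile file,
# keeping a backup copy alongside it as <file>.bak.
disable_loadcomp() {
    local profile="$1"
    cp "$profile" "$profile.bak" || return 1
    # Prefix the matching line with "# " and leave all other lines unchanged.
    sed 's/^\([[:space:]]*\)\(loadcomp[[:space:]]\{1,\}\$TARGET_MC.*\)$/\1# \2/' \
        "$profile.bak" > "$profile"
}
```

It would be called as `disable_loadcomp ~/.profile`. A new login shell is needed before the change takes effect.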

comment:18 Changed 5 years ago by scottyiu

Dear Grenville

Thank you for the reply. I will submit the job again and get back to you once it has run.

Scott

comment:19 Changed 5 years ago by scottyiu

Dear Grenville

I have submitted the job. It still cannot get past the compile stage (there is only a .comp.leave file, no .leave file). I have attached the .comp.leave file.

It says at the end of the .comp.leave file:

MODULE locate_hdps_mod
       ^               
ftn-855 crayftn: ERROR LOCATE_HDPS_MOD, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 9, Column = 8 
  The compiler has detected errors in module "LOCATE_HDPS_MOD".  No module information file will be created for this module.

           id              = i_lkup_u( my_floor(temp) )
                                       ^                
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 124, Column = 40 
  A subscript must be a scalar integer expression.

           id              = i_lkup_p( 1 + my_floor(temp) )
                                         ^                  
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 142, Column = 42 
  A subscript must be a scalar integer expression.

           id              = j_lkup_v( my_floor(temp) )
                                       ^                
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 164, Column = 40 
  A subscript must be a scalar integer expression.

           id              = j_lkup_p( 1 + my_floor(temp) )
                                         ^                  
ftn-319 crayftn: ERROR LOCATE_HDPS, File = ../../../../../../../../home2/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90, Line = 182, Column = 42 
  A subscript must be a scalar integer expression.

Cray Fortran : Version 8.3.7 (u83058f83196i83174p83310a83009e83011z83310)
Cray Fortran :               (x8318r83015w83011t8311b83037)
Cray Fortran : Tue May 12, 2015  13:30:49
Cray Fortran : Compile time:  0.0320 seconds
Cray Fortran : 202 source lines
Cray Fortran : 5 errors, 0 warnings, 0 other messages, 0 ansi
Cray Fortran : "explain ftn-message number" gives more information about each message.
fcm_internal compile failed (256)
# Time taken:            0 s=> ftn -o locate_hdps_mod.o -I/home/n02/n02/scottyiu/um/xlchy/umatmos/inc -I/home/n02/n02/scottyiu/um/xlchy/baserepos/JULES/inc -I/home/n02/n02/scottyiu/um/xlchy/baserepos/UMATMOS/inc -e m -h noomp -s real64 -s integer64 -hflex_mp=intolerant -I /work/n02/n02/hum/gcom/cce8.3.7/gcom5.1/archer_cce_mpp/inc     -h omp -c /home/n02/n02/scottyiu/um/xlchy/umatmos/ppsrc/UM/atmosphere/dynamics_advection/locate_hdps.f90
gmake: *** [locate_hdps_mod.o] Error 1
# Time taken:         3356 s=> gmake -f /home/n02/n02/scottyiu/um/xlchy/umatmos/Makefile -j 1 all
gmake -f /home/n02/n02/scottyiu/um/xlchy/umatmos/Makefile -j 1 all failed (2) at /fs2/n02/n02/hum/software/fcm-2015.03.0/bin/../lib/FCM1/Build.pm line 611
cd /home2/n02/n02/scottyiu
Build failed on Tue May 12 13:30:49 2015.
->Make: 3356 seconds
->TOTAL: 3763 seconds
UMATMOS build failed

It seems that there are still some problems with GCOM? I have attached the .comp.leave file for your reference.

Thank you very much.

Best regards,
Scott
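
Because the full .comp.leave file is large, a short summary of the compiler errors can make it easier to triage. A minimal sketch follows, assuming error lines begin with a code such as "ftn-319 crayftn: ERROR" as in the excerpt above. The function name is hypothetical.

```shell
# Hypothetical helper: count Cray Fortran error codes in a compile output file.
# Usage: summarise_ftn_errors path/to/job.comp.leave
summarise_ftn_errors() {
    # Pull out the leading code (for example "ftn-319") from each error line,
    # then count how many times each code occurs.
    grep -o '^ftn-[0-9]* crayftn: ERROR' "$1" | cut -d' ' -f1 | sort | uniq -c
}
```

For the excerpt above this would count four ftn-319 errors and one ftn-855 error, which matches the "5 errors" total reported by the compiler.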

comment:20 Changed 5 years ago by scottyiu

Dear Grenville

I cannot attach the whole .comp.leave file as it is too big, however, I have attached the beginning and end of the .comp.leave file.

Thank you very much.

Best regards,
Scott
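
Splitting off the two ends of an oversized log can be done with `head` and `tail`. The sketch below is one way to do it. The function name, the output file suffixes and the default of 200 lines are all hypothetical choices, not the method used for the attachments on this ticket.

```shell
# Hypothetical helper: save the first and last N lines of a large file
# as <file>.beginning and <file>.end (N defaults to 200).
split_ends() {
    local file="$1"
    local n="${2:-200}"
    head -n "$n" "$file" > "$file.beginning"
    tail -n "$n" "$file" > "$file.end"
}
```

For a build log, the end of the file usually matters most because that is where the first fatal compiler error and the build summary appear.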

comment:21 Changed 5 years ago by scottyiu

  • priority changed from normal to high

comment:22 Changed 5 years ago by grenville

Scott

Using the latest compiler exposed some poor programming in the UM.

I copied your job xlchy as xjfvy.

I fixed the code problem in my working copy of the vn8.4_ncas branch - the model built and ran OK.

Please test my change by using my working copy of the vn8.4_ncas branch in your job (see FCM Configuration → FCM Options for Atmosphere and …). Note that I switched off the ncas branch in the User modifications table and entered /home/grenville/um_vn8.4/vn8.4_ncas/src as the User working copy location.

Grenville

comment:23 Changed 5 years ago by scottyiu

Dear Grenville

Thank you for your reply. I will try running the job and get back to you.

Best regards,
Scott

comment:24 Changed 5 years ago by grenville

Scott

You need to rebuild the model with my working copy of vn8.4_ncas.

Grenville

comment:25 Changed 5 years ago by scottyiu

Dear Grenville

I guess that means a full rebuild and compile? In that case, I have already checked that option in the run.

Thank you very much.

Best regards,
Scott

comment:26 Changed 5 years ago by scottyiu

Dear Grenville

I have now run 1 year and 3 months and there was no problem. I think it is fixed!

Thank you very (very) much!

Best regards,
Scott

comment:27 Changed 5 years ago by scottyiu

  • Resolution set to fixed
  • Status changed from new to closed