Opened 6 years ago
Closed 5 years ago
#1357 closed help (wontfix)
Problems with CRUN (and climate meaning) for diagnostics s30i310-316 (EP flux diagnostics)
Reported by: | luke | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | CRUN, EP flux diagnostics |
Cc: | | Platform: | ARCHER |
UM Version: | 8.4 | | |
Description
I am porting MONSooN job xkawa across to ARCHER, basing it on Karthee's port (of my similar xjcim job) xjnjb. Further information on these jobs can be found here:
http://www.ukca.ac.uk/wiki/index.php/Vn8.4_GA4.0_Release_Candidate:_xkawa
While xkawa ran fine for 10 years on MONSooN, xkawe (the ARCHER port) has had a few problems. Most of these I have been able to solve, but the latest is more complicated.
I had originally noticed a few days ago that I had problems with climate meaning when sending diagnostics s30i310-316 to the UPMEAN stream (they were also being calculated daily and sent to the UPA stream). The setup for xkawa is:
30 310 RESIDUAL MN MERID. CIRC. VSTARBAR    TDAYM   DP36CCM UPA    Y + Y
30 310 RESIDUAL MN MERID. CIRC. VSTARBAR    TDMPMN  DP36CCM UPMEAN Y + Y
30 311 RESIDUAL MN MERID. CIRC. WSTARBAR    TDAYM   DP36CCM UPA    Y + Y
30 311 RESIDUAL MN MERID. CIRC. WSTARBAR    TDMPMN  DP36CCM UPMEAN Y + Y
30 312 ELIASSEN-PALM FLUX (MERID. COMPNT)   TDAYM   DP36CCM UPA    Y + Y
30 312 ELIASSEN-PALM FLUX (MERID. COMPNT)   TDMPMN  DP36CCM UPMEAN Y + Y
30 313 ELIASSEN-PALM FLUX (VERT. COMPNT)    TDAYM   DP36CCM UPA    Y + Y
30 313 ELIASSEN-PALM FLUX (VERT. COMPNT)    TDMPMN  DP36CCM UPMEAN Y + Y
30 314 DIVERGENCE OF ELIASSEN-PALM FLUX     TDAYM   DP36CCM UPA    Y + Y
30 314 DIVERGENCE OF ELIASSEN-PALM FLUX     TDMPMN  DP36CCM UPMEAN Y + Y
30 315 MERIDIONAL HEAT FLUX                 TDAYM   DP36CCM UPA    Y + Y
30 315 MERIDIONAL HEAT FLUX                 TDMPMN  DP36CCM UPMEAN Y + Y
30 316 MERIDIONAL MOMENTUM FLUX             TDAYM   DP36CCM UPA    Y + Y
30 316 MERIDIONAL MOMENTUM FLUX             TDMPMN  DP36CCM UPMEAN Y + Y
However, on ARCHER I had the following traceback from ATP
Application 9859855 is crashing. ATP analysis proceeding...
ATP Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:226
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:3730
  meanctl_@meanctl.f90:3631
  acumps_@acumps.f90:1475
  general_scatter_field_@general_scatter_field.f90:1098
  stash_scatter_field_@stash_scatter_field.f90:955
  gcg_ralltoalle_@gcg_ralltoalle.f90:180
  gcg__ralltoalle_multi_@gcg_ralltoalle_multi.f90:335
ATP Stack walkback for Rank 0 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 0, 1, 12, 26, 118
(see /home/n02/n02/luke/output/xkawe000.xkawe.d14234.t121221.leave)
I then switched these diagnostics to the UPB stream using TMONMN. This allowed the job to complete an NRUN step quite happily.
However, when I then switch the model to a CRUN step, it hangs. The ATP traceback here is:
=>> PBS: job killed: walltime 839 exceeded limit 800
aprun: Apid 9883974: Caught signal Terminated, sending to application
Application 9883974 is crashing. ATP analysis proceeding...
/home/n02/n02/luke/umui_runs/xkawe-240101650/umuisubmit_run[349]: .: line 265: 15667: Terminated
ATP Stack walkback for Rank 72 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:2051
  initial_@initial.f90:2610
  initdump_@initdump.f90:3502
  um_readdump_@um_readdump.f90:954
  um_read_multi_@um_read_multi.f90:560
  general_scatter_field_@general_scatter_field.f90:1098
  stash_scatter_field_@stash_scatter_field.f90:955
  gcg_ralltoalle_@gcg_ralltoalle.f90:180
  gcg__ralltoalle_multi_@gcg_ralltoalle_multi.f90:398
  mpl_waitall_@mpl_waitall.f90:48
  MPI_WAITALL@0x18a549a
  PMPI_Waitall@0x18d3717
  MPIR_Waitall_impl@0x18d320a
  MPIDI_CH3I_Progress@0x18fd877
  MPID_nem_gni_poll@0x1913631
  MPID_nem_gni_check_localCQ@0x1912351
  GNI_CqGetEvent@0x19c8302
ATP Stack walkback for Rank 72 done
Process died with signal 15: 'Terminated'
Forcing core dumps of ranks 72, 0, 1, 29, 30, 33, 36, 49, 22, 73
--------------------------------------------------------------------------------
Resources requested: ncpus=288,place=free,walltime=00:13:20
Resources allocated: cpupercent=0,cput=00:00:02,mem=18312kb,ncpus=288,vmem=309276kb,walltime=00:13:59
(see /home/n02/n02/luke/output/xkawe000.xkawe.d14240.t101653.leave)
This looks very similar to the error above. I then began playing around with the job settings (dumping every 24 timesteps to maintain climate meaning over a day, and a 2-day job with a 1-day jobstep) and have turned these diagnostics on and off. It doesn't matter which stream they go to; the error remains the same. Without them included, the job CRUNs.
Due to the usefulness of these diagnostics, I would like to release a job with them included. Any advice as to how to proceed, or thoughts as to what might be going on, would be greatly appreciated!
Many thanks,
Luke
Change History (13)
comment:1 Changed 6 years ago by simon
Hi Luke,
There are a couple of things you could try, but they're both clutching at straws.
1) In um_shell, there's a call to gc_setopt which sets gc_alltoall_version to gc_alltoall_multi; try changing this to gc_alltoall_orig. This then uses the old version of the alltoall code in gcom.
2) Try changing the value of the GCOM collectives limit from 1 to a number larger than the number of PEs you are running on. This also changes the scatter/gather method. Be warned that this could slow your job to a crawl and therefore won't be much use.
Or the fields could be full of junk, which is causing issues. Or there's something odd with the pre-STASH file for the diags, which means there is inconsistent metadata for the fields inside the model.
Simon.
comment:2 Changed 6 years ago by luke
Hi Simon,
I tried testing these, but neither worked:
1) Tested with job xkjgc. CRUN ATP traceback is:
=>> PBS: job killed: walltime 847 exceeded limit 800
aprun: Apid 9902441: Caught signal Terminated, sending to application
Application 9902441 is crashing. ATP analysis proceeding...
/home/n02/n02/luke/umui_runs/xkjgc-241200532/umuisubmit_run[349]: .: line 265: 3749: Terminated
ATP Stack walkback for Rank 72 starting:
  [empty]@0xffffffffffffffff
  gcg_ralltoalle_@gcg_ralltoalle.f90:175
  gcg__ralltoalle_@gcg_ralltoalle.f90:367
ATP Stack walkback for Rank 72 done
Process died with signal 15: 'Terminated'
Forcing core dumps of ranks 72, 5, 12, 15, 31, 34, 86, 0, 1, 89, 91, 137
--------------------------------------------------------------------------------
2) Tested with xkjgb, where GCOM collectives limit was set to 288. Job only uses 144 cores on 12 nodes. CRUN ATP traceback is:
=>> PBS: job killed: walltime 884 exceeded limit 800
aprun: Apid 9900309: Caught signal Terminated, sending to application
Application 9900309 is crashing. ATP analysis proceeding...
/home/n02/n02/luke/umui_runs/xkjgb-241170307/umuisubmit_run[349]: .: line 265: 29074: Terminated
ATP Stack walkback for Rank 60 starting:
  [empty]@0x7ffff3f86c6f
  GNI_CqGetEvent@0x19c83b1
ATP Stack walkback for Rank 60 done
Process died with signal 15: 'Terminated'
Forcing core dumps of ranks 60, 24, 0, 2, 3, 4, 63, 66, 83, 67, 142
--------------------------------------------------------------------------------
Looking at the output of these diagnostics from my xkawe run, I can't see any non-sensible values.
These diagnostics are calculated in calc_div_ep_flux_mod, and passed back to eot_diag. Some fields are also calculated in st_diag3. I will try reducing the optimisation on all of these routines and see if that does anything.
This job uses a pre-compiled build, so calc_div_ep_flux_mod and eot_diag have already been built. I have noticed that st_diag3 links with gcom.
Thanks,
Luke
comment:3 Changed 6 years ago by luke
Using the following compiler override:
bld::tool::fflags::UM::atmosphere::climate_diagnostics::eot_diag %fflags64_mpp -O0
bld::tool::fflags::UM::atmosphere::climate_diagnostics::calc_div_ep_flux %fflags64_mpp -O0
bld::tool::fflags::UM::control::stash::st_diag3 %fflags64_mpp -I /work/n02/n02/hum/gcom/cce/gcom4.5/archer_cce_mpp/inc -h noomp -O0
I now get
=>> PBS: job killed: walltime 841 exceeded limit 800
aprun: Apid 9914828: Caught signal Terminated, sending to application
Application 9914828 is crashing. ATP analysis proceeding...
/home/n02/n02/luke/umui_runs/xkjgd-244123400/umuisubmit_run[349]: .: line 265: 9144: Terminated
ATP Stack walkback for Rank 24 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:2051
  initial_@initial.f90:2610
  initdump_@initdump.f90:3502
  um_readdump_@um_readdump.f90:954
  um_read_multi_@um_read_multi.f90:560
  general_scatter_field_@general_scatter_field.f90:1098
  stash_scatter_field_@stash_scatter_field.f90:955
  gcg_ralltoalle_@gcg_ralltoalle.f90:180
  gcg__ralltoalle_multi_@gcg_ralltoalle_multi.f90:398
  mpl_waitall_@mpl_waitall.f90:48
  MPI_WAITALL@0x18a0a9a
  PMPI_Waitall@0x18ced17
  MPIR_Waitall_impl@0x18ce80a
  MPIDI_CH3I_Progress@0x18f8d14
ATP Stack walkback for Rank 24 done
Process died with signal 15: 'Terminated'
Forcing core dumps of ranks 24, 0, 3, 12, 19, 48, 58, 102, 47, 79
so it seems to have an issue when reading the dump. Running cumf on the final dump doesn't indicate a problem with any fields:
[14:02:50 luke@eslogin004 xkjgd]$ /work/n02/n02/hum/vn8.4/cce/utils/cumf xkjgda.da20081202_00 xkjgda.da20081202_00
CUMF successful
Summary in: /work/n02/n02/luke/tmp/tmp.eslogin004.1376/cumf_summ.luke.d14244.t140314.40923
Full output in: /work/n02/n02/luke/tmp/tmp.eslogin004.1376/cumf_full.luke.d14244.t140314.40923
Difference maps (if available) in: /work/n02/n02/luke/tmp/tmp.eslogin004.1376/cumf_diff.luke.d14244.t140314.40923
COMPARE - SUMMARY MODE
-----------------------
Number of fields in file 1 = 47392
Number of fields in file 2 = 47392
Number of fields compared  = 47392
FIXED LENGTH HEADER:       Number of differences = 0
INTEGER HEADER:            Number of differences = 0
REAL HEADER:               Number of differences = 0
LEVEL DEPENDENT CONSTANTS: Number of differences = 0
LOOKUP:                    Number of differences = 0
DATA FIELDS:               Number of fields with differences = 0
files compare, ignoring Fixed Length Header
I'm not sure what else to try here.
Thanks,
Luke
comment:4 Changed 6 years ago by simon
Hi Luke,
I've had a look at the fields in the dump and the problem fields appear to be zonal fields. The STASHmaster has grid type 14 for these fields, which indicates they are zonal at u points, which I assume is correct. In the UMUI the DP36CCM domain you're using is for full-field data. There appears to be a DP36CCMZ domain which is configured for zonal data; maybe this'll work…
Having said that, there appears to be a routine called scatter_zonal_field.F90, but this is only called for zonal data at t points from stash_scatter_field.F90, so perhaps the case of zonal u points hasn't been coded up yet.
Simon.
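A schematic of the branching described above (paraphrased for illustration only; this is not the real stash_scatter_field code, and the names and the grid-type code other than 14 are stand-ins):

PROGRAM grid_branch_sketch
  ! Illustration only: zonal data at t points has a dedicated scatter routine,
  ! while grid type 14 (zonal at u points, per the STASHmaster entry discussed
  ! above) falls through to the general full-field scatter path - the path the
  ! failing tracebacks go down (general_scatter_field -> stash_scatter_field
  ! -> gcg_ralltoalle).
  IMPLICIT NONE
  INTEGER, PARAMETER :: zonal_u_points = 14   ! from the STASHmaster, as above
  INTEGER, PARAMETER :: zonal_t_points = 13   ! stand-in code for illustration
  INTEGER :: grid_type

  grid_type = zonal_u_points
  IF (grid_type == zonal_t_points) THEN
    WRITE(6,*) 'dedicated zonal scatter (scatter_zonal_field) would be used'
  ELSE
    WRITE(6,*) 'general full-field scatter path would be used'
  END IF
END PROGRAM grid_branch_sketch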
comment:5 Changed 6 years ago by luke
Hi Simon,
Thanks for taking a look at this.
I had looked at this as well. When I set the fields to use this domain, I get the following warning from STASH:
Diag: "RESIDUAL MN MERID. CIRC. VSTARBAR " (30,310) (TDAYM,DP36CCMZ,UPA) DOMAIN PROF ERROR: Requested zonal mean, but there is no dimension to mean
which makes sense, as the 'full' field is actually already zonal - the zonal mean in STASH takes a 3D field and zonally means it. When running on MONSooN (with DP36CCM), the output is zonal already.
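As a toy illustration of that operation (not STASH code; the grid sizes are arbitrary):

PROGRAM zonal_mean_sketch
  ! Toy sketch: a zonal mean collapses the longitude dimension of a 3D
  ! (lon, lat, level) field, leaving (lat, level). Fields such as 30,310-316
  ! are already (lat, level), so there is no longitude dimension left to
  ! mean - hence the DOMAIN PROF ERROR above.
  IMPLICIT NONE
  INTEGER, PARAMETER :: nlon = 192, nlat = 145, nlev = 36   ! arbitrary sizes
  REAL :: field(nlon, nlat, nlev)
  REAL :: zmean(nlat, nlev)

  CALL RANDOM_NUMBER(field)              ! stand-in for a full 3D diagnostic
  zmean = SUM(field, DIM=1) / REAL(nlon) ! average over longitude
  WRITE(6,*) 'zonal-mean shape:', SHAPE(zmean)
END PROGRAM zonal_mean_sketch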
However, I never tried running with this to see what would happen. I'll do that now.
Thanks,
Luke
comment:6 Changed 6 years ago by luke
…and indeed it doesn't run:
mkdir:: File exists
Application 9916270 is crashing. ATP analysis proceeding...
ATP Stack walkback for Rank 135 starting:
  start_thread@pthread_create.c:301
  _new_slave_entry@0x1825f9b
  gcg_r2darrsum__cray$mt$p0001@gcg_r2darrsum.f90:190
ATP Stack walkback for Rank 135 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 135, 0, 1, 48
View application merged backtrace tree with: statview atpMergedBT.dot
You may need to: module load stat
_pmiu_daemon(SIGCHLD): [NID 01655] [c0-1c1s13n3] [Mon Sep 1 16:33:02 2014] PE RANK 120 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01650] [c0-1c1s12n2] [Mon Sep 1 16:33:02 2014] PE RANK 60 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01652] [c0-1c1s13n0] [Mon Sep 1 16:33:02 2014] PE RANK 84 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01651] [c0-1c1s12n3] [Mon Sep 1 16:33:02 2014] PE RANK 72 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01654] [c0-1c1s13n2] [Mon Sep 1 16:33:02 2014] PE RANK 108 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01656] [c0-1c1s14n0] [Mon Sep 1 16:33:02 2014] PE RANK 132 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01648] [c0-1c1s12n0] [Mon Sep 1 16:33:02 2014] PE RANK 36 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01647] [c0-1c1s11n3] [Mon Sep 1 16:33:02 2014] PE RANK 24 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01653] [c0-1c1s13n1] [Mon Sep 1 16:33:02 2014] PE RANK 96 exit signal Killed
[NID 01651] 2014-09-01 16:33:02 Apid 9916270: initiated application termination
xkjgd: Run failed
Thanks,
Luke
comment:7 Changed 6 years ago by grenville
Luke
Does your code have these:
I've tracked the issue down to a bug in scatter_field_gcom.F90. There is a pointer
INTEGER, POINTER :: send_map(:,:) => NULL()
which should have a SAVE attribute:
INTEGER, SAVE, POINTER :: send_map(:,:) => NULL()
and
The problem was in stash_gather_field - the declaration of receive_map needs a SAVE attribute:
INTEGER, SAVE, POINTER :: receive_map(:,:) => NULL()
Grenville
[I don't know why the smiley faces have appeared.]
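To illustrate the pattern behind these fixes: the scatter/gather map is built on the first call and reused on later calls, so the pointer must persist between calls. A minimal, self-contained sketch of that pattern (illustrative only; the names and values are made up and this is not the UM routine):

PROGRAM save_pointer_sketch
  ! Minimal sketch (not the UM code) of a map that is built on the first call
  ! and reused on subsequent calls.
  IMPLICIT NONE
  CALL use_map(4)
  CALL use_map(4)   ! the second call must see the map built by the first
CONTAINS
  SUBROUTINE use_map(n)
    INTEGER, INTENT(IN) :: n
    ! As in the fixes above, the SAVE attribute guarantees that send_map (and
    ! the data it points to) persists between calls to this routine.
    INTEGER, SAVE, POINTER :: send_map(:) => NULL()
    INTEGER :: i
    IF (.NOT. ASSOCIATED(send_map)) THEN
      ALLOCATE(send_map(n))
      send_map = [(10*i, i = 1, n)]   ! stand-in for the real map set-up
      WRITE(6,*) 'built map: ', send_map
    ELSE
      WRITE(6,*) 'reused map:', send_map
    END IF
  END SUBROUTINE use_map
END PROGRAM save_pointer_sketch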
comment:8 Changed 6 years ago by grenville
The smilies should be colon right parens of course.
G
comment:9 Changed 6 years ago by grenville
I see that you have the first fix but not the second. Your ncas branch isn't the latest - we advise against using revision numbers with the ncas branch.
G
comment:10 Changed 6 years ago by luke
Hi Grenville,
Unfortunately this didn't solve the problem, and there are other issues with this particular model to do with the polar rows. I have a fix for the latter, and so I will test this again with the new setup.
Thanks,
Luke
comment:11 Changed 6 years ago by luke
Hi Grenville,
I have tested this with a non-UKCA job xkolb, and this error still occurs, so it is not UKCA related, and is not associated with the problems that I have seen on the polar rows. Any more advice would be greatly appreciated.
Many thanks,
Luke
comment:12 Changed 6 years ago by luke
I have run a series of tests on ARCHER, with the following results:
All jobs are 6-day AMIP GA4.0 runs (with a job similar to amche) with daily dumps. The jobs are set up to run in two job-steps of 3 days each: the first 3 days as an NRUN and the final 3 as a CRUN. The jobs only output the following diagnostics (except for 1 test):
30 201 U COMPNT OF WIND ON P LEV/UV GRID
30 202 V COMPNT OF WIND ON P LEV/UV GRID
30 203 W COMPNT OF WIND ON P LEV/UV GRID
30 204 TEMPERATURE ON P LEV/UV GRID
30 301 HEAVYSIDE FN ON P LEV/UV GRID
30 310 RESIDUAL MN MERID. CIRC. VSTARBAR
30 311 RESIDUAL MN MERID. CIRC. WSTARBAR
30 312 ELIASSEN-PALM FLUX (MERID. COMPNT)
30 313 ELIASSEN-PALM FLUX (VERT. COMPNT)
30 314 DIVERGENCE OF ELIASSEN-PALM FLUX
30 315 MERIDIONAL HEAT FLUX
30 316 MERIDIONAL MOMENTUM FLUX
- xkolc: diagnostics sent to climate meaning (i.e. 3-day files) with the domain set to DP36CCM. The job times out on NRUN with the following:
ATP Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:3597
  meanctl_@meanctl.f90:3628
  acumps_@acumps.f90:1292
  general_gather_field_@general_gather_field.f90:1228
  stash_gather_field_@stash_gather_field.f90:1078
  gcg_ralltoalle_@gcg_ralltoalle.f90:180
  gcg__ralltoalle_multi_@gcg_ralltoalle_multi.f90:398
  mpl_waitall_@mpl_waitall.f90:48
  MPI_WAITALL@0x161abfa
  PMPI_Waitall@0x1648de7
  MPIR_Waitall_impl@0x16488da
  MPIDI_CH3I_Progress@0x16710b7
  MPID_nem_gni_poll@0x1686f80
  GNI_CqGetEvent@0x173a1ba
ATP Stack walkback for Rank 0 done
Process died with signal 15: 'Terminated'
- xkold: same as xkolc but with the fields sent to UPA (a daily stream) with TDAYM. NRUN completes OK, but CRUN times out with:
ATP Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:2048
  initial_@initial.f90:2607
  initdump_@initdump.f90:3499
  um_readdump_@um_readdump.f90:954
  um_read_multi_@um_read_multi.f90:579
  gc_gsync_@gc_gsync.f90:135
  mpl_barrier_@mpl_barrier.f90:43
  pmpi_barrier@0x161ae3a
  PMPI_Barrier@0x161eca5
  MPIR_Barrier_impl@0x161e928
  MPIR_Barrier_or_coll_fn@0x161e47c
  MPIR_Barrier_intra@0x161e391
  MPIC_Sendrecv_ft@0x164b289
  MPIC_Sendrecv@0x164a75f
  MPIC_Wait@0x164a5ee
  MPIDI_CH3I_Progress@0x16710b7
  MPID_nem_gni_poll@0x1686e71
  MPID_nem_gni_check_localCQ@0x1685b91
  GNI_CqGetEvent@0x173a1df
  GNII_DlaProgress@0x173dad1
ATP Stack walkback for Rank 0 done
Process died with signal 15: 'Terminated'
- xkole: same as xkold but with 30,201-204 and 30,301 sent to DP36CCMZ (i.e. zonal mean; 30,310-316 are defined as zonal-mean fields and so shouldn't be zonally meaned). NRUN completes OK, but CRUN times out with:
ATP Stack walkback for Rank 24 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:2048
  initial_@initial.f90:2607
  initdump_@initdump.f90:3499
  um_readdump_@um_readdump.f90:954
  um_read_multi_@um_read_multi.f90:560
  general_scatter_field_@general_scatter_field.f90:1098
  stash_scatter_field_@stash_scatter_field.f90:955
  gcg_ralltoalle_@gcg_ralltoalle.f90:180
  gcg__ralltoalle_multi_@gcg_ralltoalle_multi.f90:398
  mpl_waitall_@mpl_waitall.f90:48
  MPI_WAITALL@0x161abfa
  PMPI_Waitall@0x1648de7
  MPIR_Waitall_impl@0x16488da
  MPIDI_CH3I_Progress@0x16710b7
  MPID_nem_gni_poll@0x1686e71
  MPID_nem_gni_check_localCQ@0x1685b91
  GNI_CqGetEvent@0x173a1df
  GNII_DlaProgress@0x173db04
ATP Stack walkback for Rank 24 done
Process died with signal 15: 'Terminated'
- xkolf: same as xkole but sent to T6HDAYM rather than TDAYM. NRUN completes OK, but CRUN times out with:
ATP Stack walkback for Rank 48 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:2048
  initial_@initial.f90:2607
  initdump_@initdump.f90:3499
  um_readdump_@um_readdump.f90:954
  um_read_multi_@um_read_multi.f90:560
  general_scatter_field_@general_scatter_field.f90:1098
  stash_scatter_field_@stash_scatter_field.f90:955
  gcg_ralltoalle_@gcg_ralltoalle.f90:180
  gcg__ralltoalle_multi_@gcg_ralltoalle_multi.f90:398
  mpl_waitall_@mpl_waitall.f90:48
  MPI_WAITALL@0x161abfa
  PMPI_Waitall@0x1648de7
  MPIR_Waitall_impl@0x16488da
  MPIDI_CH3I_Progress@0x16710b7
  MPID_nem_gni_poll@0x1686e71
  MPID_nem_gni_check_localCQ@0x1685b91
  GNI_CqGetEvent@0x173a217
ATP Stack walkback for Rank 48 done
Process died with signal 15: 'Terminated'
- xkolg: same as xkold (TDAYM, UPA) but with 30,310-316 removed. This job completes the NRUN and CRUN steps, so the issue is isolated to this set of diagnostics.
- xkolh: same as xkold, but with the diagnostics sent to T6H rather than TDAYM. This job completes the NRUN and CRUN steps. This means that the time meaning of the diagnostics (30,310-316) seems to be the issue, rather than simply outputting them.
The partial solution provided by xkolh (T6H) means that it is possible to get some output from these diagnostics, but it isn't ideal, as lots of extra output will be produced and more post-processing will be required.
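As an illustration of that extra post-processing step, a toy sketch of forming a daily mean from 6-hourly output offline (not UM or post-processing-suite code; it assumes four 6-hourly fields per day and arbitrary grid sizes, and exact equivalence to TDAYM depends on how T6H samples the field):

PROGRAM daily_mean_sketch
  ! Toy sketch: average four 6-hourly (lat, level) fields into a daily mean,
  ! approximating offline the daily meaning that TDAYM / climate meaning
  ! would otherwise have done inside the model.
  IMPLICIT NONE
  INTEGER, PARAMETER :: nlat = 145, nlev = 36, nt = 4   ! arbitrary sizes
  REAL :: six_hourly(nlat, nlev, nt)
  REAL :: daily_mean(nlat, nlev)

  CALL RANDOM_NUMBER(six_hourly)          ! stand-in for the T6H output fields
  daily_mean = SUM(six_hourly, DIM=3) / REAL(nt)
  WRITE(6,*) 'daily-mean shape:', SHAPE(daily_mean)
END PROGRAM daily_mean_sketch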
It's also a little worrying as it may be highlighting a more general issue that hasn't made itself evident before.
Any further advice would be welcomed.
Many thanks,
Luke
comment:13 Changed 5 years ago by luke
- Resolution set to wontfix
- Status changed from new to closed
It seems that this problem is impossible to solve, other than by using the T6H output option (xkolh) specified above.