Opened 18 months ago
Closed 18 months ago
#2490 closed help (fixed)
Wall time exceeded error in recon job
| Reported by: | amenon | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | recon, walltime exceeded |
| Cc: | | Platform: | ARCHER |
| UM Version: | 10.9 | | |
Description (last modified by willie)
Hi CMS,
This is a continuation of the ticket #2428. The reconfiguration job in u-ay368 (which is a copy of the suite u-av692 in ticket #2428, but with Andy Malcom's optimisation branch switched off) is still failing with the following error:
????????????????????????????????????????????????????????????????????????????????
=>> PBS: job killed: walltime 10890 exceeded limit 10800
aprun: Apid 31052932: Caught signal Terminated, sending to application
Terminated
Received signal TERM
/work/n02/n02/amenon/cylc-run/u-ay368/share/fcm_make/build-recon/bin/um-recon: line 118: 11850 Terminated rose mpi-launch -v $COMMAND
_pmiu_daemon(SIGCHLD): [NID 00226] [c1-0c0s8n2] [Tue Jun 5 16:20:53 2018] PE RANK 23 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00225] [c1-0c0s8n1] [Tue Jun 5 16:20:53 2018] PE RANK 4 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00246] [c1-0c0s13n2] [Tue Jun 5 16:20:53 2018] PE RANK 72 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00263] [c1-0c1s1n3] [Tue Jun 5 16:20:53 2018] PE RANK 120 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00260] [c1-0c1s1n0] [Tue Jun 5 16:20:53 2018] PE RANK 108 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00267] [c1-0c1s2n3] [Tue Jun 5 16:20:53 2018] PE RANK 132 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00316] [c1-0c1s15n0] [Tue Jun 5 16:20:53 2018] PE RANK 252 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00236] [c1-0c0s11n0] [Tue Jun 5 16:20:53 2018] PE RANK 60 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00305] [c1-0c1s12n1] [Tue Jun 5 16:20:53 2018] PE RANK 195 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00315] [c1-0c1s14n3] [Tue Jun 5 16:20:53 2018] PE RANK 240 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00313] [c1-0c1s14n1] [Tue Jun 5 16:20:53 2018] PE RANK 228 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00310] [c1-0c1s13n2] [Tue Jun 5 16:20:53 2018] PE RANK 204 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00273] [c1-0c1s4n1] [Tue Jun 5 16:20:53 2018] PE RANK 156 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00257] [c1-0c1s0n1] [Tue Jun 5 16:20:53 2018] PE RANK 84 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00235] [c1-0c0s10n3] [Tue Jun 5 16:20:53 2018] PE RANK 48 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00312] [c1-0c1s14n0] [Tue Jun 5 16:20:53 2018] PE RANK 216 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00301] [c1-0c1s11n1] [Tue Jun 5 16:20:53 2018] PE RANK 180 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00297] [c1-0c1s10n1] [Tue Jun 5 16:20:53 2018] PE RANK 168 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00259] [c1-0c1s0n3] [Tue Jun 5 16:20:53 2018] PE RANK 96 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00227] [c1-0c0s8n3] [Tue Jun 5 16:20:53 2018] PE RANK 24 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00272] [c1-0c1s4n0] [Tue Jun 5 16:20:53 2018] PE RANK 144 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 00234] [c1-0c0s10n2] [Tue Jun 5 16:20:53 2018] PE RANK 36 exit signal Terminated
cylc (scheduler - 2018-06-05T16:20:53Z): CRITICAL Task job script received signal TERM at 2018-06-05T16:20:53Z
cylc (scheduler - 2018-06-05T16:20:53Z): CRITICAL failed at 2018-06-05T16:20:53Z
I switched off the optimisation branch as a test to get past this error, but with no luck. Earlier, I also tried increasing the processors to 22x12, as Willie suggested, but that didn't work either. Could you please advise on any possible solutions?
Many thanks
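For anyone scanning a long job log for this kind of failure, here is a minimal Python sketch that pulls the used and allowed walltime out of the PBS kill message quoted above. The function name `walltime_overrun` is hypothetical and the regular expression is written only against the single message shown in this ticket, so other PBS versions may word it differently.

```python
import re

# Pattern written against the PBS kill message quoted in this ticket;
# it is an assumption that other logs use the same wording.
WALLTIME_RE = re.compile(r"walltime (\d+) exceeded limit (\d+)")


def walltime_overrun(log_text):
    """Return (used, limit) in seconds if the log reports a walltime kill, else None."""
    match = WALLTIME_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "=>> PBS: job killed: walltime 10890 exceeded limit 10800"
print(walltime_overrun(line))  # (10890, 10800)
```

The limit of 10800 seconds is the three-hour wallclock requested for the task, so the job was killed about 90 seconds after reaching it.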
Change History (15)
comment:1 Changed 18 months ago by grenville
comment:2 Changed 18 months ago by willie
- Description modified (diff)
comment:3 Changed 18 months ago by willie
Hi Arathy,
The problem is in the INCOMPASS_km4p4_RA1T_um_recon task. Although it has 'run' for three hours, absolutely nothing has been written out. This could be a problem with the dump that it is trying to reconfigure. I can't see where that file is; could you let me know?
Regards
Willie
comment:4 Changed 18 months ago by willie
OK, just found it in the RECONA file. I had a look and the dump is NaN free and readable in xconv.
Willie
comment:5 Changed 18 months ago by willie
The ancillary files listed in the SHARED file are also NaN free. I notice that the RA1T_astart file in /work/n02/n02/amenon/cylc-run/u-ay368/share/cycle/20160701T0000Z/INCOMPASS/km4p4/RA1T/ics has zero length.
Willie
comment:6 Changed 18 months ago by willie
OK, in SHARED we have, in the namelist recon_technical,
ainitial='/work/n02/n02/amenon/cylc-run/u-ay368/share/cycle/20160701T0000Z/glm/ics/glm_astart'
and in RECONA, we have
astart='/work/n02/n02/amenon/cylc-run/u-ay368/share/cycle/20160701T0000Z/INCOMPASS/km4p4/RA1T/ics/RA1T_astart'
and this is a zero length file. So something is wrong here.
Willie
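The check described in this comment can be sketched in Python: read the quoted path assignments out of the namelist text and report whether each file is missing, empty or non-empty. This is only a rough sketch. The helper names `namelist_paths` and `file_status` are hypothetical, and the parser handles just the simple `key='value'` lines shown above, not full Fortran namelist syntax.

```python
import os
import re

# Matches simple lines such as: astart='/some/path'
# (an assumption; real namelists can be more complicated than this).
ASSIGN_RE = re.compile(r"^\s*(\w+)\s*=\s*'([^']*)'")


def namelist_paths(text, keys):
    """Collect the quoted values of the requested keys from namelist-style text."""
    found = {}
    for line in text.splitlines():
        match = ASSIGN_RE.match(line)
        if match and match.group(1) in keys:
            found[match.group(1)] = match.group(2)
    return found


def file_status(path):
    """Classify a path as 'missing', 'empty' (zero length) or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    return "empty" if os.path.getsize(path) == 0 else "ok"


if __name__ == "__main__":
    # Illustrative text only; in practice this would be read from SHARED and RECONA.
    sample = "ainitial='/tmp/glm_astart'\nastart='/tmp/RA1T_astart'\n"
    for key, path in namelist_paths(sample, {"ainitial", "astart"}).items():
        print(key, path, file_status(path))
```

A readable ainitial together with an "empty" astart would match the situation here, where the input dump is fine and the output dump has zero length.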
comment:7 Changed 18 months ago by willie
The model is
$AINITIAL --> Reconfig exec --> $ASTART
so this shows it started to write the output dump and then hung.
In the um app, go to IO System Settings → Print Manager control and switch on pmt_force_flush.
In um → env → Runtime Controls → Atmosphere only, set PRINT STATUS to extra diagnostics messages.
Do this in addition to Grenville's changes and then run again.
Regards
Willie
comment:8 Changed 18 months ago by amenon
Thanks a lot, Willie and Grenville. I was in Bristol for training yesterday. I will try these and get back to you soon.
comment:9 Changed 18 months ago by amenon
Hi,
I made the above changes. It failed with the following description:
atpFrontend.exe: main: Build of MRNet network failed 'MRNet: Network failure'
Regards,
Arathy
comment:10 Changed 18 months ago by grenville
Hi Arathy
It appears that your ancillary data (land-sea mask etc.) are not on the same rotated grid as your RA1T domain. You probably need to regenerate the ancil files for the (20, -75) rotated pole (they are currently on (90, 180)).
Grenville
comment:11 Changed 18 months ago by grenville
Hi Arathy
Please let us know if Stu's suggestions worked - we'll delve into the reconfig in the meantime.
Grenville
comment:12 Changed 18 months ago by ros
- Owner changed from amenon to um_support
- Status changed from new to assigned
comment:13 Changed 18 months ago by grenville
Hi Arathy
This works OK if the standard coast adjustment method is used instead of the spiral or circle method. Search for coast_adj_method and set it to standard for the nested UM.
I don't know why, but the reconfiguration stalls when handling the ice fraction with the spiral or circle methods.
I ran on 4 nodes in less than 20 mins.
Grenville
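To locate the setting mentioned above, something like the following Python sketch could be used to search a suite's configuration files for `coast_adj_method`. It assumes the setting lives in files ending in `.conf` somewhere under the suite directory. The function name `find_setting` and the example path are hypothetical, and the value to set for "standard" is not shown because the ticket does not give it.

```python
import os


def find_setting(root, name="coast_adj_method"):
    """Return (path, line_number, line) for every .conf line mentioning `name`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Assumption: the setting is stored in a *.conf file.
            if not filename.endswith(".conf"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if name in line:
                        hits.append((path, number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical location of a checked-out suite; adjust to your own copy.
    for path, number, line in find_setting(os.path.expanduser("~/roses/u-ay368")):
        print(f"{path}:{number}: {line}")
```

Each hit shows which file holds the setting, so it can then be changed there or through the Rose GUI.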
comment:14 Changed 18 months ago by amenon
It worked for me too. Thanks a lot Grenville.
comment:15 Changed 18 months ago by grenville
- Resolution set to fixed
- Status changed from assigned to closed
Please set
RCF_PRINTSTATUS to "Extra diagnostic messages", and add
ATP_ENABLED = 1
to
[runtime]
ATP_ENABLED = 1
Then reduce the wallclock to, say, 10 mins and try again.