Opened 4 months ago

Closed 4 months ago

#2490 closed help (fixed)

Wall time exceeded error in recon job

Reported by: amenon Owned by: um_support
Priority: normal Component: UM Model
Keywords: recon, walltime exceeded Cc:
Platform: ARCHER UM Version: 10.9

Description (last modified by willie)

Hi CMS,

This is a continuation of ticket #2428. The reconfiguration job in u-ay368 (which is a copy of the suite u-av692 in ticket #2428, but with Andy Malcolm's optimisation branch switched off) is still failing with the following error:


=>> PBS: job killed: walltime 10890 exceeded limit 10800

aprun: Apid 31052932: Caught signal Terminated, sending to application

Terminated

Received signal TERM

/work/n02/n02/amenon/cylc-run/u-ay368/share/fcm_make/build-recon/bin/um-recon: line 118: 11850 Terminated              rose mpi-launch -v $COMMAND

_pmiu_daemon(SIGCHLD): [NID 00226] [c1-0c0s8n2] [Tue Jun  5 16:20:53 2018] PE RANK 23 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00225] [c1-0c0s8n1] [Tue Jun  5 16:20:53 2018] PE RANK 4 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00246] [c1-0c0s13n2] [Tue Jun  5 16:20:53 2018] PE RANK 72 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00263] [c1-0c1s1n3] [Tue Jun  5 16:20:53 2018] PE RANK 120 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00260] [c1-0c1s1n0] [Tue Jun  5 16:20:53 2018] PE RANK 108 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00267] [c1-0c1s2n3] [Tue Jun  5 16:20:53 2018] PE RANK 132 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00316] [c1-0c1s15n0] [Tue Jun  5 16:20:53 2018] PE RANK 252 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00236] [c1-0c0s11n0] [Tue Jun  5 16:20:53 2018] PE RANK 60 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00305] [c1-0c1s12n1] [Tue Jun  5 16:20:53 2018] PE RANK 195 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00315] [c1-0c1s14n3] [Tue Jun  5 16:20:53 2018] PE RANK 240 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00313] [c1-0c1s14n1] [Tue Jun  5 16:20:53 2018] PE RANK 228 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00310] [c1-0c1s13n2] [Tue Jun  5 16:20:53 2018] PE RANK 204 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00273] [c1-0c1s4n1] [Tue Jun  5 16:20:53 2018] PE RANK 156 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00257] [c1-0c1s0n1] [Tue Jun  5 16:20:53 2018] PE RANK 84 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00235] [c1-0c0s10n3] [Tue Jun  5 16:20:53 2018] PE RANK 48 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00312] [c1-0c1s14n0] [Tue Jun  5 16:20:53 2018] PE RANK 216 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00301] [c1-0c1s11n1] [Tue Jun  5 16:20:53 2018] PE RANK 180 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00297] [c1-0c1s10n1] [Tue Jun  5 16:20:53 2018] PE RANK 168 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00259] [c1-0c1s0n3] [Tue Jun  5 16:20:53 2018] PE RANK 96 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00227] [c1-0c0s8n3] [Tue Jun  5 16:20:53 2018] PE RANK 24 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00272] [c1-0c1s4n0] [Tue Jun  5 16:20:53 2018] PE RANK 144 exit signal Terminated

_pmiu_daemon(SIGCHLD): [NID 00234] [c1-0c0s10n2] [Tue Jun  5 16:20:53 2018] PE RANK 36 exit signal Terminated

cylc (scheduler - 2018-06-05T16:20:53Z): CRITICAL Task job script received signal TERM at 2018-06-05T16:20:53Z

cylc (scheduler - 2018-06-05T16:20:53Z): CRITICAL failed at 2018-06-05T16:20:53Z

Switching off the optimisation branch was a test I did to get past this error, but with no luck. Earlier I also tried increasing the processors to 22x12, as Willie suggested, but that didn't work either. Could you please advise on any possible solutions?

Many thanks

Change History (15)

comment:1 Changed 4 months ago by grenville

Please set RCF_PRINTSTATUS to "Extra diagnostic messages", and add

ATP_ENABLED = 1

to the suite.rc under

[runtime]
    [[root]]
        [[[environment]]]
            ATP_ENABLED = 1

Reduce the wallclock to, say, 10 mins and try again.
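In a cylc-7 style suite.rc, the shorter wallclock would look roughly like the sketch below. The task name is taken from this suite, and the use of execution time limit is an assumption - many Rose suites drive the walltime from a rose-suite.conf variable instead, so adapt it to however u-ay368 is actually set up.

    [[INCOMPASS_km4p4_RA1T_um_recon]]
        [[[job]]]
            # Fail fast on a hang instead of sitting out the full 3-hour limit
            execution time limit = PT10M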

comment:2 Changed 4 months ago by willie

  • Description modified (diff)

comment:3 Changed 4 months ago by willie

Hi Arathy,

The problem is in the INCOMPASS_km4p4_RA1T_um_recon task. Although it has 'run' for three hours, absolutely nothing has been written out. This could be a problem with the dump that it is trying to reconfigure. I can't see where that file is; could you let me know?

Regards
Willie

comment:4 Changed 4 months ago by willie

OK, just found it in the RECONA file. I had a look and the dump is NaN free and readable in xconv.

Willie

comment:5 Changed 4 months ago by willie

The ancillary files listed in the SHARED file are also NaN free. I notice that the RA1T_astart file in /work/n02/n02/amenon/cylc-run/u-ay368/share/cycle/20160701T0000Z/INCOMPASS/km4p4/RA1T/ics has zero length.
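A quick way to confirm that from the command line, using the path above (run on ARCHER, where the /work filesystem is mounted):

ls -lh /work/n02/n02/amenon/cylc-run/u-ay368/share/cycle/20160701T0000Z/INCOMPASS/km4p4/RA1T/ics/RA1T_astart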

Willie

comment:6 Changed 4 months ago by willie

OK, in SHARED we have, in the namelist recon_technical,

ainitial='/work/n02/n02/amenon/cylc-run/u-ay368/share/cycle/20160701T0000Z/glm/ics/glm_astart'

and in RECONA, we have

astart='/work/n02/n02/amenon/cylc-run/u-ay368/share/cycle/20160701T0000Z/INCOMPASS/km4p4/RA1T/ics/RA1T_astart'

and this is a zero length file. So something is wrong here.

Willie

comment:7 Changed 4 months ago by willie

The model is

$AINITIAL --> Reconfig exec --> $ASTART

so this shows it started to write the output dump and then hung.

In the um app, go to IO System Settings → Print Manager control and switch on pmt_force_flush.

In um → env → Runtime Controls → Atmosphere only, set PRINT STATUS to extra diagnostics messages.

Do this in addition to Grenville's changes and then run again.
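Once pmt_force_flush and the extra print status are on, the reconfiguration's progress can be followed in the task's job output as it runs. The path below assumes the standard cylc-run log layout; NN stands for the submit-number directory, so pick the latest one:

tail -f ~/cylc-run/u-ay368/log/job/20160701T0000Z/INCOMPASS_km4p4_RA1T_um_recon/NN/job.out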

Regards
Willie

comment:8 Changed 4 months ago by amenon

Thanks a lot Willie and Grenville. I was in Bristol for a training course yesterday. I will try these and get back to you soon.

comment:9 Changed 4 months ago by amenon

Hi,
I made the above changes. It failed with the following error:
atpFrontend.exe: main: Build of MRNet network failed 'MRNet: Network failure'

Regards,
Arathy

comment:10 Changed 4 months ago by grenville

Hi Arathy

It appears that your ancillary data (land-sea mask etc.) are not on the same rotated grid as your RA1T domain. You probably need to regenerate the ancil files for the (20, -75) rotated pole (they are currently on (90, 180)).
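One way to confirm the pole mismatch, if Iris is available, is to load a field from one of the ancillaries (and from the driving dump) and print its coordinate system. This is only a sketch and the path is a placeholder to fill in:

import iris

# Placeholder path: substitute one of the ancillary files listed in SHARED.
ancil_path = "/path/to/ancil_file"

# Iris can read UM fieldsfile/ancil format; take the first field.
cube = iris.load(ancil_path)[0]

# For a rotated grid this prints RotatedGeogCS(...) with the pole latitude
# and longitude, which should match the RA1T domain's (20, -75).
print(cube.coord_system())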

Grenville

comment:11 Changed 4 months ago by grenville

Hi Arathy

Please let us know if Stu's suggestions worked - we'll delve into the reconfiguration in the meantime.

Grenville

comment:12 Changed 4 months ago by ros

  • Owner changed from amenon to um_support
  • Status changed from new to assigned

comment:13 Changed 4 months ago by grenville

Hi Arathy

This works OK if, instead of using the spiral or circle coast adjustment method, the standard method is used - search for coast_adj_method and set it to standard for the nested UM.
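To find where that setting lives outside the GUI, a simple search over the suite's app configs works (a sketch; the ~/roses checkout location is an assumption):

grep -rn coast_adj_method ~/roses/u-ay368/app/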

Why, I don't know - the reconfiguration stalls when handling the ice fraction with the spiral or circle methods.

I ran on 4 nodes in less than 20 mins.

Grenville

comment:14 Changed 4 months ago by amenon

It worked for me too. Thanks a lot Grenville.

comment:15 Changed 4 months ago by grenville

  • Resolution set to fixed
  • Status changed from assigned to closed