Opened 3 months ago

Closed 2 months ago

#2844 closed help (fixed)

Job killed: walltime exceeded limit with no output

Reported by: eeac Owned by: um_support
Component: UM Model Keywords:
Cc: HadGEM3-A Platform: ARCHER
UM Version: 8.4

Description (last modified by willie)

Hi,

After managing to compile the model (see ticket #2833), I cannot get job qaaga to run successfully.

I'm getting the following walltime error in the leave files in /home/n02/n02/eeac/output (see the latest one), but nothing more specific, which makes it difficult to debug.

=>> PBS: job killed: walltime 1839 exceeded limit 1800
/var/spool/PBS/mom_priv/jobs/6098256.sdb.SC[352]: .: line 265: 10127: Terminated

I've tried various job time limits with no luck.

There is also the issue of no model output and essentially no information about what might be wrong. This could be related to the fact that I have been switching the archiving and the automatic post-processing on and off, as I thought #2820 described issues similar to mine, but either way there was no output.

I also can't quite understand the output in the job's .out file (work directory):

===== UM RUN OUTPUT =====
qsatmos: %MODEL% output follows:-
qsatmos: Stack requested for UM job: GB

I'd appreciate your help with this!

This follows on from #2833

Cheers,
Andreas

Change History (11)

comment:1 Changed 3 months ago by grenville

Andreas

You need to include fcm:um_br/pkg/Config/vn8.4_ncas/src in the build. You will also need to fix the start time of the model to match the start time of the dump (you can set the start time to all zeros and it'll pick up the time from the dump).

But I can't recommend using this old model version - it may not survive the transition to ARCHER2 in Q1 2020. Can you not find a more recent version? The most recent would be best (UM 11.3?).

Grenville

comment:2 Changed 3 months ago by eeac

Hi Grenville,

Thanks for this; I'll give it a try as soon as possible and get back to you (might be next week).

The reason I opted for this old version is that we already had a simple set-up for me to edit and run. The planned simulations won't take much time, so if all goes well there won't be a reason to run this version again. There is also the fact that, as I understand it, UM versions above 10.x are somewhat more expensive to run.

Since I'd like to run an atmosphere-only piControl run (N96L85), changing only the SSTs and SIC and then perturbing the CO2 concentration in a few sensitivity runs, do you know of any available job I could take a look at and perhaps edit? I would also need a relevant start dump in that case.

Cheers,
Andreas

comment:3 Changed 2 months ago by eeac

With Grenville's vn8.4_ncas config included in the build, the model was able to start the run, only to produce the following UM_SETUP error:

Error in routine: UM_SETUP
Error Code: 2
Error Message: READHK : Failed in OPEN of input unit
Error generated from processor: 0
This run generated 5 warnings

That points to an error reading the UM housekeeping file, but I cannot pinpoint exactly where it's coming from.

The full name of the leave file in the output dir is: qaaga000.qaaga.d19094.t083604.leave

Since I'm changing the SST and SIC ancillaries, I tested whether those were causing the issue (job qaagb) by switching them to the default climatologies, only to get the same error.

Any suggestions please?
Andreas

comment:4 Changed 2 months ago by willie

Hi Andreas,

The housekeeping file is only needed by Met Office operational runs, and your job is not operational in that sense. The UMUI deems that it is operational because your RUNID, qaaga, begins with a q!

I think you can get round this by editing the umuisubmit_rcf and umuisubmit_run files to change OPERATIONAL=true to OPERATIONAL=false.

So run your job from the UMUI as usual and let it fail. On ARCHER, in the umui_runs directory, find the latest version of the job, go into it, and edit umuisubmit_run. Then,

 qsub umuisubmit_run

You need to do this each time you want to do a run.
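
For reference, a minimal shell sketch of that sequence (the qaaga-* directory name is a hypothetical placeholder; use the newest directory the UMUI created for your run):

 # On ARCHER, after the UMUI submission has failed:
 cd ~/umui_runs/qaaga-<latest>    # hypothetical placeholder: pick the newest qaaga-* directory
 sed -i 's/OPERATIONAL=true/OPERATIONAL=false/' umuisubmit_rcf umuisubmit_run
 qsub umuisubmit_run              # resubmit the edited run script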

Willie

comment:5 Changed 2 months ago by eeac

Hi Willie,

Thanks for this. I believe that when I copied this job initially I made it operational without any understanding of what that means. So, to avoid the editing you suggested before each run, I copied this job (qaaga) to a new experiment with a new job id, xokfb (in this case not operational). I thought this might bypass the problem of it being operational and crashing?

The copied job xokfb manages to run for a minute and a half, but then crashes with the following error:

ATP Stack walkback for Rank 24 starting:
bi_linear_h_@…:505
ATP Stack walkback for Rank 24 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 24, 0, 1, 13, 97
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat
_pmiu_daemon(SIGCHLD): [NID 04319] [c6-2c1s7n3] [Mon Apr 8 19:33:56 2019] PE RANK 71 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 04662] [c0-3c0s13n2] [Mon Apr 8 19:33:56 2019] PE RANK 109 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 02418] [c4-1c1s12n2] [Mon Apr 8 19:33:56 2019] PE RANK 49 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 04661] [c0-3c0s13n1] [Mon Apr 8 19:33:56 2019] PE RANK 102 exit signal Killed

Unfortunately, I cannot find any ERROR in those core dumps. Would you mind helping me with this?

Cheers,
Andreas

comment:6 Changed 2 months ago by willie

Hi Andreas,

The .leave file shows that it has failed to converge in the first time step. The start dump is free from NaNs, so this suggests a problem with the initial conditions, e.g. the ancillary files. But I notice you are not using reconfiguration, so you could try that too.

Willie

comment:7 Changed 2 months ago by willie

  • Description modified (diff)

comment:8 Changed 2 months ago by eeac

Hi Willie,

Thanks for this. As you correctly suggested, the problem was caused by the ancillaries I created. Using the default SST/SIC ancillaries has solved this issue without having to use reconfiguration.

I'll try to get to the bottom of the issue with the ancillaries and create a new ticket if I can't get around it.

Cheers,
Andreas

comment:9 Changed 2 months ago by eeac

Just a side note on your alternative suggestion: switching on the reconfiguration leads to an error, as I can see in the leave file (UMRECON builds OK, though):
/home/n02/n02/eeac/output/xokfc000.xokfc.d19099.t173148.comp.leave

qsub: Unknown queue

I've managed to change the date of the dump with the change_dump_date utility, so the date issue is sorted.

I understand that I don't necessarily need to reconfigure, since I haven't added any prognostic variables compared to the run I copied, and I can initialise the run from an existing dump, as I'm currently doing successfully.

But out of curiosity, why would the reconfiguration fail?

Cheers,
Andreas

comment:10 Changed 2 months ago by willie

Hi Andreas,

One of the hand edits,

~ukca/hand_edits/VN8.4/ARCHER_debug_rcf.ed

selects the debug queue, but this no longer exists; it has been replaced by the short queue - see http://www.archer.ac.uk/documentation/user-guide/batch.php#sec-5.14. So just create a new hand edit to deal with this.
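
For what it's worth, a hedged shell sketch of what a replacement hand edit could do, assuming ARCHER_debug_rcf.ed does nothing more than select the queue for the reconfiguration submit script (check what it actually edits before copying this):

 #!/bin/sh
 # Hypothetical replacement hand edit: point the reconfiguration job at the
 # "short" queue now that the "debug" queue has been retired on ARCHER.
 # Assumes the queue is set via a '#PBS -q' directive in umuisubmit_rcf.
 sed -i 's/#PBS -q debug/#PBS -q short/' umuisubmit_rcf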

I'll close this ticket now.

Willie

comment:11 Changed 2 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed