Opened 6 years ago

Closed 5 years ago

#1281 closed help (fixed)

quota // compiled job submission // long compile times

Reported by: cwright Owned by: willie
Component: ARCHER Keywords: quota, compile time, submission
Cc: Platform: ARCHER
UM Version: 6.6.3

Description

Hi,

I'm having a few problems with my job submissions, which I've put in one message in case they're related.


Quota


I seem to be hitting my space quota limit on Archer very fast - am I doing something wrong, or is it quite a small quota? If the latter, could I increase it please? I seem to be hitting the limit when I try to compile a fourth job - three is fine, but the fourth fails to compile due to lack of space. I think my quota on Hector was 1TB; my current Archer quota according to SAFE appears to be 10GB.


Pre-compiled submissions failing


When I submit already-compiled jobs (i.e. option 1 in subindep_compile), the submission fails with the error message "qsub: Archer: Please specify walltime". See e.g. job xiwxz. I suspect I'm missing some kind of default job submission setting? This has been happening since I moved to Archer, so I've probably botched reconfiguring my system files (e.g. .login) after the transfer!


Very slow compilations


Compilation is very slow and often fails because it overruns the default hour. In particular, seven of my last eight submitted jobs have failed in this way. There's nothing special about these jobs - they're exactly the same as ones which are compiling fine, and I can't see any obvious pattern to the failed ones except that they seem to happen in clusters (e.g. I had three or four in a row this evening). See e.g.

~cwright/um/umui_out/xiwxg000.xiwxg.d14112.t183853.comp.leave

and

~cwright/um/umui_out/xiwxg000.xiwxg.d14112.t160146.comp.leave

The former hit the limit at one hour, whilst the latter compiled in around 15 minutes, but there was no difference between the submitted jobs - if you check the edit history of xiwxg, you'll see I made no changes between the times these jobs were submitted. Bit puzzled!

Change History (6)

comment:1 Changed 6 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Corwin,

You should carefully follow the instructions at http://cms.ncas.ac.uk/wiki/Archer and on the ARCHER website it references. You need to compile your code for ARCHER.

Several users have reported long compile times. I have gotten round the problem by increasing the compile time to 4 hours.

I hope that helps.

Regards

Willie

comment:2 Changed 6 years ago by cwright

Hi Willie,

I've already done everything on that page, with the exception of the change to make the number of cores a multiple of 24, which I'll do now - I think this just relates to cost-efficiency of the actual run rather than compilation though, as the compilation is done on the serial nodes?

I tried increasing the compile time to four hours, but that didn't resolve the problem - I set xixwg and xiwxb compiling for four hours before I went to bed, and both timed out before the compile completed. Compile times seem to be extremely erratic as well. As an example, here are ten compiles of xiwxg over the last few days. The job is identical in each case, and is fairly simple - it's a 6.6.3 run using default atmospheric behaviour and daily surface forcings (SST/ice) from a file in ~/work/ancil.

Time allowed (h:mm) / time taken (h:mm) / result

4:00 / 4:00 / timed out
4:00 / 2:53 / compiled
2:00 / 2:00 / timed out
1:00 / 0:09 / compiled
1:00 / 0:08 / failed
1:00 / 0:10 / failed
1:00 / 1:00 / timed out
1:00 / 1:00 / timed out
1:00 / 0:52 / compiled
1:00 / 1:00 / timed out
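
To make the spread in these figures easier to see, here is a minimal Python sketch that tallies the outcomes and the range of successful compile times. The records are copied from the list above, reading each time as hours and minutes; the names `RECORDS`, `to_minutes` and `summarise` are illustrative only and are not part of the UM or UMUI.

```python
from collections import Counter

# (time allowed, time taken, result), copied from the list above.
# Times are hours and minutes; both "2:53" and "2.53" are read as 2 h 53 min.
RECORDS = [
    ("4:00", "4:00", "timed out"),
    ("4:00", "2:53", "compiled"),
    ("2:00", "2:00", "timed out"),
    ("1:00", "0:09", "compiled"),
    ("1:00", "0:08", "failed"),
    ("1:00", "0:10", "failed"),
    ("1:00", "1:00", "timed out"),
    ("1:00", "1:00", "timed out"),
    ("1:00", "0:52", "compiled"),
    ("1:00", "1:00", "timed out"),
]


def to_minutes(h_mm):
    """Convert an 'h:mm' (or 'h.mm') string to whole minutes."""
    hours, minutes = h_mm.replace(".", ":").split(":")
    return int(hours) * 60 + int(minutes)


def summarise(records):
    """Return the outcome counts and the fastest/slowest successful compile."""
    outcomes = Counter(result for _, _, result in records)
    compiled = [to_minutes(taken) for _, taken, result in records
                if result == "compiled"]
    return outcomes, min(compiled), max(compiled)


outcomes, fastest, slowest = summarise(RECORDS)
print(dict(outcomes))
print(f"successful compiles took between {fastest} and {slowest} minutes")
```

For an identical job, the successful compiles alone range from 9 minutes to 2 hours 53 minutes, and half of the ten attempts hit the time limit.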

As regards resubmission of already-compiled jobs: I guess I need to add something to some script which will tell qsub the default walltime, since the UMUI clearly isn't passing it on. I'm not sure where to put this though?
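
For reference, PBS `qsub` accepts a walltime request on the command line as `-l walltime=HH:MM:SS`. The Python sketch below only shows how such a command could be assembled; the script name `my_job_script.sh` and both helper functions are hypothetical, and nothing in this ticket confirms that this is how the UMUI should pass the value or that it fixes this particular job.

```python
import shlex


def format_walltime(hours, minutes=0, seconds=0):
    """Format a time limit as the HH:MM:SS string that PBS expects."""
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def build_qsub_command(script, hours, minutes=0):
    """Build a qsub argument list that requests an explicit walltime.

    `script` is a placeholder for whatever submission script is in use.
    """
    walltime = format_walltime(hours, minutes)
    return ["qsub", "-l", f"walltime={walltime}", script]


# Hypothetical script name, used for illustration only.
cmd = build_qsub_command("my_job_script.sh", hours=1)
print(shlex.join(cmd))
```

The same request can also be written as a `#PBS -l walltime=HH:MM:SS` directive at the top of the submission script.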

comment:3 Changed 6 years ago by cwright

Just an additional note that I've now set xixwg going again in a 12x10 processor configuration (i.e. 24x5) - previously 16x8 (32x4). I'll let you know if this makes any difference to the compile times.
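
The arithmetic behind the two configurations can be checked with a short Python sketch. It assumes 24 cores per node, taken from the "multiple of 24" advice mentioned in comment:2; the function and argument names are illustrative only.

```python
CORES_PER_NODE = 24  # assumption: from the "multiple of 24" advice in comment:2


def node_usage(procs_a, procs_b, cores_per_node=CORES_PER_NODE):
    """Return (total cores, fully used nodes, cores left over on a partial node)."""
    total = procs_a * procs_b
    full_nodes, leftover = divmod(total, cores_per_node)
    return total, full_nodes, leftover


for procs_a, procs_b in [(12, 10), (16, 8)]:
    total, full_nodes, leftover = node_usage(procs_a, procs_b)
    status = "fills whole nodes" if leftover == 0 else f"{leftover} cores on a partial node"
    print(f"{procs_a}x{procs_b} = {total} cores: {full_nodes} full nodes, {status}")
```

Under that assumption, 12x10 gives 120 cores, which is exactly five 24-core nodes, whereas 16x8 gives 128 cores, which is five full nodes plus 8 cores on a sixth.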

comment:4 Changed 6 years ago by willie

Hi Corwin,

Although you have set the model to run only, you have selected compile for the reconfiguration. This combination is not allowed for this version of the model. If you want to compile, then compile both at the same time.

I have reported the compile time problem to ARCHER along with your statistics.

Regards,

Willie

comment:5 Changed 6 years ago by cwright

Hi,

thanks for the advice about having to turn off reconfiguration at the same time as turning off compilation - I hadn't realised that, so it's sped things up a fair bit! The random-length compile times are still proving a difficulty when I do need to recompile - I'm still often going over 4 hours - but hopefully Archer will be able to fix that in the nearish future.

I'm having another problem with nudged runs, but it might not be related, so I've opened it as a new ticket: http://cms.ncas.ac.uk/ticket/1287#ticket

Sorry for the slow reply - truly hectic week!

Corwin

comment:6 Changed 5 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed