Opened 8 months ago

Closed 8 months ago

#2780 closed help (fixed)

out of cpu resources u-be699

Reported by: xd904476
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 10.7

Description

Hi,
I have just looked into my suite (u-be699) and I can see that there are a few problems regarding resources:

[job-submit cmd] cylc jobs-submit --host=login2.archer.ac.uk --user=dflocco --remote-mode -- '$HOME/cylc-run/u-be699/log/job' 20140101T0000Z/coupled/10
[job-submit ret_code] 188
[job-submit out] 2019-02-14T03:42:27Z|20140101T0000Z/coupled/10|188|None
(dflocco@…) 2019-02-14T03:42:27Z [STDERR] qsub: Job exceeds queue and/or server resource limits

I have also looked at my JASMIN directory, where the files should be stored. Only the files up to April are there, although the model seems to have run for longer. This may just be because a full year has not been produced yet, so the dumping frequency has not yet been reached.
I am not sure whether I am somehow out of space or CPU resources (I had 16 MAUs the other day). Perhaps I have not set up the model in such a way that it will use these resources? Or am I writing files that are too big?
thanks,
Dani

Change History (14)

comment:1 Changed 8 months ago by grenville

Dani

This means your request for nodes and time is not appropriate for the queue you're submitting to. See the job script:
/home/n02/n02/dflocco/cylc-run/u-be699/log/job/20140101T0000Z/coupled/10/job

It contains #PBS -l walltime=24:01:00

The suite adds 1 minute (I don't recall why), so set your job time to 23 hrs 58 mins or less.
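
The arithmetic behind this advice can be sketched in a few lines of Python. This is only an illustration, not part of the suite: the 24-hour queue limit is an assumption inferred from the error and the advice above (the ticket does not state it), and the names QUEUE_LIMIT, SUITE_PADDING and fits_queue are hypothetical.

```python
from datetime import timedelta

# Assumed queue limit: implied by the advice in this ticket, not stated in it.
QUEUE_LIMIT = timedelta(hours=24)
# Extra minute the suite adds to the requested job time (per the ticket).
SUITE_PADDING = timedelta(minutes=1)


def parse_hms(text: str) -> timedelta:
    """Parse an HH:MM:SS string such as '24:01:00' into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def submitted_walltime(requested: str) -> timedelta:
    """Return the walltime that ends up in the #PBS directive."""
    return parse_hms(requested) + SUITE_PADDING


def fits_queue(requested: str) -> bool:
    """True if the padded walltime stays strictly under the assumed limit."""
    return submitted_walltime(requested) < QUEUE_LIMIT


if __name__ == "__main__":
    for requested in ("24:00:00", "23:58:00", "23:57:00"):
        print(requested, submitted_walltime(requested), fits_queue(requested))
```

A requested time of 24:00:00 becomes 24:01:00 after padding, which matches the walltime seen in the job script.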

Grenville

comment:2 Changed 8 months ago by xd904476

Thank you. If I change the walltime to 23:57, can I just retrigger the model to run, or do I need to recompile in any way?

Thanks,
Dani

comment:3 Changed 8 months ago by grenville

Change the time in the suite on puma, then in the puma suite directory run

rose suite-run --reload

then re-trigger the task.

comment:4 Changed 8 months ago by xd904476

thanks

comment:5 Changed 8 months ago by xd904476

Hi Grenville,
I have changed the walltime to 23:57 and also the wallclock time to 2:59h (rather than 3h), but the model still doesn't get past the coupled stage.
The same suite was running a few days back, but I had it cycling too often for a long control run.
I am not sure about what to change to make it run again.
Thanks,
dani

comment:6 Changed 8 months ago by grenville

Dani

You have 1 thread but have requested IO servers - either use 2 threads or set number of IO servers to zero.

Your CICE configuration is also not right for 12x8 processors - it is right for 9x8 processors.
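
The two consistency rules above can be written down as a small check. This is a hedged sketch only: the CoupledSettings class and its field names are hypothetical and do not correspond to real suite variables, and the IO server count used in the example is illustrative because the ticket does not give it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CoupledSettings:
    """Hypothetical container; field names are not real suite variables."""
    atmos_threads: int
    io_server_processes: int
    ocean_decomposition: Tuple[int, int]  # processors requested for the ocean
    cice_decomposition: Tuple[int, int]   # processors CICE is configured for


def check_settings(settings: CoupledSettings) -> List[str]:
    """Return a list of problems; an empty list means both rules pass."""
    problems = []
    # Rule 1: IO servers were requested but only 1 thread is in use.
    if settings.io_server_processes > 0 and settings.atmos_threads < 2:
        problems.append("IO servers requested with fewer than 2 threads")
    # Rule 2: the ocean decomposition must match what CICE is configured for.
    if settings.ocean_decomposition != settings.cice_decomposition:
        problems.append("ocean decomposition does not match CICE configuration")
    return problems


if __name__ == "__main__":
    # Illustrative values: 1 thread, some IO servers, 12x8 ocean vs 9x8 CICE.
    broken = CoupledSettings(1, 8, (12, 8), (9, 8))
    fixed = CoupledSettings(1, 0, (9, 8), (9, 8))
    print(check_settings(broken))
    print(check_settings(fixed))
```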

Grenville

comment:7 Changed 8 months ago by xd904476

Hi Grenville, thanks a lot. What would be the best thing to do then? Should I change the CICE configuration, or the number of NEMO processors? Should I change the atmosphere too?
Thanks,
Dani

comment:8 Changed 8 months ago by grenville

Dani

I doubt IO servers are doing much at this resolution (I don't know for sure), so I'd say set the number of IO servers to zero, and use 9x8 processors for the ocean.

However, I don't know how close to 23:57 that configuration will get with its 12 month cycling. If you are not sure, you could drop that down to 9 months maybe?

Better still to test it for a month.

Grenville

comment:9 Changed 8 months ago by xd904476

Thanks Grenville,
I did that but I have run out of space on /nerc/n02/n02/dflocco. I would like to delete some of the old files of this suite but I am not sure which ones.
The 12 months are there because the suite ran for 59 minutes (at least it did a few days ago). But I can reduce the cycling to be on the safe side.
thanks,
dani

comment:10 Changed 8 months ago by grenville

I increased the RDF quota - it may not be usable until tomorrow, I'm afraid.
Please see my copy of your suite in case I've not been clear which settings to change (see /home/grenville/roses/u-be699)

grenville

comment:11 Changed 8 months ago by xd904476

Thanks Grenville,
I have just deleted the "archive" dir on my rdf.
I'll set the run to go again.

Thank you,
dani

comment:12 Changed 8 months ago by xd904476

Hi Grenville,
I have compared the changes in our directories and tried to fix things, but some of the switches were not right for me.
I already ran this suite successfully a few days ago and I now need to set up the proper run. Therefore I have not set the PP transfer to false.
With these changes, though, my suite stops at the "coupled" stage. Perhaps it is still something to do with the processors or domains?
thanks,
dani

comment:13 Changed 8 months ago by ros

Hi Dani,

Domain decomposition → atmosphere set IO Server processes = 0
Domain decomposition → ocean set Number of processes in XIOS server = 6

Do a 1 month test run first. See how long the 1 month coupled task takes to run (See walltime at the bottom of the job.out file).

Then set up for your long run. Set Total Run length and then appropriate cycling frequency based on the 1 month timings - suspect 12 month cycling will still be ok.
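
One way to turn the 1 month timing into a cycling frequency is sketched below. It is an illustration under stated assumptions: the function name max_months_per_cycle, the safety factor of 0.8 and the 90 minute timing in the example are all hypothetical, and the sketch assumes run time scales linearly with the number of months. The 23:57 limit is the job time discussed earlier in this ticket.

```python
import math


def max_months_per_cycle(minutes_per_month: float,
                         limit_minutes: float,
                         safety: float = 0.8) -> int:
    """Largest whole number of model months that fits in one job.

    Assumes run time scales linearly with the number of months and keeps
    a safety margin (hypothetical default: use at most 80% of the limit).
    """
    if minutes_per_month <= 0:
        raise ValueError("minutes_per_month must be positive")
    return math.floor(limit_minutes * safety / minutes_per_month)


if __name__ == "__main__":
    limit = 23 * 60 + 57  # job time of 23:57 expressed in minutes
    # Hypothetical timing: suppose the 1 month test took 90 minutes.
    print(max_months_per_cycle(90, limit))
```

The result is only an upper bound; the chosen cycling frequency also has to suit the total run length.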

Hope that helps.
Cheers,
Ros.

comment:14 Changed 8 months ago by ros

  • Platform set to ARCHER
  • Resolution set to fixed
  • Status changed from new to closed
  • UM Version set to 10.7

Suite successfully running, so closing this ticket now.
