Opened 11 days ago

Last modified 12 hours ago

#2269 assigned help

GC3 submission error

Reported by: charlie
Owned by: annette
Priority: normal
Component: UM Model
Keywords: Out of memory
Cc:
Platform: ARCHER
UM Version: 10.6

Description

Hi,

I am trying to run a GC3.0 simulation (u-ap908) and have received the following error during the coupled stage (more or less straight after it began running):

[FAIL] GC3-coupled # return-code=137

The last time I got this error, it was to do with having the wrong number of ocean processes, but with help from Ros I fixed this. I have run this suite successfully for one month, and it worked fine. The only difference between my test run and my current run is the run length and cycling. In my test run, I ran for 1 month with a wallclock time of 2 hours and a cycling period of 10 days. In my current run, I am running for 10 years with a wallclock time of 20 hours and a cycling period of 1 year. Unless I have misunderstood something, this is what I thought Ros told me to do. Please can you help?

Charlie

Change History (13)

comment:1 Changed 8 days ago by willie

  • Keywords Out of memory added
  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Charlie,

The issue is the line above this error:

[NID 01057] 2017-09-09 02:48:39 Apid 28503810: OOM killer terminated this process.

This means the coupled task has run out of memory. You need to increase the number of processors. Try doubling the number.
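
Purely as an illustration of the kind of change meant here (doubling one dimension of the atmosphere decomposition), the edit might look like the lines below in rose-suite.conf. The variable names are hypothetical - check your own suite for the settings it actually uses:

# rose-suite.conf - hypothetical variable names, for illustration only
# e.g. a 16 x 18 decomposition (288 processes) doubled to 32 x 18 (576)
ATM_PROCX=32
ATM_PROCY=18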

Regards,
Willie

comment:2 Changed 8 days ago by charlie

Hi Willie,

Thanks, but why has this happened? As I said, I have already run this suite successfully for one month, with the same number of processors (16 x 18 for the atmosphere). So why is there now a problem, when all I have done is increased the run length and cycling period? What should I change the atmosphere processors to? And should I increase the ocean processors as well, and if so, to what?

Charlie

comment:3 Changed 7 days ago by willie

  • Owner changed from willie to annette
  • Status changed from accepted to assigned

Charlie,

It looks like the suite has fallen over quite a way into the first cycle (after 8 hours of runtime). It is not clear which component has run out of memory (atmos, ocean or xios), but in our experience it is usually xios that needs more memory when you increase the cycle length.
To increase the nodes for XIOS, edit the rose-suite.conf file to change XIOS_NPROC as follows:

XIOS_NPROC=16

If this doesn't work, you can also add an environment variable to print the node IDs and ranks so that we can see which component has failed. To do this, edit the suite.rc file and, under [[coupled]] and [[[environment]]], add the following line:

MPICH_CPUMASK_DISPLAY=1
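
A minimal sketch of where this line sits in suite.rc (the [runtime] heading and any existing entries in these sections are left unchanged; only the last line is new):

[runtime]
    [[coupled]]
        [[[environment]]]
            # existing environment variables stay as they are
            MPICH_CPUMASK_DISPLAY=1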

This can add a lot of output to the log files, so if the suite seems to run for a couple of cycles you can remove the line and run rose suite-run --reload to pick up the change without otherwise affecting the run.

Annette

comment:4 Changed 7 days ago by charlie

Thanks Annette. I've now done as you suggested, i.e. increased the number of processors for XIOS and have added the extra line into my suite.rc. If this does work, i.e. if it runs for a couple of cycles or more, then can I just remove that line and run that command, even if the suite is still running (i.e. without disrupting the cycles or queueing)?

comment:5 Changed 7 days ago by annette

Yep, just run the reload command and it will be picked up on the next cycle. Assuming you haven't made any other changes, it will not affect the run in any other way.

Annette

comment:6 Changed 6 days ago by charlie

Hi Annette,

Having done as you suggested, my suite has again failed with the same message: OOM killer terminated this process

It fell over more or less at the same point as before, during the first cycle after about 8 hours (submitted 14:49:02, started 01:18:00 and finished 09:55:00).

I added the line you suggested, so where do we need to look to find out which component ran out of memory? Clearly it wasn't just XIOS.

Charlie

comment:7 Changed 6 days ago by annette

Hi Charlie,

In the job.err file you can see which MPI ranks are running on which nodes. From this it looks like it is the ocean that has run out of memory. It is not clear though why it is failing so far into the run (after over 7 months of model time).

We could increase the number of ocean and sea-ice processors, but it might be easier just to reduce the cycle length, so that you only run for 6 months at a time. You should also decrease the requested run time accordingly to reduce queuing time.
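
As a sketch only: in many suites of this era the cycling period and wallclock request are exposed as variables in rose-suite.conf, so the suggested change might look something like the lines below. The variable names (CYCLE, CLOCK) are hypothetical and the real names in u-ap908 may differ; the durations follow the ISO 8601 form that Cylc cycling normally uses:

# rose-suite.conf - hypothetical variable names, for illustration only
CYCLE='P6M'        # 6 months per cycle instead of 1 year (P1Y)
CLOCK='12:00:00'   # wallclock request reduced from 20 hours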

It is not clear whether running with longer or shorter job times is more efficient in terms of ARCHER queue time; I don't think there is an optimal solution.

Annette

comment:8 Changed 6 days ago by charlie

Thanks Annette. I have just resubmitted my suite, reducing the cycling period to 6 months and the wallclock time to 12 hours (given that it can do 1 year in ~18 hours). I will let you know what happens…

comment:9 Changed 3 days ago by charlie

Hi Annette,

I resubmitted my suite over the weekend, reducing the cycling period to 6 months, and it got a bit further but then fell over. I can't send you the exact error as I'm in a meeting in Oxford all day, but would you be able to take a look at my log file to see what went wrong this time? My suite is u-ap908.

Charlie

comment:10 Changed 3 days ago by grenville

Charlie

I think you ran out of space on /work.

Grenville

comment:11 Changed 3 days ago by charlie

Hi Grenville,

Okay, I see what you mean. This is because, in addition to my monthly diagnostics (which I want to keep, but which are only 145M per month, so relatively small), the suite is also outputting 488G of CICE and NEMO data. These appear to be mostly start dumps.

Is there a way I can automatically remove all of this, either within my suite or by running some sort of cron job every 24 hours? If the latter, how do I set up a cron job on ARCHER? Do I need to keep any of these start dumps for restarting? In other words, can I delete all of them, or only some of them?
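
Purely to illustrate the cron mechanism being asked about (not a recommendation to delete anything yet, since which dumps are safe to remove is the open question): a daily crontab entry on an ARCHER login node might look roughly like the line below, with the path, file pattern and retention period all hypothetical:

# crontab -e on a login node; hypothetical sketch only - it just lists candidates
# Daily at 03:00, find NEMO/CICE files matching *restart* older than 30 days
0 3 * * * find /work/project/user/cylc-run/u-ap908/share/data -name '*restart*' -mtime +30 -print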

As you will see, I have already dramatically cut down my STASH so that it is only outputting data to the *pm stream. Other than this, I don't need any other output.

Charlie

comment:12 Changed 13 hours ago by charlie

Hi,

Sorry to bother you again as I realise you are very busy with other requests, but did anyone get a chance to respond to my questions above?

Charlie

comment:13 Changed 12 hours ago by charlie

Further to my last message, I have now had a good look at the CICEhist and NEMOhist directories and they all appear to be restart dumps. Is there an equivalent of the old "Delete superseded start dumps" option within Rose? This would probably solve my problem.

Charlie
