Opened 4 years ago

Closed 3 years ago

#2269 closed help (answered)

GC3 submission error

Reported by: charlie Owned by: annette
Component: UM Model Keywords: Out of memory
Cc: Platform: ARCHER
UM Version: 10.6



I am trying to run a GC3.0 simulation (u-ap908) and have received the following error during the coupled stage (more or less straight after it began running):

[FAIL] GC3-coupled # return-code=137

The last time I got this error, it was to do with having the wrong number of ocean processes - but, with help from Ros, I fixed this. I have run this suite successfully for one month, and it worked fine. The only difference between my test run, and my current run, is the run length and cycling. In my test run, I was running for 1 month with a wallclock time of 2 hours and a cycling period of 10 days. In my current run, I am running for 10 years with a wallclock time of 20 hours and a cycling period of 1 year. Unless I have misunderstood something, I thought this is what Ros told me to do. Please can you help?


Change History (27)

comment:1 Changed 4 years ago by willie

  • Keywords Out of memory added
  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Charlie,

The issue is indicated by the line above this error:

[NID 01057] 2017-09-09 02:48:39 Apid 28503810: OOM killer terminated this process.

This means the coupled task has run out of memory. You need to increase the number of processors. Try doubling the number.


comment:2 Changed 4 years ago by charlie

Hi Willie,

Thanks, but why has this happened? As I said, I have already run this suite successfully for one month, with the same number of processors (16 x 18 for the atmosphere). So why is there now a problem, when all I have done is increased the run length and cycling period? What should I change the atmosphere processors to? And should I increase the ocean processors as well, and if so, to what?


comment:3 Changed 4 years ago by willie

  • Owner changed from willie to annette
  • Status changed from accepted to assigned

It looks like the suite has fallen over quite a way into the first cycle (after 8 hours of runtime). It is not clear which component has run out of memory (atmos, ocean or XIOS), but in our experience it is usually XIOS that needs more memory when you increase the cycle length.
To increase the nodes for XIOS, edit the rose-suite.conf file to change XIOS_NPROC as follows:
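(The actual snippet was not preserved in this ticket; the sketch below shows the intended kind of edit, with an illustrative value only — double whatever XIOS_NPROC is currently set to in your suite.)

```
# rose-suite.conf -- illustrative value only: double your suite's
# current XIOS processor count
XIOS_NPROC=12
```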


In case this doesn't work, you can also add an environment variable to print the node ids and ranks so that we can see which component has failed. To do this edit the suite.rc file and under [[coupled]] and [[[environment]]] add the following line:
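(The variable name was not preserved in this ticket. On Cray systems such as ARCHER one candidate is Cray MPICH's rank-placement report; treat the name and value below as an assumption and verify them against `man intro_mpi` on the machine.)

```
        # Assumed Cray MPICH setting: prints the rank-to-node
        # placement at job startup (verify via `man intro_mpi`)
        MPICH_RANK_REORDER_DISPLAY = 1
```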


This can add a lot of prints to the log files, so if it seems to run for a couple of cycles you can remove the line and run rose suite-run --reload to pick up the changes without otherwise affecting the run.

comment:4 Changed 4 years ago by charlie

Thanks Annette. I've now done as you suggested, i.e. increased the number of processors for XIOS and have added the extra line into my suite.rc. If this does work, i.e. if it runs for a couple of cycles or more, then can I just remove that line and run that command, even if the suite is still running (i.e. without disrupting the cycles or queueing)?

comment:5 Changed 4 years ago by annette

Yep, just run the reload command and it will be picked up on the next cycle. Assuming you haven't made any other changes, it will not affect the run in any other way.


comment:6 Changed 4 years ago by charlie

Hi Annette,

Having done as you suggested, my suite has again failed with the same message: OOM killer terminated this process

It fell over more or less at the same point as before, during the first cycle after about 8 hours (submitted 14:49:02, started 01:18:00 and finished 09:55:00).

I added the line you suggested, so where do we need to look to find out which component ran out of memory? Clearly it wasn't just XIOS.


comment:7 Changed 4 years ago by annette

Hi Charlie,

In the job.err file you can see which MPI ranks are running on which nodes. From this it looks like it is the ocean that has run out of memory. It is not clear though why it is failing so far into the run (after over 7 months of model time).

We could increase the number of ocean and sea-ice processors, but it might be easier just to reduce the cycle length, so that you only run for 6 months at a time. You should also decrease the requested run time accordingly to reduce queuing time.

It is not clear whether running with longer or shorter job times is more efficient in terms of ARCHER queue time. I don't think there is an optimal solution.


comment:8 Changed 4 years ago by charlie

Thanks Annette. I have just resubmitted my suite, reducing my cycling period down to 6 months and my wall clock time to 12 hours (given that it can do 1 year in ~18 hours). I will let you know what happens…

comment:9 Changed 4 years ago by annette

Hi Annette

I resubmitted my suite over the weekend, reducing the cycling period to 6 months, and it got a bit further but then fell over. I can't send you the exact error as I'm in a meeting in Oxford all day, but would you be able to take a look at my log file to see what went wrong this time? My suite is u-ap908.


comment:10 Changed 4 years ago by grenville


I think you ran out of space on /work.


comment:11 Changed 4 years ago by charlie

Hi Grenville,

Okay, I see what you mean. This is because, as well as my monthly diagnostics (which I want to keep, but which are only 145M per month so relatively small), it is also outputting 488G worth of CICE and NEMO data. These appear to be mostly start dumps.

Is there a way I can automatically remove all of this, either within my suite or by running some sort of cron job every 24 hours to delete these? If the latter, how do I set up a cron job on ARCHER? Do I need to keep any of these start dumps for restarting? In other words, can I delete all of them, or just some of them?
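For reference, a minimal sketch of such a daily cleanup, assuming a hypothetical output directory and that the dumps match a `*restart*` glob (both assumptions — point it at your suite's real layout before relying on it):

```shell
#!/bin/sh
# Hypothetical cleanup script: delete NEMO/CICE restart dumps more
# than two days old. DATADIR is an assumed path -- change it to your
# suite's actual output directory first.
DATADIR="${DATADIR:-$HOME/work/u-ap908/share/data}"
if [ -d "$DATADIR" ]; then
    find "$DATADIR" -name '*restart*' -type f -mtime +2 -delete
fi
```

Saved as, say, `~/clean_restarts.sh`, it could be run nightly with a crontab entry such as `0 3 * * * $HOME/clean_restarts.sh` (edit with `crontab -e`) — though whether ARCHER login nodes permit user cron jobs is worth checking with the helpdesk first.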

As you will see, I have already dramatically cut down my STASH so that it is only outputting data to the *pm stream. Other than that, I don't need any other output.


comment:12 Changed 4 years ago by charlie


Sorry to bother you again as I realise you are very busy with other requests, but did anyone get a chance to respond to my questions above?


comment:13 Changed 4 years ago by charlie

Further to my last message, I have now had a good look at the CICEhist and NEMOhist directories and they appear to be all restart dumps. Is there an equivalent of the old "Delete superseded start dumps" within Rose? This would probably solve my problem.


comment:14 Changed 4 years ago by annette

Hi Charlie,

We are looking into this for you. The archiving with the coupled model is not as straightforward as it might be.


comment:15 Changed 4 years ago by charlie

Thanks Annette. But I can't believe I am the only person to have this problem. Surely other people doing equally long runs have had to deal with it - otherwise, with each month taking up that amount of storage, everyone would run out of space all the time? This must have come up before?


comment:16 Changed 4 years ago by grenville


u-ap908 does not have post processing enabled, so it will not delete anything. In the Build and Run section choose Post Processing, then go to postproc and configure the app. There are options to delete superseded files — I am not sure how it handles NEMO/CICE files. I have run with postproc on and have not ended up with masses of NEMO/CICE restarts.

There is an option to convert to pp, so that will reduce atmosphere volumes by a factor of 2.

There are many post processing options — it's probably best that you select those to suit your run, then test that it's done what you want.


comment:17 Changed 4 years ago by grenville


u-ap908 is not set up correctly to post process (Annette just alerted me) — it appears to be a converted Monsoon job.

There will be several changes required to fcm_make_pp and postproc - u-an561 has ARCHER post processing included and working. That'd be a good template to work from.


comment:18 Changed 4 years ago by charlie


Thanks very much, and sorry for not replying sooner. I confess I am a little out of my depth when it comes to configuring these two apps. I have turned on post processing within Build and Run, but I don't really understand the various options within the postproc app, so I don't know what to set them to. What else do I need to change, apart from this app? It's difficult to simply compare my suite with the one you mentioned above, as there are a lot of differences (obviously) and I'm not sure which ones to change.

Please can you help? As I said, all I want to do is run the beast and keep my monthly means. I don't need to keep or archive any start dumps (apart from what is needed to restart), and I don't need any ocean/ice output.


comment:19 Changed 4 years ago by charlie

  • not sure.

comment:20 Changed 4 years ago by grenville


We are testing switching off NEMO output - it should be simple enough, but we've never run a coupled model like this.


comment:21 Changed 4 years ago by charlie

Thank you Grenville, that's much appreciated.

I had a thought - if your test works, instead of doing all of this by email, would it be easier for you if I popped round your office (or Annette's, or whoever is free) and we set up my suite together? Very happy to do that, if it's easier than doing all of this remotely. I will next be in the office on Thursday morning, so could easily pop round if one of you is free.


comment:22 Changed 4 years ago by grenville


Easy enough to switch off NEMO output and reduce the number of NEMO restarts written, I'm still working on testing the deletion of superseded files. ARCHER downtime yesterday hasn't helped.


comment:23 Changed 4 years ago by charlie


Sorry to bother you, but I was just wondering if there had been any progress on this?


comment:24 Changed 4 years ago by grenville


Some — but ARCHER didn't cooperate over the weekend. I need it to run on long enough to test the post processing; it hasn't managed that yet.


comment:25 Changed 4 years ago by charlie

Thanks Grenville, and sorry - I don't mean to hassle you. I'm fully aware of the frustrations ARCHER causes whenever it has a paddy.

comment:26 Changed 4 years ago by grenville

Hi Charlie

I believe we now have a configuration which doesn't keep unnecessary NEMO/CICE start files, doesn't create any NEMO diagnostic output, converts UM files to pp and drops them in /nerc.

Please take a look at my copy of your suite u-ap908. There are quite a lot of changes.

Please move your current u-ap908 aside and copy mine (on PUMA); that way you will be able to commit changes (i.e. it will be yours again).
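In shell terms (paths hypothetical — the actual location of the edited copy on PUMA isn't given here), that swap would look something like:

```shell
# Keep the original safe, then take the edited copy. The source path
# is a placeholder -- substitute the real location of the edited suite.
mv ~/roses/u-ap908 ~/roses/u-ap908.orig
cp -r /path/to/grenville/u-ap908 ~/roses/u-ap908
```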

I still haven't correctly configured CICE to switch off diagnostic output - I'll keep tinkering, but I think ultimately that will entail a simple namelist change.

I reduced the cycle time to monthly — just for testing.


comment:27 Changed 3 years ago by willie

  • Resolution set to answered
  • Status changed from assigned to closed