Opened 2 months ago

Last modified 7 weeks ago

#2269 assigned help

GC3 submission error

Reported by: charlie
Owned by: annette
Priority: normal
Component: UM Model
Keywords: Out of memory
Cc:
Platform: ARCHER
UM Version: 10.6

Description

Hi,

I am trying to run a GC3.0 simulation (u-ap908) and have received the following error during the coupled stage (more or less straight after it began running):

[FAIL] GC3-coupled # return-code=137

The last time I got this error it was to do with having the wrong number of ocean processes, but with help from Ros I fixed that. I have run this suite successfully for one month, and it worked fine. The only difference between my test run and my current run is the run length and cycling: in my test run, I ran for 1 month with a wallclock time of 2 hours and a cycling period of 10 days; in my current run, I am running for 10 years with a wallclock time of 20 hours and a cycling period of 1 year. Unless I have misunderstood something, this is what Ros told me to do. Please can you help?

Charlie

Change History (26)

comment:1 Changed 2 months ago by willie

  • Keywords Out of memory added
  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Charlie,

The issue is the line above this error:

[NID 01057] 2017-09-09 02:48:39 Apid 28503810: OOM killer terminated this process.

This means the coupled task has run out of memory. You need to increase the number of processors. Try doubling the number.
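
For reference, in most GC3 suites the atmosphere decomposition is set in rose-suite.conf with variables along these lines (the names and values here are indicative only; check your own suite):

UM_ATM_NPROCX=16
UM_ATM_NPROCY=18

Doubling one of these doubles the total processor count.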

Regards,
Willie

comment:2 Changed 2 months ago by charlie

Hi Willie,

Thanks, but why has this happened? As I said, I have already run this suite successfully for one month, with the same number of processors (16 x 18 for the atmosphere). So why is there now a problem, when all I have done is increase the run length and cycling period? What should I change the atmosphere processors to? And should I increase the ocean processors as well, and if so, to what?

Charlie

comment:3 Changed 2 months ago by willie

  • Owner changed from willie to annette
  • Status changed from accepted to assigned

Charlie,

It looks like the suite has fallen over quite a way into the first cycle (after 8 hours of runtime). It is not clear which component has run out of memory (atmos, ocean or XIOS), but in our experience it is usually XIOS that needs more memory when you increase the cycle length.

To increase the nodes for XIOS, edit the rose-suite.conf file to change XIOS_NPROC as follows:

XIOS_NPROC=16

If that doesn't work, you can add an environment variable to print the node ids and ranks so that we can see which component has failed. To do this, edit the suite.rc file and, under [[coupled]] and [[[environment]]], add the following line:

MPICH_CPUMASK_DISPLAY=1

This can add a lot of prints to the log files, so if it seems to run for a couple of cycles you can remove the line and run rose suite-run --reload to pick up the changes without otherwise affecting the run.
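
For reference, the relevant fragment of suite.rc then looks like this:

[[coupled]]
    [[[environment]]]
        MPICH_CPUMASK_DISPLAY=1
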
Annette

comment:4 Changed 2 months ago by charlie

Thanks Annette. I've now done as you suggested, i.e. increased the number of processors for XIOS and have added the extra line into my suite.rc. If this does work, i.e. if it runs for a couple of cycles or more, then can I just remove that line and run that command, even if the suite is still running (i.e. without disrupting the cycles or queueing)?

comment:5 Changed 2 months ago by annette

Yep, just run the reload command and it will be picked up on the next cycle. Assuming you haven't made any other changes, it will not affect the run in any other way.
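
For example, from the suite directory on PUMA (assuming the usual ~/roses layout):

cd ~/roses/u-ap908
rose suite-run --reload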

Annette

comment:6 Changed 2 months ago by charlie

Hi Annette,

Having done as you suggested, my suite has again failed with the same message: OOM killer terminated this process

It fell over more or less at the same point as before, during the first cycle after about 8 hours (submitted 14:49:02, started 01:18:00 and finished 09:55:00).

I added the line you suggested, so where do we need to look to find out which component ran out of memory? Clearly it wasn't just XIOS.

Charlie

comment:7 Changed 2 months ago by annette

Hi Charlie,

In the job.err file you can see which MPI ranks are running on which nodes. From this it looks like it is the ocean that has run out of memory. It is not clear though why it is failing so far into the run (after over 7 months of model time).

We could increase the number of ocean and sea-ice processors, but it might be easier just to reduce the cycle length, so that you only run for 6 months at a time. You should also decrease the requested run time accordingly to reduce queuing time.
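
(Where the cycle length is set varies from suite to suite: ultimately it is an ISO 8601 recurrence in the suite.rc scheduling section, along the lines of the sketch below, though many GC suites expose it as a rose-suite.conf variable instead, so check yours first.)

[scheduling]
    [[dependencies]]
        [[[ P6M ]]]
            graph = coupled[-P6M] => coupled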

It is not clear whether running with longer or shorter job times is more efficient in terms of ARCHER queue time; I don't think there is an optimal solution.

Annette

comment:8 Changed 2 months ago by charlie

Thanks Annette. I have just resubmitted my suite, reducing my cycling period to 6 months and my wallclock time to 12 hours (given that it can do 1 year in ~18 hours). I will let you know what happens…

comment:9 Changed 2 months ago by charlie

Hi Annette

I resubmitted my suite over the weekend, reducing the cycling period to 6 months, and it got a bit further but then fell over. I can't send you the exact error as I'm in a meeting in Oxford all day, but would you be able to take a look at my log file to see what went wrong this time? My suite is u-ap908.

Charlie

comment:10 Changed 2 months ago by grenville

Charlie

I think you ran out of space on /work.

Grenville

comment:11 Changed 2 months ago by charlie

Hi Grenville,

Okay, I see what you mean. This is because, as well as my monthly diagnostics (which I want to keep, but which are only 145M per month so relatively small), it is also outputting 488G worth of CICE and NEMO data. These appear to be mostly start dumps.

Is there a way I can automatically remove all of this, either within my suite or by running some sort of cron job every 24 hours to delete them? If the latter, how do I set up a cron job on ARCHER? Do I need to keep any of these start dumps for restarting? In other words, can I delete all of them, or just some of them?
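
Something like this crontab entry is what I have in mind (completely untested, and the path is just a placeholder for wherever the dumps actually end up):

0 3 * * * find /work/n02/n02/charlie/cylc-run/u-ap908/share/data -name "*restart*" -mtime +2 -delete

i.e. every night at 03:00, delete restart files more than two days old.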

As you will see, I have already dramatically cut down my STASH so that it is only outputting data to the *pm stream. Other than these, I don't need any other output.

Charlie

comment:12 Changed 2 months ago by charlie

Hi,

Sorry to bother you again as I realise you are very busy with other requests, but did anyone get a chance to respond to my questions above?

Charlie

comment:13 Changed 2 months ago by charlie

Further to my last message, I have now had a good look at the CICEhist and NEMOhist directories and they appear to be all restart dumps. Is there an equivalent of the old "Delete superseded start dumps" within Rose? This would probably solve my problem.

Charlie

comment:14 Changed 2 months ago by annette

Hi Charlie,

We are looking into this for you. The archiving with the coupled model is not as straightforward as it might be.

Annette

comment:15 Changed 2 months ago by charlie

Thanks Annette. But I can't believe I am the only person to have this problem? Surely other people, doing equally long runs, have had to do this - otherwise, with each month taking up that amount of storage, everyone would run out all the time? This must have come up before?

Charlie

comment:16 Changed 2 months ago by grenville

Charlie

u-ap908 does not have post processing enabled, so will not delete anything. In the Build and Run section choose Post Processing, then go to the postproc app and configure it. There are options to delete superseded files; I am not sure how it handles NEMO/CICE files, but I have run with postproc on and have not ended up with masses of NEMO/CICE restarts.

There is an option to convert to pp, which will reduce atmosphere data volumes by a factor of 2.

There are many post processing options; it's probably best that you select those that suit your run, then test that it's done what you want.
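
As a sketch only (these option names are invented for illustration; the real ones are in the postproc app's rose edit panels and metadata), the sort of settings to look for are ones like:

archive_switch=true
delete_superseded=true
convert_pp=true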

Grenville

comment:17 Changed 2 months ago by grenville

Charlie

u-ap908 is not set up correctly to post process (Annette just alerted me) — it appears to be a converted Monsoon job.

There will be several changes required to fcm_make_pp and postproc - u-an561 has ARCHER post processing included and working. That'd be a good template to work from.
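
To see what needs to change, a recursive diff of the two suite directories is a quick way in (assuming both are checked out under ~/roses on PUMA):

diff -ru ~/roses/u-an561 ~/roses/u-ap908 | less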

Grenville

comment:18 Changed 8 weeks ago by charlie

Hi,

Thanks very much, and sorry for not replying sooner. I confess I am a little out of my depth when it comes to configuring these 2 apps. I have turned on post processing within Build and Run, but I don't really understand the various options within the postproc app, so don't know what to set them to. What else do I need to change, apart from this app? It's difficult to simply compare my suite to the one you mentioned above, as there are a lot of differences (obviously) and I'm not sure which ones to change.

Please can you help? As I said, all I want to do is run the beast and keep my monthly means. I don't need to keep or archive any start dumps (apart from what is needed to restart), and I don't need any ocean/ice output.

Charlie

comment:19 Changed 8 weeks ago by charlie

  • not sure.

comment:20 Changed 8 weeks ago by grenville

Charlie

We are testing switching off NEMO output. It should be simple enough, but we've never run a coupled model like this.

Grenville

comment:21 Changed 8 weeks ago by charlie

Thank you Grenville, that's much appreciated.

I had a thought: if your test works, instead of doing all of this by email, would it be easier if I popped round to your office (or Annette's, or whoever is free) and we set up my suite together? I'm very happy to do that if it's easier than doing all of this remotely. I will next be in the office on Thursday morning, so could easily pop round if one of you is free.

Charlie

comment:22 Changed 8 weeks ago by grenville

Charlie

It is easy enough to switch off NEMO output and reduce the number of NEMO restarts written; I'm still working on testing the deletion of superseded files. The ARCHER downtime yesterday hasn't helped.
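
For reference, the number of NEMO restarts written is controlled by nn_stock in the &namrun namelist (the value below is illustrative; it is a count of timesteps):

&namrun
   nn_stock = 17280
/

The diagnostic output itself is defined in the XIOS XML, where individual output files can be switched off with enabled=".FALSE.".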

Grenville

comment:23 Changed 7 weeks ago by charlie

Hi,

Sorry to bother you, but I was just wondering if there had been any progress on this?

Charlie

comment:24 Changed 7 weeks ago by grenville

Charlie

Some, but ARCHER didn't cooperate over the weekend. I need it to run for long enough to test the post processing; it hasn't managed that yet.

Grenville

comment:25 Changed 7 weeks ago by charlie

Thanks Grenville, and sorry - I don't mean to hassle you. I'm fully aware of the frustrations Archer causes whenever it has a paddy.

comment:26 Changed 7 weeks ago by grenville

Hi Charlie

I believe we now have a configuration which doesn't keep unnecessary NEMO/CICE start files, doesn't create any NEMO diagnostic output, converts UM files to pp and drops them in /nerc.

Please take a look at my copy of your suite u-ap908. There are quite a lot of changes.

Please move your current u-ap908 aside and copy mine (on PUMA); that way you will be able to commit changes (i.e. it will be yours again).
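
Something along these lines should do it (the path to my copy is indicative, so check it first):

cd ~/roses
mv u-ap908 u-ap908.orig
cp -r ~grenville/roses/u-ap908 .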

I still have not correctly configured CICE to switch off diagnostic output. I'll keep tinkering, but I think ultimately that will entail a simple namelist change.
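
If you want to experiment in the meantime, the CICE history streams are controlled by histfreq in the setup_nml namelist; something like this (untested here) should silence them:

&setup_nml
   histfreq = 'x','x','x','x','x'
/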

I reduced the cycle time to monthly — just for testing.

Grenville
