Opened 5 months ago

Closed 4 months ago

#3075 closed help (completed)

missing months with postprocessing task and Gregorian calendar

Reported by: Leighton_Regayre
Owned by: um_support
Component: UM Model
Keywords: nudge Gregorian pp output stream
Cc:
Platform: ARCHER
UM Version: 11.1

Description

Hello,

My suite u-bo845 is being used to spin-up a copy of the UKESM (UM11.1) with the Gregorian calendar, for use in a perturbed parameter ensemble. Hence, it has an additional folder for output labelled "ens_0".

The postprocessing task for this job isn't performing as expected. .pp files are made for some months and not others. Additionally, the size of some .pp files is double the smallest .pp files, which suggests data from two months has been merged.

I need .pp files from the .pn stream for distinct months so that I can efficiently analyse output from over 200 ensemble members. pn stream files on /work are created for all months, but only some are converted to .pp files then deleted.

Output on /work is here:
/work/n02/n02/lre/cylc-run/u-bo845/share/data/History_Data/ens_0

and transferred onto /nerc here:
/nerc/n02/n02/lre/rose_archiving/u-bo845_ens000/

This suite also failed before completion, though output was created for June, the final month of the run. Error messages indicate the job failed because of a complication with an ancillary file, but I'm not sure which one. The error message indicates difficulty finding an entry for field 5, stashcode 216, which looks to be "total precipitation rate", but that isn't an ancillary field, so I'm confused.

Thanks,

Leighton
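A quick way to confirm which months lack a .pp file, and which files look merged, is to scan the archive listing. The following is a minimal sketch only: the filename pattern is inferred from names quoted in this ticket (e.g. bo845a.pn2015mar.pp), the helper names and the 1.8x size threshold are hypothetical, and the file list in the example is made up for illustration.

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Assumed naming pattern: <runid>.p<stream><yyyy><mon>.pp, e.g. bo845a.pn2015mar.pp
PP_NAME = re.compile(
    r"^(?P<runid>\w+)\.p(?P<stream>[a-z0-9])(?P<year>\d{4})(?P<month>[a-z]{3})\.pp$"
)


def missing_months(filenames, stream, year, first="jan", last="dec"):
    """Return month tags between first and last (inclusive) with no .pp file."""
    found = set()
    for name in filenames:
        match = PP_NAME.match(name)
        if match and match["stream"] == stream and int(match["year"]) == year:
            found.add(match["month"])
    start, stop = MONTHS.index(first), MONTHS.index(last)
    return [m for m in MONTHS[start:stop + 1] if m not in found]


def oversized(sizes, factor=1.8):
    """Return names whose size is at least `factor` times the smallest size.

    `sizes` maps filename -> size in bytes. The threshold is a guess meant to
    catch files holding roughly two months of data.
    """
    if not sizes:
        return []
    smallest = min(sizes.values())
    return sorted(name for name, size in sizes.items() if size >= factor * smallest)


if __name__ == "__main__":
    # Hypothetical listing, not the real contents of the archive.
    listing = ["bo845a.pn2015jan.pp", "bo845a.pn2015mar.pp", "bo845a.pn2015apr.pp"]
    print(missing_months(listing, "n", 2015, "jan", "apr"))  # ['feb']
```

In practice the listing and sizes could come from os.listdir and os.path.getsize on the archive directory; a size check only hints at merged data and is no substitute for opening the file.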

Change History (35)

comment:1 Changed 5 months ago by grenville

Leighton

REPLANCA is complaining about stash item 216 (fractions of surface types) in here (the 5 does not refer to a model section)

'$CMIP6_ANCILS/model_derived/ukesm1.0_historical_r5i1p1f3_u-az513/n96e/timeseries_1979-2014/vegetation/v1/u-az513_m01s00i216_1979-2014_annual_timeseries_land_cover_frac.anc'

I can't open this file in xconv - is this one you created?

Grenville

comment:2 Changed 5 months ago by Leighton_Regayre

Grenville,

Thanks for locating the problem ancillary. I didn't create this ancillary. I was aware it would cause problems at some stage, though had been told it would be after 2015. Mohit is coordinating the transfer of a climatology to replace this ancillary into $CMIP6_ANCILS, but it's a CMIP6 process and is taking some time.

Do you have any advice on how to proceed with the .pp file creation and transfer? Should I remove the .pp creation task and convert field files manually instead?

Thanks,

Leighton

comment:3 Changed 5 months ago by grenville

Leighton

postproc is working correctly - it doesn't archive files which it thinks have no data - see
/home/n02/n02/lre/cylc-run/u-bo845/log/job/20150301T0000Z/postproc000/01/job-archive.log for example. Many of the files in ens_0 are thought to be empty by pp.

I can't open bo845a.pn2015feb in xconv nor can cf python read the file. Have you tried to inspect the model output?

Grenville

comment:4 Changed 5 months ago by Leighton_Regayre

Hi Grenville,

You're correct that the feb field file is empty when converted to .pp, though that's not evident when looking at the filesize of the field file.

Since the February files have no data, I now think this relates to ticket #2933 (discussed again in ticket #2995). In ticket #2933 you advised I alter filename_base for output stream pp130 (monthly mean data). I used .pn rather than the default .pm.

In ticket #2995 Ros helped me set the reinitialisation period of my .pn stream and we discussed why postproc moves monthly mean files to the incorrect month folder on /nerc. I ran the model for a few months as Ros suggested and saw that the model was producing monthly mean files. However, I had no need to examine the data at that stage.

I think there is something wrong with the .pn stream. I need monthly mean data in distinct files, so that I can analyse the PPE. Do you have any idea why the Feb file is empty, yet the March file is twice as large as it should be?

Thanks,

Leighton

comment:5 Changed 5 months ago by grenville

Leighton

Not much help, but
/nerc/n02/n02/lre/rose_archiving/u-bo845_ens000/20150401T0000Z/bo845a.pn2015mar.pp

does contain data for Feb 15th and March 17th - I've no idea how this has happened.

Grenville

comment:6 Changed 5 months ago by Leighton_Regayre

Grenville,

Yes, I figured that was the case from the file size comparison. I need to be sure the monthly mean values are being calculated correctly and ideally these would go into distinct files.

I've emailed Luke Abraham to see if he has an idea of what's going on. Is there anyone else you could suggest I contact?

Thanks,

Leighton

comment:7 Changed 5 months ago by luke

Hi Leighton,

Mohit Dalvi at the Met Office might be the best person to contact as he is the nudging owner and he makes up nudged UKCA configurations, e.g. here

https://www.ukca.ac.uk/wiki/index.php/GA7.1_StratTrop_suites#TS2000_nudged_suites

and here

https://www.ukca.ac.uk/wiki/index.php/Nudged_UKESM1-AMIP

Indeed, the above link may be helpful for this, specifically point 5 regarding using Real Months. Is your suite based on one of the nudged UKESM1-AMIP jobs that he made up? If so the required changes should have been made for existing streams.

Did you make the pn stream yourself? If so, you need to make sure that the real month option is set. If you are using 30-day months you could be getting 15th Feb to 17th March, as this would be 30 days.

Regards,
Luke
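The date arithmetic behind the 30-day suggestion can be checked directly. This is only a sketch of the calendar arithmetic using Python's standard (proleptic Gregorian) dates; the function name is hypothetical and nothing here reflects how the UM itself labels or accumulates means.

```python
from datetime import date, timedelta


def fixed_length_window(start, days=30):
    """Return (start, end) for a meaning period of a fixed number of days."""
    return start, start + timedelta(days=days)


if __name__ == "__main__":
    start, end = fixed_length_window(date(2015, 2, 15))
    # February 2015 has 28 days, so a 30-day window runs past the month end.
    print(start, end)  # 2015-02-15 2015-03-17
```

The two endpoints match the Feb 15th and March 17th dates Grenville found in bo845a.pn2015mar.pp, which is consistent with a fixed 30-day period being applied under the Gregorian calendar.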

comment:8 Changed 5 months ago by Leighton_Regayre

Luke,

Thanks for the links. There are some differences in Mohit's recommendations that were not in the documentation I accessed at the time of creation. I think Mohit made his AMIP version of the UKESM release in response to postprocessing issues we were having, documented in my ticket #2933.

I'll review Mohit's updated advice and implement changes accordingly.

Cheers,

Leighton

comment:9 Changed 5 months ago by luke

Hi Leighton,

Also, looking through the time profile options, I don't think that it's possible to mean over real months. While output streams can be re-initialised every real month, the meaning can only be done in multiples of timesteps/hours/days/dump periods. Therefore all monthly-mean output from a nudged simulation must be sent to TDMPMN/UPMEAN as there is code to do the monthly-meaning correctly if these options are selected.

Thanks,
Luke

comment:10 Changed 5 months ago by Leighton_Regayre

Hi Luke,

Thanks for thinking about this further. On Friday I tested setting all monthly diagnostics of interest to use the TDMPMN and UPMEAN profiles, as indicated on the advice page Mohit set up. This conflicts with the advice I was given on cms trac ticket #2933. Following Mohit's setup, I've removed the pp130 output stream. My suite runs successfully, but no .pp files are made and there doesn't seem to be any monthly mean output. Clearly, I've done something wrong, but I've double-checked Mohit's advice page (2nd link in comment 7) and everything seems correct to me.

Any additional advice would be very welcome as I have no idea how best to proceed here.

Thanks,

Leighton

comment:11 Changed 5 months ago by luke

I'm afraid that I can't comment on postproc usage/settings on ARCHER.

In terms of your suite - where are you looking for the .pm file? In work/, shared/ or on the /nerc disk? If the latter, is it appearing in the cylc-run/ directory structure?

One possible test - if you take a copy of Mohit's u-bm251 from

https://www.ukca.ac.uk/wiki/index.php/Nudged_UKESM1-AMIP

can you run it and produce a pm output file? If so, I'd suggest diffing the suites and looking for differences in how the files are handled in the UM and/or how the postproc app is configured.

Thanks,
Luke

comment:12 Changed 5 months ago by Leighton_Regayre

Hi Luke,

There aren't any .pm files, nor .pp files created in either /work or /nerc after following the advice on Mohit's nudging setup page. I had copied Mohit's u-bb210 suite following comment 7 here, which uses vn11.0 (not 11.1) and is set up for use on ARCHER, but unfortunately this is a GA7.1 suite so has many differences with my UKESM1-A suite. Is u-bm251 a more suitable suite for comparison?

Thanks,

Leighton

comment:13 Changed 5 months ago by luke

Hi Leighton,

u-bm251 is a nudged UKESM1-AMIP suite at vn11.1 so it is likely closest to your current set-up.

I would suggest just seeing if this job works for you without any changes (other than as needed for usernames, paths etc.). If it does, then differences with your current suite can be investigated.

Thanks,
Luke

comment:14 Changed 5 months ago by Leighton_Regayre

Hi Luke,

I'm not able to run jobs at the moment because of an ARCHER node issue (#3082), though I've compared my suite to Mohit's and can see some fundamental differences in how the postprocessing climate meaning is set up.

Firstly, u-bm251 has climate meaning switched on and pointing to the .pm stream. I had this on, but was still pointing to the .pn stream, which was made in response to ticket #2933. Secondly, in the postproc/Archive integrity/Atmosphere panel, u-bm251 has pp_climatemeans on. Mine was off. I've added a "1m" meanstream to get monthly data and set the mean_reference_date to 20150101 to match my restart file. Hopefully this is correct. I'll test as soon as the ARCHER login node problem is fixed.

Thanks again,

Leighton

comment:15 Changed 5 months ago by grenville

Leighton

I can't recall the exact circumstances of the original query; I do remember issues with pp base streams which prompted the suggestion to go to STASH meaning, but somewhere along the line the Gregorian calendar got forgotten. Others (Mohit) have recognized the issue and solved it, thankfully.

Grenville

comment:16 Changed 4 months ago by Leighton_Regayre

Hello,

You're right Grenville, we were working on these output streams before Mohit tackled the task. I made a lot of other changes to my suite so was reluctant to start again with Mohit's UKESM-AMIP release version. That would be the sensible approach for anyone starting to set up a suite like this. However, with much help from Mohit, I now have monthly mean data produced using the climate_means process, in the .pm stream.

One remaining concern I have is that the .pm field files aren't processed by the postprocessing task and hence aren't copied to /nerc. Since I'm going to be submitting a large number of simulations at once, it's preferable to have the .pp files from the .pm stream made and moved automatically. My postprocessing task is set up identically to Mohit's UKESM-release job u-bm251, which apparently created the .pp files correctly.

I was wondering if the order of the create_means code and the postprocessing task had been changed. Any other ideas?

Thanks,

Leighton

comment:17 Changed 4 months ago by grenville

Leighton

Feels like we are going round in circles - weren't the pm stream and post-proc the initial problem? We're looking at this - the post-proc guru is out 'til next week.

Grenville

comment:18 Changed 4 months ago by luke

Grenville, Leighton,

For the most recent iterations of Mohit's nudged configurations of the UKESM1-AMIP jobs, the use of climate meaning to produce monthly-means is essential. It is not possible to do monthly-means by any other method. I believe that this is because STASH output streams cannot work with "real months" correctly, whereas climate means can.

Thanks,
Luke

comment:19 Changed 4 months ago by luke

Or rather, the output files can cope with "real months", but the time-meaning profiles cannot.

comment:20 Changed 4 months ago by Leighton_Regayre

Hello,

Thanks for clarifying Luke. I'm happy with the monthly mean output I'm getting from the climate means process. No more going in circles on this for me.

Grenville, can you clarify what you mean by post-proc guru is out til next week? I've had pp files from my .pm (climate means) stream created and transferred to /nerc in the early hours of this morning. Other PPE members in the same suite have failed the postproc task because of a submission error.

Thanks,

Leighton

comment:21 Changed 4 months ago by Leighton_Regayre

Also, I'm wondering if turning off the post-proc task will affect the climate_means calculation?

comment:22 Changed 4 months ago by grenville

Hi Leighton

I was referring to your comment "One remaining concern I have is that the .pm field files aren't processed by the postprocessing task and hence aren't copied to /nerc" - but I see it's no longer a concern.

You appear to have post-proc climate meaning switched on in addition to UM internal climate meaning. The "help" for create_monthly_mean says: Please leave as "False" if STASH-produced monthly means are to be archived as a.pm. That is not what you have, which is another reason to have "guru" help. Of course, it could just be that the help message is wrong.

It's not clear what would happen if post-proc is switched off, given the way the suite is configured.

Grenville

comment:23 Changed 4 months ago by ros

Hi Leighton,

I've been watching your u-bo845 suite today and it looks like it is correctly archiving the .pm files - I've just looked at 087, 088 & 089…. As far as I can see postproc is working ok. What other files are you expecting to be archived??

Cheers,
Ros.

comment:24 Changed 4 months ago by Leighton_Regayre

Hi ros,

Thanks for watching the suite progress.

This postproc issue was discussed in #3096 so if you think it's useful to close that ticket and continue here I don't mind.

The suite u-bo845 was submitted yesterday with ensemble members 87-113 scheduled to run for 2 months each. The postproc task is working a little better than last week. Previously, the postproc task only transferred some of the output and only for some ensemble members. It rarely made .pp files for the monthly mean output from the pm stream for Gregorian calendar months. This week when the postproc task works, it is making the .pp files correctly and transferring them to /nerc. This is a great improvement.

There is still an issue with the postproc task failing for some ensemble members, e.g. no folders were made on /nerc for ensemble members 92, 93, 95, 96, 102, 104, 108 and 110. However, the postproc task failed to submit for these ensemble members, so I think this part of my problem is likely related to the submission system, which is a local issue.

Thanks,

Leighton

comment:25 Changed 4 months ago by ros

Hi Leighton,

I continued the discussion here because all the previous discussion regarding postproc for this suite is in this ticket. #3096 relates to a request for quota increase.

Anyway, I'll take a look at the failed postprocs and see what the problem was.

Cheers,
Ros.

comment:26 Changed 4 months ago by ros

Hi Leighton,

Ok so the problem with the submission of some of the postproc tasks is simple; you're exceeding the per-user limit on the number of jobs one person can submit to the serial queue - see message in the task's job-activity.log. In the serial queue you can only have 12 jobs in the queue per user.

The easiest thing is to change the existing queue limit in the [[queues]] section of your suite.rc file so that the limit is 12. Then do a rose suite-run --reload to pick up the change.

Hopefully that will solve all the problems.

Regards,
Ros.

comment:27 Changed 4 months ago by Leighton_Regayre

Hi Ros,

Thanks! I was aware that the serial queue limit was 12 for postproc but didn't know where to change it. I take it changing this limit will also reduce the number of atmos_main tasks that can be run in parallel since these tasks are completed sequentially. Is that correct?

Thanks again,

Leighton

comment:28 Changed 4 months ago by ros

Hi Leighton,

Yes, it will reduce the number of atmos_main tasks as well.

Cheers,
Ros.

comment:29 Changed 4 months ago by grenville

Leighton

The log files indicate that your pumatest quota has been exceeded - configure the suite to not retrieve log files (something like this)

    [[HPC]]
        [[[remote]]]
            host = $(rose host-select archer)

How long do you expect your runs to take? You only have 'til Feb 18 to move everything that needs saving off ARCHER/RDF.

Grenville

comment:30 Changed 4 months ago by grenville

missed the important line

    [[HPC]]
        [[[remote]]]
            host = $(rose host-select archer)
            retrieve job logs = False

comment:31 Changed 4 months ago by Leighton_Regayre

Hi Grenville,

Thanks. I've added the line you suggest to my site/archer.rc file, so will run some ensemble members with this today. I've been deleting the log files every time to make space on pumatest.

I'm aware of the Feb 18th deadline. What I'm doing is the 1st phase of actually making the PPE. There'll be an interlude for a science phase, then I'll be working flat out to get the 2nd PPE creation phase complete. How long it takes will depend on queue times, job failures, etc. You'll know from last time we made a PPE it's not a straight-forward process. The UKESM1 seems to be more stable than the UM8.4 version we used for the last PPE, in that it's coping with all perturbed parameter combinations without complaint so far.

Cheers,

Leighton

comment:32 Changed 4 months ago by Leighton_Regayre

Hi Ros,

Regarding the queue limits:

Chris Symonds in our CEMAC group at Leeds has suggested I could have 16 atmos_main jobs running on the parallel queue and 12 jobs running on the serial queue if I added the following to my suite.rc file:

{% if SITE == 'archer' %}
    [[queues]]
        [[[parallel_queue]]]
            limit = 16
            members = ATMOS_RESOURCE
        [[[serial_queue]]]
            limit = 12
            members = POSTPROC_RESOURCE, PPTRANSFER_RESOURCE, HOUSEKEEP_RESOURCE, SUPERMEAN_RESOURCE, LOGS_RESOURCE, WALLCLOCK_RESOURCE
{% endif %}

Does that seem a sensible approach?

Thanks,

Leighton

comment:33 Changed 4 months ago by ros

Hi Leighton,

Yes you can set up as many internal queues as you like. I just didn't suggest it as I didn't think it would really gain you that much.

Cheers,
Ros.

comment:34 Changed 4 months ago by Leighton_Regayre

Hi Ros,

Thanks for the quick response. The postproc tasks are relatively quick, so a queue limit of 12 won't affect them. Increasing my queue limit for atmos_main tasks from 12 to 16 is a big deal though, since I have a few hundred simulations to run. I'll go ahead and implement the discrete queue limits.

Thanks again,

Leighton
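To give a rough sense of what the higher atmos_main limit buys, the number of back-to-back "waves" of tasks can be estimated. This is a back-of-envelope sketch under loud assumptions: 200 tasks is an illustrative figure (the ticket only says "over 200 ensemble members"), every task is assumed to take the same time, the internal queue is assumed to stay full, and ARCHER scheduler wait times are ignored.

```python
import math


def waves(n_tasks, queue_limit):
    """Number of sequential batches needed if at most queue_limit tasks run at once."""
    if queue_limit <= 0:
        raise ValueError("queue_limit must be positive")
    return math.ceil(n_tasks / queue_limit)


if __name__ == "__main__":
    n_tasks = 200  # assumed figure for illustration only
    print(waves(n_tasks, 12))  # 17
    print(waves(n_tasks, 16))  # 13
```

Under these assumptions the limit of 16 removes about four waves of atmos_main runs per pass through the ensemble; real gains depend on queue times and job failures.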

comment:35 Changed 4 months ago by ros

  • Resolution set to completed
  • Status changed from new to closed