Opened 5 months ago

Closed 4 months ago

Last modified 4 months ago

#2464 closed help (answered)

Postprocessing for AMIP suite on NEXCS

Reported by: charlie
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi again,

Right then, a totally different question this time and nothing to do with JASMIN, hence the new ticket.

I am almost ready to begin running some new atmosphere-only experiments (GA7.1), using the following suite: u-aw739. This is a modern AMIP suite, which has been modified by myself and others to contain Eocene (~50 Mya) ancillary files. It is currently on Monsoon (or is it Monsoon2, or NEXCS?), not PUMA.

I note that, at present, postprocessing is set to put everything on MASS under :crum. However, from my previous conversations with you, I gather I don't want to do this? The project account I am using is n02-nexcs, which is where we have allocated resources.

Therefore, please would you advise on what exactly I need to do to set up where my output (both immediate and post-processed) goes?

Many thanks,

Charlie

Attachments (2)

Eocene Ancil Version File for N96 - CW.docx (13.3 KB) - added by charlie 4 months ago.
Ancil Version File for MH & EH.docx (13.1 KB) - added by charlie 4 months ago.


Change History (98)

comment:1 Changed 5 months ago by grenville

Charlie

The details are presented here:

http://cms.ncas.ac.uk/wiki/Docs/PostProcessingApp

I'd urge you to set this up and test it thoroughly before setting off any long runs.

Grenville

comment:2 Changed 5 months ago by charlie

Hi Grenville,

Many thanks, I will work through these instructions tomorrow. Before I do, though, a couple of questions:

Firstly, how do I know whether I need to upgrade to the ARCHER/NEXCS postprocessing release version? As I said, my suite is currently a GA7.1 suite and is archiving data to MASS, so do I still need to go through this upgrade process?

Secondly, the instructions say to change the archive_root_path to /projects/nexcs-n02/<username> but I'm not sure I have a username here yet? What do I need to do to get an account here?

Lastly, in terms of transferring the data to JASMIN, the instructions say to use one of the group workspaces on JASMIN. I have now been granted access to the nexcs workspace, so is this the one I use? If so, how much storage is available to me here? To begin with, I will just be running a couple of test runs (e.g. a year, just to see if everything works) but imminently I will be running some longer simulations - at least several, each of which will be 50 years or so in length. Do I need to worry about filling up the space?

Thanks,

Charlie

comment:3 Changed 5 months ago by ros

Hi Charlie,

The ARCHER/NEXCS postprocessing went in at postproc_2.2, which is not tied to any UM version/configuration. So if your suite is not using this version you will need to upgrade. Look at the fcm_make_pp/postproc apps and you will see what version is being used by the path to the metadata. If that makes no sense just run the upgrade command as instructed; if your suite is already at the correct version it will do nothing.

If you have authorisation to use the nexcs-n02 budget then you can create a projects directory /projects/nexcs-n02/<username>.

As to your last question, I don't know what space is available under JASMIN nexcs workspace. I'll let Grenville answer that one when he returns tomorrow.

Regards,
Ros.

comment:4 Changed 5 months ago by charlie

Thanks Ros.

I have now worked through the upgrade process (because previously it was set to version 2.0) and, much to my surprise, both apps have upgraded successfully. I have also successfully created my directories, on both NEXCS and the JASMIN NEXCS group workspace.

So, I have now followed the rest of the instructions at the above website, including filling in the JASMIN transfer section. One question here: what should the remote host be - jasmin-xfer1.ceda.ac.uk or jasmin-xfer2.ceda.ac.uk? I know that the first is open to everybody (but is slower) and the 2nd needs to be registered for, so is it already registered from NEXCS or do I need to do something here myself?

So, now that this is done, I am about to begin a 2 year test run. Before I do that, however, is there an ideal set of processes for a GA7.1 suite on NEXCS (I know you did lots of testing for Archer, but what about this?)? Or is it just a case of trial and error? At the moment I have the following:

Reconfiguration
East-West = 4, North-South = 8, Open-MP = 1, Hyperthreads = 1

Atmosphere
I/O server = 0, East-West = 16, North-South = 28, Open-MP = 1, Hyperthreads = 1

This sounds about right to me (although it is less than my GC3.1 suite), or do you think it should be higher?

Charlie

comment:5 Changed 5 months ago by charlie

Further to my last message, I have just read the page at http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppNexcsSetup - I haven't done these steps yet, but would you be able to talk me through step 3 here, as the example is for a coupled suite whereas my suite is atmosphere only? Also, at the bottom of this page it says that "some setup is required to enable non-interactive authentication from NEXCS to JASMIN" and that I should email you about this, so please would you advise?

comment:6 Changed 5 months ago by ros

Hi Charlie,

You can apply for access to the jasmin-xfer2 here:

http://www.jasmin.ac.uk/services/high-performance-data-transfer/

NEXCS is whitelisted so don't worry about supplying an IP address in the application.

ARCHER and XCS are similar machines so the same setup should be fine - just remember that there are 32 cores per node on XCS so make sure the decomposition is divisible by 32 rather than 24.

Please give me write access to your suite and I'll add the relevant stuff into the .rc files, as this suite is set up totally differently from the example. Add the following to the top of the rose-suite.info file:

access-list=charliewilliams rosalynhatcher

And then fcm commit the suite.

I'll send you the instructions for setting up the authentication from NEXCS to JASMIN by email.

Cheers,
Ros.

comment:7 Changed 5 months ago by charlie

Hi Ros,

Excellent, very many thanks.

Firstly, I already have access to the high-performance data transfer, because I applied for it several months ago. However, this wasn't from NEXCS, but rather from our home server (um1), because at the time I was transferring data from our home servers to JASMIN for archiving. Under my services on the JASMIN website, I am still registered for this service (and will be until 2027!) so do I need to apply again but from a different machine?

Secondly, I have changed the processes as you advised, so that they are now

Atmosphere
I/O server = 6, East-West = 24, North-South = 28, Open-MP = 2, Hyperthreads = 1

I thought this was ok, because 24*28 = 672, which /32 = 21. Or do I need to multiply all of the processes together, in which case 6*24*28*2*1 = 8064, which /32 = 252. So either way they are divisible by 32.
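The arithmetic above can be checked with a short sketch. It only applies the rule as discussed in this ticket (a process count should divide evenly by the 32 cores per XCS node); the function name is hypothetical, and this is not a description of how the UM itself works out its resource request.

```python
CORES_PER_NODE_XCS = 32  # from comment 6: 32 cores per node on XCS


def divides_evenly(total, cores_per_node=CORES_PER_NODE_XCS):
    """Return (whole_nodes, is_divisible) for a given process count."""
    return total // cores_per_node, total % cores_per_node == 0


# Settings quoted in comment 7
io_server, east_west, north_south, open_mp, hyperthreads = 6, 24, 28, 2, 1

# The two ways of counting mentioned in the comment
decomposition_only = east_west * north_south
everything = io_server * east_west * north_south * open_mp * hyperthreads

print(decomposition_only, divides_evenly(decomposition_only))
print(everything, divides_evenly(everything))
```

Both counts come out as whole numbers of nodes (21 and 252), which matches the figures in the comment.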

Lastly, I have now added that line into my rose-suite.info and have committed the suite, so you should be able to access it.

Very many thanks,

Charlie

comment:8 Changed 5 months ago by ros

Hi Charlie,

If you already have access to the high performance data transfer node you don't need to do anything else as NEXCS is already registered with JASMIN.

Yes, you multiply all the processors together.

Yes I confirm I now have write access to your suite. I'll let you know when I've done.

Cheers,
Ros.

comment:9 Changed 5 months ago by ros

Hi Charlie,

I've modified the suite to add pptransfer to the graph. It was simpler than it looked! Give that a whirl and we'll see what happens.

Cheers,
Ros.

comment:10 Changed 5 months ago by charlie

Hi Ros,

It was too good to be true to expect my suite to run straight away - it failed yesterday evening at the reconfiguration stage. I have taken a look at the error file, and as always there are lots of warnings (including the best ever error I have seen, "info: Terminal type `dumb' is not smart enough to run Info") but I'm not sure what the problem is. At the end of the file is:

lib-4611 : UNRECOVERABLE library error

Missing opening (left) parenthesis in format.

Encountered during a sequential formatted WRITE to an internal file (character variable)
Application 27382592 is crashing. ATP analysis proceeding…
atpFrontend.exe: main: retrieveRawMBT:: recv of BT_HERE_IS_BACKTRACE failed

atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 06081] [c3-2c2s0n1] [Thu May 17 18:03:35 2018] PE RANK 0 exit signal Aborted
[NID 06081] 2018-05-17 18:03:35 Apid 27382592: initiated application termination
[FAIL] um-recon # return-code=134
2018-05-17T18:03:36Z CRITICAL - Task job script received signal EXIT

But what does this mean?

Charlie

comment:11 Changed 5 months ago by ros

Hi Charlie,

I'm guessing this is caused by a badly formatted write statement somewhere in the UM where it's trying to write an error message about a missing ancillary file. Unfortunately ATP didn't give a backtrace so can't easily see where. :-(

However, if you look in the job.out file at the bottom you will see that the suite failed to find the ancillary file: /projects/ukesm/wilro/cylc-run/u-ar826/share/data/n96e_orca025_go6/orography/globe30/qrparm.orog. Hopefully fixing the path to this ancillary file will get past the error message above.

This suite is set up dangerously in my opinion, as it is relying on files in someone else's cylc-run directory. If they rerun this suite, files are likely to change. I would suggest talking to the owner of the suite that you copied to (a) find the file that is missing and (b) check if there is a better location you should be referencing.

Regards,
Ros.

comment:12 Changed 5 months ago by charlie

Thanks Ros. I was actually aware of this dodgy setup, because I would always prefer to have all of the non-standard ancillary files in my own directory, so I know exactly where they are and what's happening to them. I spoke to the original owner about this a while ago, about whether it was worth me copying all of his modified ancillaries (which are needed in the suite), and he thought that wasn't necessary because he wouldn't change any of them. However, perhaps I should, just in case.

Presumably it's just a case of looking in the ancillary versions file, seeing which ones are in his directory (i.e. which ones he has modified) and copying these? I don't need to move all the ancillaries, do I (i.e. the ones he hasn't touched, which are still in the standard directories)?

comment:13 Changed 5 months ago by ros

Presumably it's just a case of looking in the ancillary versions file, seeing which ones are in his directory (i.e. which ones he has modified) and copying these? I don't need to move all the ancillaries, do I (i.e. the ones he hasn't touched, which are still in the standard directories)?

Yes, that should be right.

Cheers
Ros.

comment:14 Changed 5 months ago by charlie

Okay, many thanks. I have now copied all of the non-standard ancillary files (i.e. the ones he modified) into my own directories, and have changed the locations in the ancillary versions file. The problem was that it was looking for qrparm.orog which, although the file was present in his directory, the name had been changed slightly to qrparm.orog_1. So I have copied this one into my own directory, along with all the others, and have changed the name back to qrparm.orog so hopefully it will find it okay.

I have now re-submitted my job, so will let you know what happens…

comment:15 Changed 5 months ago by charlie

Hi Ros,

New error this time, at the atmos_main stage. I don't seem to have an error file or a log file this time; however, within the activity log I have found the following problem:

(xcs-c) 2018-05-18T21:48:44Z [STDERR] qsub: Job violates queue and/or server resource limits
[(('event-mail', 'submission failed'), 7) ret_code] 0

Is this the reason it has failed, and if so what does this mean? I thought I was using the correct project code and the correct queue, but perhaps not.

Charlie

comment:16 Changed 5 months ago by grenville

Charlie

this is the problem:

#PBS -l walltime=18000

#PBS -q normal

the normal queue only allows jobs up to 14400 seconds - please see the Queues section of https://collab.metoffice.gov.uk/twiki/bin/view/Support/NEXCS
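A minimal sketch of the check that qsub is effectively making here. Only the normal queue limit (14400 seconds) is stated in this ticket; the long12 and long24 values below are assumptions based purely on the queue names, and the function name is hypothetical. The NEXCS Queues page linked above is the authoritative source.

```python
# Queue limits in seconds. Only the "normal" limit is stated in this ticket;
# the long12/long24 values are assumptions inferred from the queue names.
QUEUE_LIMITS = {"normal": 14400, "long12": 12 * 3600, "long24": 24 * 3600}


def check_walltime(queue, walltime_seconds, limits=QUEUE_LIMITS):
    """Return None if the request fits, otherwise a short explanation."""
    limit = limits[queue]
    if walltime_seconds > limit:
        return f"walltime {walltime_seconds}s exceeds the {queue} limit of {limit}s"
    return None


print(check_walltime("normal", 18000))  # the request that was rejected
print(check_walltime("normal", 14400))  # a request at the limit
```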

Grenville

comment:17 Changed 5 months ago by charlie

Many thanks, and sorry I didn't realise that. I have now changed my wall time to 4 hours, 14400 seconds, and have resubmitted, so fingers crossed.

I changed this within the rose editor, but am I correct in thinking that it is also specified in ~/roses/<suiteID>/rose-suite.conf? If so, what are the other versions of this file for, e.g. rose-suite.conf_archer, rose-suite.conf_meto_cray, rose-suite.conf_monsoon etc? Or are these just included as standard, depending on which machine I am on? Given that I am running it on nexcs, which one do I want?

comment:18 Changed 5 months ago by ros

Hi Charlie,

Yes they are the same thing. The rose editor is just a GUI for editing the various .conf files so you could have just changed the walltime in the rose-suite.conf file if you wanted.

The other versions of this file are just examples of the rose-suite.conf file for other platforms. So in theory, if you changed to run on ARCHER for example, you could copy the rose-suite.conf_archer file to rose-suite.conf to save you manually changing some of the standard platform settings, but I wouldn't like to say how up-to-date they actually are! Anyway you can ignore these for what you are doing.

Cheers,
Ros.

comment:19 Changed 5 months ago by charlie

Hi, I left my suite running last night, but having checked today everything seems to be frozen. Is there something wrong with monsoon today? I don't think I missed a maintenance email, but perhaps I did. Either way, the cylc window has frozen, and if I try to log out and log in again, to either xcs or exvmsrose, it just freezes. So I have no way of knowing if my suite ran, or indeed how far through its two-year test run it got.

What I can tell, however, is that nothing could be transferred to Jasmin, to my specified nexcs area.

Charlie

comment:20 Changed 5 months ago by ros

Hi Charlie,

Every Tuesday 9:00-11:00 is reserved as a maintenance window for updates and patches; there may be unadvertised disruption at these times.

Cheers,
Ros.

comment:21 Changed 5 months ago by charlie

Thanks Ros, and sorry I realise that. I will give it half an hour and then try again. But, as I said, something can't be exactly right, because surely if my two-year run had worked, it would have transferred the output to Jasmin? It hasn't.

comment:22 Changed 5 months ago by charlie

  • Didn't realise that

comment:23 Changed 5 months ago by ros

Hopefully you've just seen the Yammer alert that's been posted:

"Issues with XCSC - Tuesday 22nd May 2018 11:34
There seems to be an issue with logging onto XCSC and ROSE.
This is being investigated."

comment:24 Changed 5 months ago by ros

Hi Charlie,

Sorry I only added the relevant bits to the *.rc files (points 3 & 4); I thought you had done points 1 & 2 - that would be why it hasn't worked!

Please do points 1 & 2 to add in the controlling PPTRANSFER variable:
http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppNexcsSetup

When you submit the suite you should see a separate pptransfer task in the GUI which will run after postproc. If you don't then the suite is still not set up correctly.

Cheers,
Ros.

comment:25 Changed 5 months ago by charlie

Hi Ros,

Sorry for the delay, I was in Bristol all day yesterday.

I have now completed steps 1 & 2 as instructed, no problem.

Having looked at my output on NEXCS (at ~/cylc-run/u-aw739/share/data/History_Data), however, although there is some output, it has clearly not run for 2 years. There is only one month's worth of data, and it is August. So, 2 questions:

Firstly, given that my model basis time is 19880901 i.e. September, why is my first output month August? I have tried looking at the actual start dump within ainitial, but this is pointing to the environment variable '$AINITIAL_N96' - where is this defined, so I can look at the actual start dump?

Secondly, what caused it to fall over? I don't seem to have any output and error log files (i.e. ~/cylc-run/u-aw739/log/job/<cycle>/<app>/NN/job.out or job.err), in fact I don't seem to have a ~/cylc-run/u-aw739/log directory at all.

Am I missing something?

Charlie

comment:26 Changed 5 months ago by charlie

Sorry, further to my last message, I do have ~/cylc-run/u-aw739/log/job/<cycle>/<app>/NN/job.out and job.err files.

comment:27 Changed 5 months ago by grenville

Much of the data has been post-processed - please look in /home/d05/cwilliams/cylc-run/u-aw739/log/job/19900301T0000Z/postproc/01/job.out - data has been converted and moved, for example:

Archiving /home/d05/cwilliams/cylc-run/u-aw739/share/data/History_Data/aw739a.da19900301_00 to /projects/nexcs-n02/cwilliams/sweet/ga71/u-aw739/19900301T0000Z

What shows in the cylc window? It appears that the model has run successfully. There are four 6-month cycles in /projects/nexcs-n02/cwilliams/sweet/ga71/u-aw739.

You can simply grep -r in the roses suite directory to find $AINITIAL_N96
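For anyone who prefers to do the search from Python, here is a minimal sketch of the same idea as grep -r. The function name is hypothetical, and the suite path is the working copy mentioned elsewhere in this ticket. Note that the definition itself is likely to appear without the leading $, so the sketch searches for the bare name.

```python
import os


def find_in_tree(root, needle):
    """Yield (path, line_number, line) for every line containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and carry on


if __name__ == "__main__":
    suite_dir = os.path.expanduser("~/roses/u-aw739")
    for path, number, line in find_in_tree(suite_dir, "AINITIAL_N96"):
        print(f"{path}:{number}: {line}")
```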

I am not quite sure why there are only two cycles in the log directory (there are 4 in the work directory).

Grenville

comment:28 Changed 5 months ago by charlie

Thanks Grenville, and sorry, I temporarily forgot the output went here. So do you think, given that I ran it for 2 years starting in September 1988, that it ran successfully? If so, presumably now that I have turned on PPTRANSFER, it should automatically send all of this output over to my specified directory on Jasmin? Does it automatically remove the output on NEXCS, or do I need to do this manually?

And yes - why are there 4 cycles in this directory (which is expected, all containing 6 months), but only 2 in the log directory? And why is there only one month in each, August for the first, for example?

Lastly, in my output work directory, there appear to be all the usual output files, e.g. daily etc, but there don't appear to be any monthly (*pm) or annual means (*py). I thought I had these being output, based on the standard stash table I am using, but perhaps not?

comment:29 Changed 5 months ago by grenville

Charlie

Your climate meaning sequence is 3,3,3,4 with 30-day dumping — that should be 1,3,4,10 (monthly, seasonal, yearly, decadal) for 30-day dumping.

Please try that.

Grenville

comment:30 Changed 5 months ago by charlie

Hi Grenville,

Thanks. According to my dumping and meaning sequence (in um > namelist > Model input and output > Dumping and meaning), it wasn't 3,3,3,4 as you say above but rather was 3,3,4,10. Are you looking somewhere else? Either way, I have now changed to 1,3,4,10, as you advised, and have resubmitted my suite.

However, and further to comment 24 above, even though I have completed pptransfer points 1 and 2, I cannot see the separate pptransfer task in the cylc GUI after postproc, as Ros said I should. It just goes straight from postproc to supermean. Ros said she had completed the other steps and had changed my suite.rc as appropriate, but if I look in ~/roses/u-aw739/suite.rc I can't see any mention of pptransfer, which should be there according to the instructions?

Does this mean the automatic transfer to Jasmin is still not set up correctly?

Charlie

comment:31 Changed 5 months ago by ros

Hi Charlie,

I presume after I committed the changes to the suite that you then re-checked it out from the repository to pick up the changes?

If so, can I just confirm that you have submitted the suite to run from the beginning again? ie. rose suite-run or have you just restarted it? If you didn't checkout the suite again that would be why it's still not working. Try running fcm update in the rose suite working copy. This may or may not work if you have since made local changes.

Cheers,
Ros.

comment:32 Changed 5 months ago by charlie

Hi Ros,

No, sorry, I didn't do that after you made the changes. Sorry, my mistake.

I have made a couple of little changes since then, so what should I do with FCM to both pick up your changes and my more recent changes? Shall I just fcm commit again, or will this not pick up your changes?

Charlie

comment:33 Changed 5 months ago by ros

Hi Charlie,

Try running fcm update. I think it should cope ok, as the changes shouldn't clash.

fcm commit won't work as your copy is out of date with the repository.

Give the above a try and let me know if it doesn't work. It should tell you what it's doing and then just confirm the pptransfer has appeared in the suite.rc file. Happy to take a look at the result to confirm if you want.

Cheers,
Ros.

comment:34 Changed 5 months ago by charlie

Okay, done that. Result as follows:

[cwilliams@exvmsrose:~/roses/u-aw739]$ fcm update
update: status of ".":
M 78239 meta/rose-meta.conf
* 78239 suite.rc
M 78239 app/um/rose-app.conf
M 78239 rose-suite.conf
update: continue?
Enter "y" or "n" (or just press <return> for "n"): y
Updating '.':
U suite.rc
Updated to revision 79387.

Does this look right?

If so, and just to clarify: if I make any more little changes, do I need to fcm commit or update before resubmitting?

comment:35 Changed 5 months ago by ros

That's looking good. I've just checked the file and it looks fine.

You only need to fcm commit when you want to save your changes to the repository. Whenever you do a run it just picks up what is in your local ~/roses/u-aw739 working copy.

Give it another whirl and see what happens. If pptransfer doesn't appear in the GUI when you start the run something is still off.

Cheers,
Ros.

comment:36 Changed 5 months ago by charlie

Okay, I now can't get it to submit at all! Whatever I do, either within the GUI or at the command line, it's telling me my suite is still running. I have tried shutting this down, and even checking whether there are any existing processes, but nothing works:

[cwilliams@exvmsrose:~/roses/u-aw739]$ rose suite-shutdown
Really shutdown u-aw739 at exvmscylc.monsoon-metoffice.co.uk? [y or n (default)] y
[cwilliams@exvmsrose:~/roses/u-aw739]$ cylc stop 'u-aw739'
[cwilliams@exvmsrose:~/roses/u-aw739]$ ps -flu cwilliams | grep u-aw739
0 S 41090 15169 26251 0 80 0 - 26343 pipe_w 12:06 pts/85 00:00:00 grep u-aw739
[cwilliams@exvmsrose:~/roses/u-aw739]$ rose suite-run
[FAIL] Suite "u-aw739" appears to be running:
[FAIL] Contact info from: "/home/d05/cwilliams/cylc-run/u-aw739/.service/contact"
[FAIL] CYLC_SUITE_HOST=exvmscylc.monsoon-metoffice.co.uk
[FAIL] CYLC_SUITE_OWNER=cwilliams
[FAIL] CYLC_SUITE_PORT=43004
[FAIL] CYLC_SUITE_PROCESS=16827 /usr/bin/python /data/local/fcm/cylc-7.6.0/bin/cylc-run u-aw739
[FAIL] Try "cylc stop 'u-aw739'" first?

comment:37 Changed 5 months ago by ros

The suite processes run on exvmscylc, so log on there and check with ps -flu cwilliams | grep u-aw739.

You may also need to delete the contact file: ~/cylc-run/u-aw739/.service/contact

Then the suite will be able to start.
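Before deleting the contact file it is worth reading what it says, so the right host and process can be checked first. The sketch below parses the KEY=VALUE lines shown in the rose suite-run error above; treating the contact file as plain KEY=VALUE text is an assumption based on that output, and the function name is hypothetical.

```python
def parse_contact(text):
    """Parse KEY=VALUE lines, as seen in the rose suite-run error output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore any line without an "="
            info[key.strip()] = value.strip()
    return info


# Values copied from the error message in comment 36
example = """CYLC_SUITE_HOST=exvmscylc.monsoon-metoffice.co.uk
CYLC_SUITE_OWNER=cwilliams
CYLC_SUITE_PORT=43004
CYLC_SUITE_PROCESS=16827 /usr/bin/python /data/local/fcm/cylc-7.6.0/bin/cylc-run u-aw739
"""

info = parse_contact(example)
pid = info["CYLC_SUITE_PROCESS"].split()[0]
print(f"check for process {pid} on {info['CYLC_SUITE_HOST']} before deleting the contact file")
```

If no such process exists on that host, the contact file is stale and removing it should be safe.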

comment:38 Changed 5 months ago by charlie

Great, very many thanks. I didn't realise the actual processes ran on a separate machine, but I will remember that from now on. I have now resubmitted my job, and the good news is pptransfer now appears in the list after postproc. I have set it to run for another 2 years, and will let you know if it succeeds and transfers everything to Jasmin as it should….

comment:39 Changed 5 months ago by charlie

Hi Ros,

Okay, it seems to have queued all evening yesterday, but then fallen over almost straightaway at the atmos_main stage. This is strange, because the only thing that has changed since I ran it last week (when it worked and ran for 2 years) is the addition of the pptransfer. But it didn't even get to this stage. The only other thing I changed was the dumping and meaning frequency, but this is what Grenville told me to do in order to get monthly and yearly means (which I didn't get last week) - see comments 29 and 30. Even having looked at the error file, I can't see exactly what went wrong?

Charlie

comment:40 Changed 5 months ago by ros

Hi Charlie,

The error message in the job.err file is:

?  Error code: 1
?  Error from routine: INITIAL_4A
?  Error message: INITMEAN: Invalid atmos mean frequency
?  Error from processor: 506
?  Error number: 17

I believe a change went in somewhere in the mid 10.x releases that means you can't have the meaning combination you currently have set. To get the monthly means you will need to change the dumping frequency to every 10 days and then set the climate meaning periods to 3,3,4,10.

Cheers,
Ros.

comment:41 Changed 5 months ago by charlie

But that's what it was set to! Or, at least, that's what Grenville said it was set to - see comment 29, above. When I checked in the Dumping and meaning window within Rose, it was actually set to 3,3,4,10 (see comment 30, above), so was Grenville looking somewhere else to get 3,3,3,4?

Either way, he told me to change it to 1,3,4,10 which I did.

If I now change it back to 3,3,4,10 which it was before, this doesn't seem to give me any monthly or annual means (i.e. my output doesn't contain any *pm or *py files).

comment:42 Changed 5 months ago by ros

Grenville said to change it to 1,3,4,10 because you have 30 day dumping set.

If you change the dumping to every 10 days (slightly further up the same panel) then the climate meaning periods need to be 3,3,4,10 to give you 3x10days (monthly), 3xmonthly (seasonal), etc.

Cheers,
Ros.

comment:43 Changed 5 months ago by charlie

Okay, I have just done that, so dumping frequency is set to 10 days and then using 3,3,4,10. I have just resubmitted, so fingers crossed.

And I think I have finally understood how the dumping frequency works - am I right in thinking that the order of the entry boxes is month, season, year, decade, then for the first box it is whatever number multiplied by the dumping frequency and then for the other boxes it is whatever number multiplied by whatever the box represents? So for the first box, 3*10 days = 30 days = month, then for the 2nd box 3*month = season, then for the 3rd box 4*season = year and finally for the last box 10*year = decade. Is that right?!

comment:44 Changed 5 months ago by ros

Yes, that is absolutely correct. :-)
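The rule described in comment 43 can be written out as a short sketch. It assumes 30-day months (so 3 x 10 days counts as a month), and the function name is hypothetical; it only does the multiplication and says nothing about which combinations a given UM version will accept.

```python
def meaning_periods(dump_days, multipliers):
    """Return the length in days of each climate mean period.

    The first multiplier applies to the dump frequency; each later one
    applies to the period before it.
    """
    periods = []
    length = dump_days
    for factor in multipliers:
        length *= factor
        periods.append(length)
    return periods


print(meaning_periods(10, [3, 3, 4, 10]))  # 10-day dumps, as now set
print(meaning_periods(30, [1, 3, 4, 10]))  # 30-day dumps, comment 29
print(meaning_periods(30, [3, 3, 4, 10]))  # 30-day dumps, original setting
```

The first two give the same periods (30, 90, 360 and 3600 days). The third starts at 90 days, so there is no 30-day period at all, which would be consistent with the missing monthly means reported earlier.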

comment:45 Changed 5 months ago by charlie

Hi Ros,

Sorry, quick question: I submitted my suite yesterday morning, but it is still queueing - the first few stages went through very quickly, but then recon (submitted at 1102 yesterday) didn't actually run until 0426 this morning. It took just a few seconds to run, and now atmos_main is queueing.

Is this normal? If it really does take this long in the queue each cycle, it's going to take forever to do even a short 50 year run. Is there anything we can do about this?

Charlie

comment:46 Changed 5 months ago by charlie

Hi,

Further to my last message, my first cycle of 6 months has now ALMOST finished. I am a little bit worried that it has got to the pptransfer stage and says "retrying" - does this imply it's having problems and is about to fail, or am I worrying unnecessarily?

Either way, based on my first cycle I am getting the following speed for each stage (unless queueing time is specified below, it is only seconds):

fcm_make_um: 1 sec
fcm_make_pp: 15 secs
fcm_make2_pp: 6 secs
install_ancil: 1 sec
recon: 38 secs (however queued for ~17.5 hrs)
atmos_main: 133 mins = 2 hrs 13 mins (however queued for ~7.5 hrs)
supermean: "waiting"
rose_arch_logs: "waiting"
housekeeping: "waiting"
postprocessing: 6 mins
pptransfer: "retrying"

This gives a total of ~140 minutes (or just under 2.5 hours) not including the tasks still waiting. In terms of runtime, that's not bad: 1 model year per 5 hours, which = almost 5 model years per day. However, when you include the queueing time, it has taken just over 27 hours to do 6 months! Is this really what we have to live with?
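As a rough check of these figures, the sketch below adds up the timings listed above for one 6-month cycle and converts them to model years per day, with and without the queueing time. The numbers are copied from the list; the variable and function names are hypothetical, and tasks still shown as waiting or retrying are left out.

```python
# Timings for one 6-month cycle, copied from the list above (in seconds).
run_seconds = {
    "fcm_make_um": 1,
    "fcm_make_pp": 15,
    "fcm_make2_pp": 6,
    "install_ancil": 1,
    "recon": 38,
    "atmos_main": 133 * 60,
    "postprocessing": 6 * 60,
}
queue_hours = {"recon": 17.5, "atmos_main": 7.5}
model_years_per_cycle = 0.5

run_hours = sum(run_seconds.values()) / 3600
total_hours = run_hours + sum(queue_hours.values())


def years_per_day(hours_per_cycle, years=model_years_per_cycle):
    """Model years completed per wall-clock day at this rate."""
    return years * 24 / hours_per_cycle


print(f"run time only: {run_hours:.2f} h per cycle, "
      f"{years_per_day(run_hours):.1f} model years/day")
print(f"with queueing: {total_hours:.2f} h per cycle, "
      f"{years_per_day(total_hours):.2f} model years/day")
```

On run time alone this comes to roughly 5 model years per day; with the queueing included it drops to under half a model year per day.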

Charlie

comment:47 Changed 5 months ago by ros

Hi Charlie,

To answer your first question: a status of retrying can mean a few things; for example, there may be a problem that needs fixing, or the task may have hit a temporary issue and the next submission will succeed. When you see a task retrying you should always check the log files to see what happened. If you look at the 01/job.err file you will see that there is a problem connecting to JASMIN because your ssh key is too open.

Please change the permissions on your jasmin key id_rsa_jasmin to be 600 (chmod 600 /home/d05/cwilliams/.ssh/id_rsa_jasmin)
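A minimal sketch of how to test for this condition before a transfer is attempted. It treats a key as "too open" if group or others have any permission on it, which is what mode 600 rules out; the function name is hypothetical and the path under the main guard is just the key name used in this ticket.

```python
import os
import stat


def key_too_open(path):
    """True if group or others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


if __name__ == "__main__":
    key = os.path.expanduser("~/.ssh/id_rsa_jasmin")
    if os.path.exists(key):
        print(key, "is too open" if key_too_open(key) else "looks fine")
```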

Confirm that you can ssh into JASMIN ok and then retrigger the pptransfer task.

Regarding the queueing time: I was looking at the stats this morning and in the last couple of days there has been a spike due to a burst of activity. Hopefully this will settle down again but I will keep an eye on it. Once you are absolutely sure your run is set up correctly and transfers the data ok, you could submit to the 12 or 24 hour queue.

Cheers,
Ros.

comment:48 Changed 5 months ago by charlie

Thanks Ros, I have now changed permissions as you suggested, tried logging in (which works fine) and retriggered the task. It is now running. I'll let you know as soon as the task has finished, and have checked it has transferred correctly, hopefully within the next hour.

comment:49 Changed 5 months ago by charlie

Hi Ros,

Okay then, excellent news - it appears to have worked. My 1st cycle is finished (apart from rose_arch_logs which is still waiting - what does this do?), and my 2nd cycle is now queueing. All output from /projects/nexcs-n02/cwilliams/sweet/ga71/u-aw739 appears to have been transferred successfully to /group_workspaces/jasmin2/nexcs/cwilliams2011/sweet/ga71/u-aw739 on JASMIN.

Quick question about this: the data appear to be still on NEXCS under the projects directory as well, so do I need to manually remove these every so often or does it do it automatically? It would appear not. If I do need to remove the data every so often, do I have space restrictions on this directory and if so how much? At the moment, 6 months worth is taking up ~15G, so how often should I be removing data? Also, another file (that I created myself when looking at some of the output this afternoon) also seems to have been transferred over by pptransfer, so am I right in thinking that this task literally copies everything in that cycle directory over to JASMIN, regardless of where it has come from (i.e. created by the model or by me)?

The pptransfer task only took 6 minutes in the end so, assuming rose_arch_logs doesn't take more than a few minutes, I am looking at about 145 minutes (or 2.5 hours) per 6 months, not including queueing time of course.

Therefore, given these times, what would a suitable cycling and wall clock time be, and in which queue? As I said, I would like to run this suite for 50 years for the time being. At the moment, I am using the normal queue and cycling for 6 months with a wall clock time of 4 hours. But given that it is doing 6 months in ~2.5 hours, would it be worth doubling everything i.e. having 1 year cycling (which should take ~5 hours) with a wall clock time of 8 hours and using the long12 queue? Or even more, perhaps a cycling of 2 years (which should take ~10 hours) with a wall clock time of 12 hours and using the long24 queue? Which of those 2 has a better turnaround i.e. less queueing time, and which would be most efficient?

Charlie

comment:50 Changed 5 months ago by ros

Hi Charlie,

Please switch off the rose_arch_logs task. This is set up to be Met Office specific and will attempt to archive the log directory to MASS.

The pptransfer task currently does not delete the transferred data off the /projects workspace. You will need to do this manually. Automatic deletion is on the list of things to add but I've not had time to do this yet. NEXCS as a whole has a quota of disk (TBs) but currently individual users don't. You'll need to clear out regularly so as not to leave large amounts of output lying around. The transfer copies over everything that was effectively "archived" by the post-processing app. So if you add further files in yourself these will be transferred over as well. It doesn't do any check on the types of files being transferred.

Yes you could try 2 years with a wallclock of 12 hours in the long12 queue and see how that goes.

Given the amount of data the model is producing (i.e. not that much) post-processing of 4 years' data looks like it should complete within 4 hours (which is the maximum allowed on the shared nodes) so you should also be ok to do 4-year cycling with a 24-hour wallclock in the long24 queue. Obviously do check my calculations here!
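One rough way to check those calculations is to scale the ~145 minutes per 6 months quoted above linearly to longer cycles and compare against the requested wallclock. This is only a sketch: linear scaling is an assumption, queueing time is ignored, and the function names are made up for illustration.

```python
MINUTES_PER_SIX_MONTHS = 145  # figure quoted earlier in this thread


def estimated_hours(cycle_years):
    """Scale the measured 6-month cost linearly to a cycle of cycle_years years."""
    return cycle_years * 2 * MINUTES_PER_SIX_MONTHS / 60


def fits(cycle_years, wallclock_hours):
    """True if the linear estimate is within the requested wallclock."""
    return estimated_hours(cycle_years) <= wallclock_hours


# Candidate (cycle length in years, wallclock in hours) pairs discussed above.
for years, wallclock in [(1, 8), (2, 12), (4, 24)]:
    estimate = estimated_hours(years)
    print(f"{years}y cycle: ~{estimate:.1f} h of {wallclock} h, fits={fits(years, wallclock)}")
```

On this estimate a 4-year cycle needs a little over 19 hours, which leaves some headroom inside 24 hours but not a great deal, so the first cycle is worth watching.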

Cheers,
Ros.

comment:51 Changed 5 months ago by charlie

Hi Ros,

Okay, in terms of tasks: within Rose (in suite conf > Tasks) I currently have "Archive UM wallclock times" as true, and "Archive UM output logs" as false. Which one of these corresponds to rose_arch_logs, or is this switched off somewhere else? Also here I note that "supermeans" is also set to false, so why is it appearing in my list of tasks within Cylc?

Also, having checked it this morning, it appears to be on its 3rd cycle (which was submitted at 2215 last night but only started at 0901 this morning), however it appears to have frozen i.e. the timer within Cylc hasn't changed for the last 10 minutes?

All understood about pptransfer, when I start running properly I will make sure I clear out the /projects directory everyday.

Given that it appears to be transferring correctly, and the fact that it has frozen anyway, shall I kill my current test suite and begin the proper one, trying (as you say) 4 years cycling with 24-hour wall clock and the long24 queue? I have checked the timings, and I think this would be okay - 4 years runtime should take ~20 hours, and processing/transferring appears to be taking 12 minutes each cycle.

Charlie

comment:52 Changed 5 months ago by ros

Hi Charlie,

You can turn off both the "Archive UM wallclock times" and "Archive UM output logs" tasks. Supermeans still appears in the graph even though it's set to false just because of the way the suite is set up (I don't like this approach as it is confusing!). If you look at the supermean job.out file you will see that it just contains a message saying "Supermeans are turned off" or words to that effect.

Sometimes the GUI does freeze. Usually just starting it back up again (rose sgc) fixes the problem.

Yup. Sounds like you are all ready to go full steam ahead. Try the 4year cycling and take a look at the end of the first cycle just to make sure everything is as you expect.

Cheers,
Ros.

comment:53 Changed 5 months ago by charlie

Okay Ros, understood, I will turn both of those off.

I did try restarting the GUI, because it has often frozen on me before, by shutting it down and then restarting using rose sgc, but it reappeared still frozen. Is there a way of checking if the model is running elsewhere, e.g. something equivalent to Archer's qstat?

Either way, I will kill it now and get started. I'll keep you posted…

comment:54 Changed 5 months ago by charlie

Okay, before that, quick question, how do I change the queue? I thought this was in my rose-suite.conf but the only option I can see here is HPC_QUEUE (which is also in Rose) with the only options being normal, high or urgent? Or is it defined somewhere else?

comment:55 Changed 5 months ago by ros

Ah yes you won't be able to do that through the GUI unless you change the metadata. Easiest is just to edit the rose-suite.conf file direct and change HPC_QUEUE='normal' to HPC_QUEUE='long24'.

Last edited 5 months ago by ros

comment:56 Changed 5 months ago by charlie

Okay, done that. So in my rose-suite.conf I have:

CLOCK='PT24H'
HPC_QUEUE='long24'
RUNLEN='other'
RUNLEN_OTHER='P50Y'

among other things, and my cycling (which doesn't seem to be mentioned in this file, but is set within Rose) is set to 4 years i.e. Cycling frequency = P4Y. That's right, isn't it?

It's very confusing (to me, at least) the way multiple variables appear to be referred to by different terms, depending on whether you are looking at the various guides, Rose itself or the individual files. Why can't it all be consistent?!

Charlie

comment:57 Changed 5 months ago by charlie

Ok, I have now submitted it and it is queueing, so I will let you know what happens.

In the meantime, am I right in thinking that if I want to make an exact copy of this suite, I need to commit it to the repository first? Because otherwise, if I just copy, it will revert back to a previous version which might not have all of the little last-minute changes I have made this morning? I tried doing this, but get the following error:

[cwilliams@exvmsrose:~]$ cd roses/u-aw739/
[cwilliams@exvmsrose:~/roses/u-aw739]$ fcm commit
[FAIL] svn status --ignore-externals --show-updates # rc=1
[FAIL] svn: E215004: Authentication failed and interactive prompting is disabled; see the --force-interactive option
[FAIL] svn: E215004: Unable to connect to a repository at URL 'https://code.metoffice.gov.uk/svn/roses-u/a/w/7/3/9/trunk'
[FAIL] svn: E215004: No more credentials or we tried too many times.
[FAIL] Authentication failed

What did I do wrong?

Charlie

comment:58 Changed 5 months ago by charlie

Hi again,

My suite has got to the atmos_main stage, but is retrying - the activity log gives the following line, which is suspicious:

(xcs-c) 2018-06-01T12:46:52Z [STDERR] qsub: Job violates queue and/or server resource limits

I didn't think it did - I'm using long24 with a wallclock time of 24 hours and cycling of 4 years, which I thought should be okay?

comment:59 Changed 5 months ago by ros

Hi Charlie,

I suspect you had the Rose GUI open when you edited the rose-suite.conf file to change queue to long24 as it's reverted back to normal.
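Since an open editor can silently write the old value back, it may be worth confirming what is actually in rose-suite.conf just before submitting. The following is a minimal sketch that reads a setting from text in that KEY='value' style; the helper name and the sample text are illustrative only, and it is not a full parser for the Rose configuration format.

```python
import re


def read_setting(conf_text, key):
    """Return the value assigned to key in KEY='value' style text, without quotes."""
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=\s*(.*)$", re.MULTILINE)
    match = pattern.search(conf_text)
    if match is None:
        return None
    return match.group(1).strip().strip("'\"")


# Sample text standing in for the contents of rose-suite.conf after the revert.
example = """CLOCK='PT24H'
HPC_QUEUE='normal'
RUNLEN='other'
RUNLEN_OTHER='P50Y'
"""

queue = read_setting(example, "HPC_QUEUE")
if queue != "long24":
    print(f"HPC_QUEUE is {queue!r}, not 'long24': the edit has been overwritten")
```

In practice the text would come from reading the suite's own rose-suite.conf with the Rose GUI closed.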

Regarding the fcm commit error, if you've not sorted this out yet, try re-caching your MOSRS password: either run mosrs-cache-password or log out and back into exvmsrose.

Cheers,
Ros.

comment:60 Changed 5 months ago by charlie

Well that's annoying! Yes, you're right, it had - I have now closed the editor, changed it once again, and resubmitted - using the long queue seems to be a lot better, as it has gone straightaway to the atmos_main stage and is now running. It's giving me an estimated time of 24 hours, but presumably this is just because that's what I requested - by my calculations it shouldn't take this long to do 4 years.

And yes, the fcm error was solved by doing what you suggested, I didn't realise that could happen.

So, now I have committed it, does that mean that if I copy it to another suite ID, the new suite will be identical to my current one? If so that's great. Attached to this is another question - is there a limit on NEXCS as to how many jobs we are allowed to run at once? It used to be 4 on Archer.

Charlie

comment:61 Changed 5 months ago by charlie

Hi Ros,

I realise it's a Saturday so you probably won't get this, but if you do I would be grateful for a response. My suite seems to have successfully run its first cycle and is already running the next, but the postproc stage from the first is just "retrying". Having looked at the output, it would seem that this is because the wall clock time for this task is 1 hour whereas the task itself has taken ever so slightly more (1 hour 1 minute). Is there any way I can change this? I was under the impression, obviously incorrectly, that each task has its own wall clock time and these are defined in one of the files - either suite.rc, or rose-suite.conf, or similar - but I have checked all of these and can't seem to find it. Please can you advise? If there is a file in which to find this, can it be changed and then the task re-triggered without disturbing the 2nd cycle which is currently running?

Charlie

comment:62 Changed 5 months ago by charlie

Hi again,

Further to my last message, it appears to have resolved itself. The postprocessing task was still retrying when I wrote to you, but then it ran again and only took 18 minutes this time, so succeeded (as did pptransfer, afterwards). Why would it take so much less time the 2nd time round?

Charlie

comment:63 Changed 5 months ago by ros

Hi Charlie,

The wall time for the postproc task is set in the site/MONSooN.rc file. Under [[POSTPROC_RESOURCE]] change the execution time limit = PT1H line.

Once you've done this just do a rose suite-run --reload to get the suite to pick up the change and the next time postproc runs it will use the new time limit.

postproc took less time the second time around because it didn't redo everything it had done in the first attempt. It will just try and continue on. We can't guarantee that this will always work so please do change the wall clock to allow a longer time.
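To see how much headroom a given limit leaves, the time limit string can be converted to minutes and compared with the observed run time (1 hour 1 minute in this case). This is only a sketch that handles simple durations of the PT#H#M#S form; the function names are made up for illustration.

```python
import re

# Simple ISO 8601 time durations only, e.g. PT1H, PT90M, PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_minutes(text):
    """Convert a simple ISO 8601 time duration to minutes."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes + seconds / 60


def headroom_minutes(limit, observed_minutes):
    """Minutes to spare under the limit (negative means the limit is exceeded)."""
    return duration_minutes(limit) - observed_minutes


for limit in ("PT1H", "PT2H"):
    print(limit, headroom_minutes(limit, 61))
```

A negative result for the current limit matches the retry seen above, so any new value should be chosen with a comfortable margin over the slowest postproc run observed.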

Cheers,
Ros.

comment:64 Changed 5 months ago by charlie

Thanks Ros - what would a suitable time limit be to change this to, perhaps 2 hours? And once I have done this, can I do the rose suite-run --reload whilst other tasks are running (i.e. if it isn't retrying)?

On an entirely different matter, this suite is currently running fine, however a 2nd I started yesterday (u-ay314) has failed at the recon stage. I have checked all of the various log files, but can't see any obvious error. What has gone wrong this time? The suite is identical to my currently working one, with the only difference being the ancillary files going into it (which I have carefully checked, and thought they were all okay).

Charlie

comment:65 Changed 5 months ago by ros

Hi Charlie,

Give 2 hours a go. You can further adjust the wallclock as required.
You can do rose suite-run --reload while the other tasks are running. The reload will only affect tasks that haven't yet been submitted/started to run.

I've started a new ticket #2480 for the problems you are now getting with u-ay314.

comment:66 Changed 5 months ago by charlie

Okay Ros, many thanks, I have now made that change and reloaded it.

And sorry for the other question, I wasn't sure whether to start a new ticket (as theoretically it is an entirely new suite) or just continue with this one.

While I remember - did you see my other question (somewhere above) about what is a reasonable level of usage? It used to be 4 suites at any one time, so is this still the case? Or is that considered greedy? Apart from looking in the log files, how can I tell how much resources I am using and how much I have left? When I asked Grenville about this before, he said that nexcs-n02 doesn't have user allocated resources any more, meaning I can't use the old budgets, so what is the best way of keeping track of how much I'm using (and whether this is more than I should be)?

Charlie

comment:67 Changed 5 months ago by charlie

Sorry Ros, but having managed to get back onto monsoon today, I have found my suite (u-aw739) has failed. It managed to get through 5 cycles (i.e. 20 years) then failed on the 6th cycle at the atmos_main stage. Why would it fail now, after getting so far? I have had a look at all the log files but nothing is obvious, apart from possibly:

? Error message: Invalid time record 2, End-of-file reached ?

Is this the problem, or something else? I wondered whether it was because /projects was full (because I haven't cleared anything off in the last 24 hours) but it can't be that because my results directory is virtually empty?

Charlie

comment:68 Changed 5 months ago by ros

Hi Charlie,

I'm just about to head home, but as a starter for ten this error usually means you have an ancillary file that doesn't have enough data in it. I.e. the model has just reached 2010-12-01 and I would guess there is at least one ancillary file that only contains data up to 2010 after which the model will fail.

I'll answer the questions on your previous comment tomorrow…

Cheers,
Ros.

Last edited 5 months ago by ros

comment:69 Changed 5 months ago by ros

P.S. Just having caught a glance of the stack trace - I think it's the UKCA emissions files.

comment:70 Changed 5 months ago by charlie

Bugger. I thought, clearly incorrectly, that all of my ancillary files were 12 monthly climatologies, not actual timeseries. But because I am using the standard UKCA emissions files rather than anything I have modified, I didn't check these.

Is there any way around this?

comment:71 Changed 4 months ago by luke

Hi Charlie,

Would you rather have 12-month repeating UKCA climatologies? If so, these can be found here:

/projects/um1/ancil/atmos/n96e/ukca_emiss

A full set of emissions files would look like this:

ukca_em_files='$UM_NETCDF_UKCAEMISS_C2H6_DIR/$UM_NETCDF_UKCAEMISS_C2H6_FILE',
             ='$UM_NETCDF_UKCAEMISS_C3H8_DIR/$UM_NETCDF_UKCAEMISS_C3H8_FILE',
             ='$UM_NETCDF_UKCAEMISS_C5H8_DIR/$UM_NETCDF_UKCAEMISS_C5H8_FILE',
             ='$UM_NETCDF_UKCAEMISS_CH4_DIR/$UM_NETCDF_UKCAEMISS_CH4_FILE',
             ='$UM_NETCDF_UKCAEMISS_CO_DIR/$UM_NETCDF_UKCAEMISS_CO_FILE',
             ='$UM_NETCDF_UKCAEMISS_NO_DIR/$UM_NETCDF_UKCAEMISS_NO_FILE',
             ='$UM_NETCDF_UKCAEMISS_HCHO_DIR/$UM_NETCDF_UKCAEMISS_HCHO_FILE',
             ='$UM_NETCDF_UKCAEMISS_MECHO_DIR/$UM_NETCDF_UKCAEMISS_MECHO_FILE',
             ='$UM_NETCDF_UKCAEMISS_ME2CO_DIR/$UM_NETCDF_UKCAEMISS_ME2CO_FILE',
             ='$UM_NETCDF_UKCAEMISS_NVOC_DIR/$UM_NETCDF_UKCAEMISS_NVOC_FILE',
             ='$UM_NETCDF_UKCAEMISS_NOAIR_DIR/$UM_NETCDF_UKCAEMISS_NOAIR_FILE',
             ='$UM_NETCDF_UKCAEMISS_NH3_DIR/$UM_NETCDF_UKCAEMISS_NH3_FILE',
             ='$UM_NETCDF_UKCAEMISS_BCBIOF_DIR/$UM_NETCDF_UKCAEMISS_BCBIOF_FILE',
             ='$UM_NETCDF_UKCAEMISS_BCFOSS_DIR/$UM_NETCDF_UKCAEMISS_BCFOSS_FILE',
             ='$UM_NETCDF_UKCAEMISS_DMS_DIR/$UM_NETCDF_UKCAEMISS_DMS_FILE',
             ='$UM_NETCDF_UKCAEMISS_MONOTP_DIR/$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
             ='$UM_NETCDF_UKCAEMISS_OCBIOF_DIR/$UM_NETCDF_UKCAEMISS_OCBIOF_FILE',
             ='$UM_NETCDF_UKCAEMISS_OCFOSS_DIR/$UM_NETCDF_UKCAEMISS_OCFOSS_FILE',
             ='$UM_NETCDF_UKCAEMISS_SO2HI_DIR/$UM_NETCDF_UKCAEMISS_SO2HI_FILE',
             ='$UM_NETCDF_UKCAEMISS_SO2LOW_DIR/$UM_NETCDF_UKCAEMISS_SO2LOW_FILE',
             ='$UM_NETCDF_UKCAEMISS_SO2NAT_DIR/$UM_NETCDF_UKCAEMISS_SO2NAT_FILE',
             ='$UM_NETCDF_UKCAEMISS_BCBIOM_DIR/$UM_NETCDF_UKCAEMISS_BCBIOM_FILE',
             ='$UM_NETCDF_UKCAEMISS_OCBIOM_DIR/$UM_NETCDF_UKCAEMISS_OCBIOM_FILE'

although this also includes the chemistry. Is it just the aerosol emissions files you need?

Can you check in the install_ancil app and see which ancil versions file you are using? This will look something like

[command]
default=true

[env]
ANCILRES=n96e_orca025
ANCILREV=''
ANCILROOT=$UMDIR/ancil/data/ancil_versions
ANCILVN=GA7.0_AMIP/v2

[file:$ROSE_DATA/etc/um_ancils_gl]
source=$ANCILROOT/$ANCILRES/$ANCILVN/ancils$ANCILREV

i.e. looking in the standard file at

/projects/um1/ancil/data/ancil_versions/n96e_orca025/GA7.0_AMIP/v2/ancils

this has the UKCA-specific files as

# UKCA-GLOMAP mode ancillaries (ancillary/netcdf format)

# Initial conditions (not really an ancil, but extracted from a dump)
export UM_ANCIL_MODEINIT_DIR=$UM_ANCIL_N96EDIR/ukca_init/mi-ah615_September/v1

# directories for netcdf emissions for UKCA (tropospheric aerosols, offline oxidants)
export UM_NETCDF_UKCAEMISS_BCBIOF_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_BCFOSS_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_DMS_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/1970_2010/v2
export UM_NETCDF_UKCAEMISS_MONOTP_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_OCBIOF_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_OCFOSS_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_SO2HI_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/1970_2010/v2
export UM_NETCDF_UKCAEMISS_SO2LOW_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/1970_2010/v2
export UM_NETCDF_UKCAEMISS_SO2NAT_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/andres_kasgnoc/v1
export UM_NETCDF_UKCAEMISS_BCBIOM_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/gfed3.1/clim_2002_2011/v2
export UM_NETCDF_UKCAEMISS_OCBIOM_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/gfed3.1/clim_2002_2011/v2

# directories for additional netcdf emissions for UKCA StratTrop Chemical scheme
export UM_NETCDF_UKCAEMISS_C2H6_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_C3H8_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_C5H8_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_CH4_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_CO_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_NO_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_NVOC_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_HCHO_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_MECHO_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_ME2CO_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_NOAIR_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_NH3_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/1970_2010/v2

###################################################################
# directories for netcdf oxidants (used in offline oxidants runs)
export UM_NETCDF_UKCAOXID_O3_DIR=$UM_ANCIL_N96EDIR/oxidants/ccmi_refc1_anqdg/clim_1988_2010/v1
export UM_NETCDF_UKCAOXID_OH_DIR=$UM_ANCIL_N96EDIR/oxidants/ccmi_refc1_anqdg/clim_1988_2010/v1
export UM_NETCDF_UKCAOXID_NO3_DIR=$UM_ANCIL_N96EDIR/oxidants/ccmi_refc1_anqdg/clim_1988_2010/v1
export UM_NETCDF_UKCAOXID_H2O2_DIR=$UM_ANCIL_N96EDIR/oxidants/ccmi_refc1_anqdg/clim_1988_2010/v1
export UM_NETCDF_UKCAOXID_HO2_DIR=$UM_ANCIL_N96EDIR/oxidants/ccmi_refc1_anqdg/clim_1988_2010/v1

so you should be able to change the paths for the files that don't point to the 2000 files (i.e. the ones pointing to the 1970_2010 directory, which are DMS, all SO2 (except possibly SO2NAT), and NH3). The BC and OC look to be climatologies, as are the oxidants.
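A quick way to pick out those entries is to scan the export lines of the ancil versions file for directories named as a year range. The sketch below assumes that a path component of the form YYYY_YYYY marks a time series (as 1970_2010 does here) while names such as clim_2002_2011 are climatologies; that naming rule and the function name are assumptions, so open the files themselves to confirm.

```python
import re

EXPORT_LINE = re.compile(r"^\s*export\s+(\w+)=(\S+)")
# A whole path component made of two four-digit years, e.g. /1970_2010/.
YEAR_SPAN = re.compile(r"/(\d{4})_(\d{4})(?:/|$)")


def timeseries_dirs(versions_text):
    """Map variable name -> (first_year, last_year) for directories named YYYY_YYYY."""
    found = {}
    for line in versions_text.splitlines():
        export = EXPORT_LINE.match(line)
        if export is None:
            continue  # comments and blank lines
        name, path = export.groups()
        span = YEAR_SPAN.search(path)
        if span:
            found[name] = (int(span.group(1)), int(span.group(2)))
    return found


# A few lines copied from the versions file quoted above.
sample = """\
# directories for netcdf emissions for UKCA
export UM_NETCDF_UKCAEMISS_BCBIOF_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_DMS_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/1970_2010/v2
export UM_NETCDF_UKCAEMISS_SO2HI_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/1970_2010/v2
export UM_NETCDF_UKCAEMISS_BCBIOM_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/gfed3.1/clim_2002_2011/v2
"""

for name, (first, last) in timeseries_dirs(sample).items():
    print(f"{name}: data for {first}-{last} only")
```

Any variable flagged this way is a candidate for the "End-of-file reached" failure once the model date passes the last year in the range.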

Note though that these emissions and oxidant fields are for ~2000. For running the Eocene you'd want at least a set of pre-Industrial ones. For the emissions these should be available (at some point) for UKESM1/GC3.1 from the Met Office, although the oxidants are still ~2000 I think.

Thanks,
Luke

Changed 4 months ago by charlie

Changed 4 months ago by charlie

comment:72 Changed 4 months ago by charlie

Hi Luke,

Thanks for your message.

So… here is the list of emissions files which are actually being used in my current suite on Monsoon:

ukca_em_files='$UM_NETCDF_UKCAEMISS_BCBIOF_DIR/$UM_NETCDF_UKCAEMISS_BCBIOF_FILE',
='$UM_NETCDF_UKCAEMISS_BCFOSS_DIR/$UM_NETCDF_UKCAEMISS_BCFOSS_FILE',
='$UM_NETCDF_UKCAEMISS_DMS_DIR/$UM_NETCDF_UKCAEMISS_DMS_FILE',
='$UM_NETCDF_UKCAEMISS_MONOTP_DIR/$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
='$UM_NETCDF_UKCAEMISS_OCBIOF_DIR/$UM_NETCDF_UKCAEMISS_OCBIOF_FILE',
='$UM_NETCDF_UKCAEMISS_OCFOSS_DIR/$UM_NETCDF_UKCAEMISS_OCFOSS_FILE',
='$UM_NETCDF_UKCAEMISS_SO2HI_DIR/$UM_NETCDF_UKCAEMISS_SO2HI_FILE',
='$UM_NETCDF_UKCAEMISS_SO2LOW_DIR/$UM_NETCDF_UKCAEMISS_SO2LOW_FILE',
='$UM_NETCDF_UKCAEMISS_SO2NAT_DIR/$UM_NETCDF_UKCAEMISS_SO2NAT_FILE',
='$UM_NETCDF_UKCAEMISS_BCBIOM_DIR/$UM_NETCDF_UKCAEMISS_BCBIOM_FILE',
='$UM_NETCDF_UKCAEMISS_OCBIOM_DIR/$UM_NETCDF_UKCAEMISS_OCBIOM_FILE'

As you can see, this is significantly shorter than the list you gave, presumably because it contains just the aerosol emissions and not chemistry. Are the chemistry files therefore included, and listed, somewhere else or does the above imply they are not being used in my suite?

From looking at each of the above, all of them appear to be 12 monthly climatologies except 3 of them, namely:

$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
$UM_NETCDF_UKCAEMISS_SO2HI_FILE',
$UM_NETCDF_UKCAEMISS_SO2LOW_FILE',

These appear to be monthly timeseries going from 1970-2010. So presumably this is why my suite has failed, because it has tried to go beyond 2010. Also, I note that I can open and view all of these files no problem using xconv, except $UM_NETCDF_UKCAEMISS_SO2NAT_FILE which gives me an error, so might this also be the reason it's falling over?

I have attached my Eocene ancillary versions file to this email so you can see where these actually point to (the ancillaries highlighted in red are those that I have modified).

So it's the aerosols which are causing the problem. I had hoped to be able to run my current suite, which uses the ancillary files in the attached i.e. modern aerosols but Eocene everything else, for 50 years, just to get an idea of what happens. The next stage would then be to get the preindustrial aerosols, and run with these instead, but removing all anthropogenic aerosols and modifying the natural ones to be Eocene.

However, it looks like I can't do the first step. So, onto the preindustrial aerosols. I already have a list of these, currently on Archer:

ukca_em_files='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/BC_biofuel_1850_time_slice.nc',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/BC_fossil_1850_time_slice.nc',
='$UM_NETCDF_UKCAEMISS_DMS_DIR/$UM_NETCDF_UKCAEMISS_DMS_FILE',
='$UM_NETCDF_UKCAEMISS_MONOTP_DIR/$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/OC_biofuel_1850_time_slice.nc',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/OC_fossil_1850_time_slice.nc',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/SO2_high_1850_time_slice.nc',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/SO2_low_1850_time_slice.nc',
='$UM_NETCDF_UKCAEMISS_SO2NAT_DIR/$UM_NETCDF_UKCAEMISS_SO2NAT_FILE',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/BC_biomass_high_1850_time_slice.nc',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/BC_biomass_low_1850_time_slice.nc',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/OC_biomass_high_1850_time_slice.nc',
='$CMIP6_ANCILS/n96e/timeslice_1850/AerosolChemistryEmissions/v1/OC_biomass_low_1850_time_slice.nc'

These are the files that are currently being used in another simulation of mine, for the mid-Holocene. As you can see, most of these are preindustrial (i.e. 1850) except 3, namely:

$UM_NETCDF_UKCAEMISS_DMS_FILE',
$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
$UM_NETCDF_UKCAEMISS_SO2NAT_FILE,

but all of them are 12 monthly climatologies except $UM_NETCDF_UKCAEMISS_SO2NAT_FILE which appears to have one timeslice only (an annual mean perhaps?).

So, 3 questions:

Firstly, could I change my Eocene suite, so that instead of using the Monsoon files:

$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
$UM_NETCDF_UKCAEMISS_SO2HI_FILE',
$UM_NETCDF_UKCAEMISS_SO2LOW_FILE',

I instead use the Archer files:

$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
$UM_NETCDF_UKCAEMISS_SO2NAT_FILE,

given that these are not timeseries? That way, I could at least get my current suite running.

Secondly, of the above list of preindustrial control aerosols, which ones can be considered to be anthropogenic and which ones are natural? Am I right in thinking that the only actual natural ones are the DMS, MONOTP and SO2NAT?

Lastly, again you will notice that the above list of preindustrial control aerosols is much shorter than your list, presumably because it doesn't contain chemistry? Again, is this because the chemistry files are listed elsewhere, or does this imply they are not being used?

Many thanks

Charlie

comment:73 Changed 4 months ago by luke

Hi Charlie,

You won't be running with chemistry, so sorry for confusing things there (although they are helpful as examples).

In your suite with present day emissions you could just point to the year 2000 files and it "should just work". A full list of available emissions files in the directory I mentioned is:

./andres_kasgnoc/v1/ukca_emiss_SO2_nat.nc
./aerocom/v1/ukca_emiss_SO2_nat.nc
./aerocom/v1/README.md
./cmip5/2000/v1/ukca_emiss_C3H8.nc
./cmip5/2000/v1/ukca_emiss_Monoterp.nc
./cmip5/2000/v1/ukca_emiss_CO.nc
./cmip5/2000/v1/ukca_emiss_NVOC.nc
./cmip5/2000/v1/ukca_emiss_OC_biomass.nc
./cmip5/2000/v1/ukca_emiss_SO2_low.nc
./cmip5/2000/v1/ukca_emiss_HCHO.nc
./cmip5/2000/v1/ukca_emiss_BC_fossil.nc
./cmip5/2000/v1/ukca_emiss_MeCHO.nc
./cmip5/2000/v1/ukca_emiss_BC_biofuel.nc
./cmip5/2000/v1/ukca_emiss_NO.nc
./cmip5/2000/v1/ukca_emiss_C2H6.nc
./cmip5/2000/v1/ukca_emiss_SO2_high.nc
./cmip5/2000/v1/ukca_emiss_DMS.nc
./cmip5/2000/v1/ukca_emiss_Me2CO.nc
./cmip5/2000/v1/ukca_emiss_NO_aircrft.nc
./cmip5/2000/v1/ukca_emiss_NH3.nc
./cmip5/2000/v1/ukca_emiss_OC_fossil.nc
./cmip5/2000/v1/ukca_emiss_OC_biofuel.nc
./cmip5/2000/v1/ukca_emiss_CH4.nc
./cmip5/2000/v1/ukca_emiss_C5H8.nc
./cmip5/2000/v1/ukca_emiss_BC_biomass.nc
./cmip5/2000/v2/ukca_emiss_C3H8.nc
./cmip5/2000/v2/ukca_emiss_Monoterp.nc
./cmip5/2000/v2/ukca_emiss_CO.nc
./cmip5/2000/v2/ukca_emiss_NVOC.nc
./cmip5/2000/v2/ukca_emiss_OC_biomass.nc
./cmip5/2000/v2/ukca_emiss_SO2_low.nc
./cmip5/2000/v2/ukca_emiss_HCHO.nc
./cmip5/2000/v2/ukca_emiss_BC_fossil.nc
./cmip5/2000/v2/ukca_emiss_MeCHO.nc
./cmip5/2000/v2/ukca_emiss_BC_biofuel.nc
./cmip5/2000/v2/ukca_emiss_NO.nc
./cmip5/2000/v2/ukca_emiss_C2H6.nc
./cmip5/2000/v2/ukca_emiss_SO2_high.nc
./cmip5/2000/v2/ukca_emiss_DMS.nc
./cmip5/2000/v2/ukca_emiss_Me2CO.nc
./cmip5/2000/v2/ukca_emiss_NO_aircrft.nc
./cmip5/2000/v2/ukca_emiss_NH3.nc
./cmip5/2000/v2/ukca_emiss_OC_fossil.nc
./cmip5/2000/v2/ukca_emiss_OC_biofuel.nc
./cmip5/2000/v2/ukca_emiss_CH4.nc
./cmip5/2000/v2/ukca_emiss_C5H8.nc
./cmip5/2000/v2/ukca_emiss_BC_biomass.nc
./cmip5/1860/v1/ukca_emiss_C3H8.nc
./cmip5/1860/v1/ukca_emiss_Monoterp.nc
./cmip5/1860/v1/ukca_emiss_CO.nc
./cmip5/1860/v1/ukca_emiss_NVOC.nc
./cmip5/1860/v1/ukca_emiss_OC_biomass.nc
./cmip5/1860/v1/ukca_emiss_SO2_low.nc
./cmip5/1860/v1/ukca_emiss_HCHO.nc
./cmip5/1860/v1/ukca_emiss_BC_fossil.nc
./cmip5/1860/v1/ukca_emiss_MeCHO.nc
./cmip5/1860/v1/ukca_emiss_BC_biofuel.nc
./cmip5/1860/v1/ukca_emiss_NO.nc
./cmip5/1860/v1/ukca_emiss_C2H6.nc
./cmip5/1860/v1/ukca_emiss_SO2_high.nc
./cmip5/1860/v1/ukca_emiss_DMS.nc
./cmip5/1860/v1/ukca_emiss_Me2CO.nc
./cmip5/1860/v1/ukca_emiss_NO_aircrft.nc
./cmip5/1860/v1/ukca_emiss_NH3.nc
./cmip5/1860/v1/ukca_emiss_OC_fossil.nc
./cmip5/1860/v1/ukca_emiss_OC_biofuel.nc
./cmip5/1860/v1/ukca_emiss_CH4.nc
./cmip5/1860/v1/ukca_emiss_C5H8.nc
./cmip5/1860/v1/ukca_emiss_BC_biomass.nc
./cmip5/1860/v2/ukca_emiss_C3H8.nc
./cmip5/1860/v2/ukca_emiss_Monoterp.nc
./cmip5/1860/v2/ukca_emiss_CO.nc
./cmip5/1860/v2/ukca_emiss_NVOC.nc
./cmip5/1860/v2/ukca_emiss_OC_biomass.nc
./cmip5/1860/v2/ukca_emiss_SO2_low.nc
./cmip5/1860/v2/ukca_emiss_HCHO.nc
./cmip5/1860/v2/ukca_emiss_BC_fossil.nc
./cmip5/1860/v2/ukca_emiss_MeCHO.nc
./cmip5/1860/v2/ukca_emiss_BC_biofuel.nc
./cmip5/1860/v2/ukca_emiss_NO.nc
./cmip5/1860/v2/ukca_emiss_C2H6.nc
./cmip5/1860/v2/ukca_emiss_SO2_high.nc
./cmip5/1860/v2/ukca_emiss_DMS.nc
./cmip5/1860/v2/ukca_emiss_Me2CO.nc
./cmip5/1860/v2/ukca_emiss_NO_aircrft.nc
./cmip5/1860/v2/ukca_emiss_NH3.nc
./cmip5/1860/v2/ukca_emiss_OC_fossil.nc
./cmip5/1860/v2/ukca_emiss_OC_biofuel.nc
./cmip5/1860/v2/ukca_emiss_CH4.nc
./cmip5/1860/v2/ukca_emiss_C5H8.nc
./cmip5/1860/v2/ukca_emiss_BC_biomass.nc
./cmip5/1970_2010/v1/ukca_emiss_C3H8.nc
./cmip5/1970_2010/v1/ukca_emiss_Monoterp.nc
./cmip5/1970_2010/v1/ukca_emiss_CO.nc
./cmip5/1970_2010/v1/ukca_emiss_NVOC.nc
./cmip5/1970_2010/v1/ukca_emiss_OC_biomass.nc
./cmip5/1970_2010/v1/ukca_emiss_SO2_low.nc
./cmip5/1970_2010/v1/ukca_emiss_HCHO.nc
./cmip5/1970_2010/v1/ukca_emiss_BC_fossil.nc
./cmip5/1970_2010/v1/ukca_emiss_MeCHO.nc
./cmip5/1970_2010/v1/ukca_emiss_BC_biofuel.nc
./cmip5/1970_2010/v1/ukca_emiss_NO.nc
./cmip5/1970_2010/v1/ukca_emiss_C2H6.nc
./cmip5/1970_2010/v1/ukca_emiss_SO2_high.nc
./cmip5/1970_2010/v1/ukca_emiss_DMS.nc
./cmip5/1970_2010/v1/ukca_emiss_Me2CO.nc
./cmip5/1970_2010/v1/ukca_emiss_NO_aircrft.nc
./cmip5/1970_2010/v1/ukca_emiss_NH3.nc
./cmip5/1970_2010/v1/ukca_emiss_OC_fossil.nc
./cmip5/1970_2010/v1/ukca_emiss_OC_biofuel.nc
./cmip5/1970_2010/v1/ukca_emiss_CH4.nc
./cmip5/1970_2010/v1/ukca_emiss_C5H8.nc
./cmip5/1970_2010/v1/ukca_emiss_BC_biomass.nc
./cmip5/1970_2010/v2/ukca_emiss_SO2_low.nc
./cmip5/1970_2010/v2/ukca_emiss_SO2_high.nc
./cmip5/1970_2010/v2/ukca_emiss_DMS.nc
./cmip5/1970_2010/v2/ukca_emiss_NH3.nc
./biogenic/v1/DMS_land_spiro1992.nc
./biogenic/v1/MEGAN-MACC_biogenic_C3H8_clim_2001-2010.nc
./biogenic/v1/POET_oceanic_C3H8_1990_lumped.nc
./biogenic/v1/POET_oceanic_C2H6_1990_lumped.nc
./biogenic/v1/MEGAN-MACC_biogenic_CO_clim_2001-2010.nc
./biogenic/v1/MEGAN-MACC_biogenic_MeCHO_clim_2001-2010.nc
./biogenic/v1/MEGAN-MACC_biogenic_Monoterp_clim_2001-2010.nc
./biogenic/v1/MEGAN-MACC_biogenic_C5H8_clim_2001-2010.nc
./biogenic/v1/POET_oceanic_CO_1990_processed.nc
./biogenic/v1/NH3_ocean_bouwman1997.nc
./biogenic/v1/MEGAN-MACC_biogenic_C2H6_clim_2001-2010.nc
./biogenic/v1/MEGAN-MACC_biogenic_Me2CO_clim_2001-2010.nc
./biogenic/v1/MEGAN-MACC_biogenic_NVOC_clim_2001-2010.nc
./biogenic/v1/nox_soil_0.5_0.5_scaled12TgNO.nc
./biogenic/v1/MEGAN-MACC_biogenic_MeOH_clim_2001-2010.nc
./biogenic/v1/MEGAN-MACC_biogenic_HCHO_clim_2001-2010.nc
./gfed3.1/clim_2002_2011/v1/ukca_emiss_OC_biomass.nc
./gfed3.1/clim_2002_2011/v1/ukca_emiss_BC_biomass.nc
./gfed3.1/clim_2002_2011/v2/ukca_emiss_OC_biomass.nc
./gfed3.1/clim_2002_2011/v2/ukca_emiss_BC_biomass.nc

which contain both year 2000 and in fact 1860 climatologies of all fields, so you should be fine.

You don't need to even edit the ancils file. You can, if you wanted, replace the environment variables with absolute paths, e.g.

='$UM_NETCDF_UKCAEMISS_MONOTP_DIR/$UM_NETCDF_UKCAEMISS_MONOTP_FILE',

could become

='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_Monoterp.nc',

etc.
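To work out what an entry resolves to before hard-coding it, the variable references can be expanded by hand. The sketch below uses Python's string.Template for this; the value of UM_ANCIL_N96EDIR and the file name are assumptions inferred from the resolved path shown above, not read from the suite.

```python
from string import Template

# Assumed values, inferred from the absolute path quoted above.
env = {
    "UM_ANCIL_N96EDIR": "/projects/um1/ancil/atmos/n96e",
    "UM_NETCDF_UKCAEMISS_MONOTP_FILE": "ukca_emiss_Monoterp.nc",
}
# The directory variable is itself defined in terms of UM_ANCIL_N96EDIR.
env["UM_NETCDF_UKCAEMISS_MONOTP_DIR"] = Template(
    "$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2"
).substitute(env)


def resolve(entry, variables):
    """Expand $NAME references in a ukca_em_files entry to an absolute path."""
    return Template(entry).substitute(variables)


path = resolve("$UM_NETCDF_UKCAEMISS_MONOTP_DIR/$UM_NETCDF_UKCAEMISS_MONOTP_FILE", env)
print(path)
```

substitute raises KeyError for any variable that has not been defined, which is a useful prompt to go and look it up in the ancil versions file.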

Any files with a single time point are just climatological annual means as you say.

Thanks,
Luke

comment:74 Changed 4 months ago by charlie

Okay, so just to clarify:

The 3 files causing me problems are ukca_emiss_DMS.nc, ukca_emiss_SO2_high.nc and ukca_emiss_SO2_low.nc because these are all timeseries. They are all at: /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2/

If I change these for the same filenames (i.e. ukca_emiss_DMS.nc, ukca_emiss_SO2_high.nc and ukca_emiss_SO2_low.nc) but at
/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2, which should be climatologies, then that should be fine, yes?

Am I right in thinking that, after making this change, I will need to run my suite from the beginning rather than just restarting?

Also, why is it that there are many more files in your list than I am using in either my mid-Holocene or Eocene suites?

Charlie

comment:75 Changed 4 months ago by luke

Yes, that's correct. You should just be able to replace the 1970_2010 with 2000 and these should work.
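As a small illustration of that substitution, the sketch below swaps only the 1970_2010 path component and leaves everything else untouched. The function name is made up, and the paths are the ones quoted earlier in this ticket.

```python
def use_climatology(path, series="1970_2010", climatology="2000"):
    """Swap the time-series directory component for the climatology one, if present."""
    parts = path.split("/")
    return "/".join(climatology if part == series else part for part in parts)


old = "/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2/ukca_emiss_DMS.nc"
print(use_climatology(old))
```

Matching on the whole path component rather than on a substring avoids accidentally touching names such as clim_1988_2010.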

You can just restart the suite

rose suite-run --restart

or if it's still running

rose suite-run --reload

followed by re-triggering the failing job step. Note however that this creates a spike in the emissions (or a dip, depending on the trend) so scientifically I would be wary of the results. However, this is fine for testing.

There are more files as there are also emissions for 9 chemical species you aren't including, as well as different datasets and versions of these datasets. To just test that any suite or code changes work then any would be fine, but for particular scientific questions particular sets would be chosen.

Give it a go and see what happens.

Best wishes,
Luke

comment:76 Changed 4 months ago by charlie

Hi Luke,

Very many thanks, all understood.

Just one quick question: do you have any idea for how long the spike/dip would persist before getting back into equilibrium? The reason I ask is because I am currently 20 years into my simulation, and am planning to run it for 50 years - but I'm going to ignore the first 30 years as spin up and only consider the last 20 years for my analysis. Given the relatively short time it takes the atmosphere to get into equilibrium, and that I am 20 years in, do you think the impact of the spike would have disappeared within 10 years, meaning I could still trust my final 20 years?

Charlie

comment:77 Changed 4 months ago by luke

10 years of additional spin-up would probably be OK, but the only way to know is to monitor the job as it progresses, rather than waiting until it has finished.

comment:78 Changed 4 months ago by charlie

Hi Luke,

Sorry about this, but I have just tried doing what you suggested (i.e. changing those 3 ancillaries to the 2000 climatologies) and have resubmitted my suite and re-triggered the task. It ran for 10 minutes so my hopes were high, but then failed again. The errors look similar to the ones before, although perhaps it has failed for another reason? Can you possibly advise?

Charlie

comment:79 Changed 4 months ago by luke

Can you point me to the suite-id and directory containing the log files please.

comment:80 Changed 4 months ago by charlie

No problem. My log files are at ~/cylc-run/u-aw739/log/job/20080901T0000Z/atmos_main/NN

comment:81 Changed 4 months ago by luke

Your new files haven't been picked-up by the suite - see the job.out file:

ukca_em_files(  1) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_biofuel.nc
ukca_em_files(  2) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_fossil.nc
ukca_em_files(  3) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2/ukca_emiss_DMS.nc
ukca_em_files(  4) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_Monoterp.nc
ukca_em_files(  5) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_OC_biofuel.nc
ukca_em_files(  6) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_OC_fossil.nc
ukca_em_files(  7) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2/ukca_emiss_SO2_high.nc
ukca_em_files(  8) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2/ukca_emiss_SO2_low.nc
ukca_em_files(  9) = /projects/um1/ancil/atmos/n96e/ukca_emiss/aerocom/v1/ukca_emiss_SO2_nat.nc
ukca_em_files( 10) = /projects/um1/ancil/atmos/n96e/ukca_emiss/gfed3.1/clim_2002_2011/v2/ukca_emiss_BC_biomass.nc
ukca_em_files( 11) = /projects/um1/ancil/atmos/n96e/ukca_emiss/gfed3.1/clim_2002_2011/v2/ukca_emiss_OC_biomass.nc
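
As an aside, a listing like this can be checked mechanically for entries that still point at the old dataset. The following is only a minimal Python sketch, not part of the suite: the function name find_stale_entries and the shortened SAMPLE text are illustrative, and the regular expression assumes the job.out lines look exactly like those above.

```python
import re

# Matches lines such as "ukca_em_files(  3) = /path/to/file.nc".
# The pattern is an assumption based on the job.out excerpt in this ticket.
ENTRY_RE = re.compile(r"^\s*ukca_em_files\(\s*(\d+)\)\s*=\s*(\S+)\s*$")


def find_stale_entries(job_out_text, stale_marker="/1970_2010/"):
    """Return (index, path) pairs whose path still contains stale_marker."""
    stale = []
    for line in job_out_text.splitlines():
        match = ENTRY_RE.match(line)
        if match and stale_marker in match.group(2):
            stale.append((int(match.group(1)), match.group(2)))
    return stale


# Shortened sample of the listing (illustrative input only).
SAMPLE = """\
ukca_em_files(  2) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_fossil.nc
ukca_em_files(  3) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2/ukca_emiss_DMS.nc
ukca_em_files(  7) = /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2/ukca_emiss_SO2_high.nc
ukca_em_files(  9) = /projects/um1/ancil/atmos/n96e/ukca_emiss/aerocom/v1/ukca_emiss_SO2_nat.nc
"""

if __name__ == "__main__":
    for index, path in find_stale_entries(SAMPLE):
        print(f"entry {index} still points at the old dataset: {path}")
```

In practice the text would be read from the job.out file of the failing atmos_main task rather than from a hard-coded string.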

You haven't changed the files in your app/um/rose-app.conf file:

ukca_em_files='$UM_NETCDF_UKCAEMISS_BCBIOF_DIR/$UM_NETCDF_UKCAEMISS_BCBIOF_FILE',
             ='$UM_NETCDF_UKCAEMISS_BCFOSS_DIR/$UM_NETCDF_UKCAEMISS_BCFOSS_FILE',
             ='$UM_NETCDF_UKCAEMISS_DMS_DIR/$UM_NETCDF_UKCAEMISS_DMS_FILE',
             ='$UM_NETCDF_UKCAEMISS_MONOTP_DIR/$UM_NETCDF_UKCAEMISS_MONOTP_FILE',
             ='$UM_NETCDF_UKCAEMISS_OCBIOF_DIR/$UM_NETCDF_UKCAEMISS_OCBIOF_FILE',
             ='$UM_NETCDF_UKCAEMISS_OCFOSS_DIR/$UM_NETCDF_UKCAEMISS_OCFOSS_FILE',
             ='$UM_NETCDF_UKCAEMISS_SO2HI_DIR/$UM_NETCDF_UKCAEMISS_SO2HI_FILE',
             ='$UM_NETCDF_UKCAEMISS_SO2LOW_DIR/$UM_NETCDF_UKCAEMISS_SO2LOW_FILE',
             ='$UM_NETCDF_UKCAEMISS_SO2NAT_DIR/$UM_NETCDF_UKCAEMISS_SO2NAT_FILE',
             ='$UM_NETCDF_UKCAEMISS_BCBIOM_DIR/$UM_NETCDF_UKCAEMISS_BCBIOM_FILE',
             ='$UM_NETCDF_UKCAEMISS_OCBIOM_DIR/$UM_NETCDF_UKCAEMISS_OCBIOM_FILE'

Your install_ancil is

[command]
default=true

[env]
ANCILRES=n96e_orca025
ANCILREV=''
ANCILROOT=$UMDIR/ancil/data/ancil_versions
ANCILVN=GA7.0_AMIP/v2

[file:$ROSE_DATA/etc/um_ancils_gl]
source=/home/d05/cwilliams/ga71/ancil_versions/GA7p1_UM10p7_deepmip_cjrw_pol10k4k

This file at the end does show that you've set the directories correctly

export UM_NETCDF_UKCAEMISS_DMS_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_SO2HI_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2
export UM_NETCDF_UKCAEMISS_SO2LOW_DIR=$UM_ANCIL_N96EDIR/ukca_emiss/cmip5/2000/v2

However, you won't have been able to re-trigger the install_ancil app. This is only run on the first jobstep I think. It is possible to insert an app at a cycle point, but I'm afraid that I can't remember exactly how to do that.

My advice to re-trigger the atmos_main task was assuming that you were inserting these files in the run_ukca namelist in the app/um/rose-app.conf file, i.e. it will only work if the absolute path to the file is set. Putting the path in the ancils file won't work as this isn't re-read, so the environment variables will still be set to what they were before.
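
To illustrate the point, here is a small Python analogy using string.Template. It is not how Rose or cylc actually perform the expansion; the dictionaries env_at_install and ancils_file_now are hypothetical stand-ins for the values fixed when install_ancil ran and for the edited ancils file.

```python
from string import Template

# Entry as it appears in app/um/rose-app.conf.
entry = Template("$UM_NETCDF_UKCAEMISS_DMS_DIR/$UM_NETCDF_UKCAEMISS_DMS_FILE")

# Hypothetical snapshot of the values set when install_ancil ran.
env_at_install = {
    "UM_NETCDF_UKCAEMISS_DMS_DIR":
        "/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/1970_2010/v2",
    "UM_NETCDF_UKCAEMISS_DMS_FILE": "ukca_emiss_DMS.nc",
}

# Hypothetical contents of the ancils file after editing.
# On a restart this is not re-read, so the task never sees it.
ancils_file_now = dict(env_at_install)
ancils_file_now["UM_NETCDF_UKCAEMISS_DMS_DIR"] = (
    "/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2"
)

# The re-triggered task expands the entry against the old snapshot ...
path_seen_by_task = entry.substitute(env_at_install)

# ... whereas an absolute path in rose-app.conf needs no expansion at all.
absolute_entry = (
    "/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_DMS.nc"
)

print(path_seen_by_task)  # still the 1970_2010 path
print(absolute_entry)     # the path you actually want
```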

Does this make sense?

Please put the full paths to the files in the app/um/rose-app.conf and try again.

Also, when providing paths to files for me to see, please give the full path (i.e. beginning /home/...) because for me ~/ resolves as my home directory. You can get this full path by typing pwd.

Many thanks,
Luke

comment:82 Changed 4 months ago by charlie

Sorry Luke, I should have realised that. I thought it was easier to change the ancillary version file rather than the actual list itself, but should have realised that this isn't picked up each time the task is run. I'm now in the process of changing over the actual list itself (i.e. in rose-app.conf). Quick question: is the layout exactly the same, with the same = signs and commas, i.e.

ukca_em_files='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_biofuel.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_biomass.nc',

and so on?

comment:83 Changed 4 months ago by luke

Yes, as I said in #comment:73 you can just change the path within the ' quotes to the new full one.

comment:84 Changed 4 months ago by luke

Also, make sure you remove the old failing paths that were set using environment variables; you can't just add to the list.

comment:85 Changed 4 months ago by charlie

Yes of course. One more quick question: should there be a space in between each, or an enter/new line?

comment:86 Changed 4 months ago by luke

Are you editing in Rose or the file itself in a text editor? If in Rose, these are separated by spaces; if in the text file, they are on new lines with an = sign in front, exactly as I have in #comment:73 .

Essentially you just copy exactly how it is being done in the application you are viewing & editing it in.

comment:87 Changed 4 months ago by charlie

Understood, many thanks, I thought that was the case but just wanted to clarify as I know it's touchy about that sort of thing.

Just had a thought: whilst my suite was stopped and whilst doing all of this, I also made a small change to one of the ancillary files (namely the vegetation fraction file). If I am just restarting my suite and retriggering the atmos_main task, does that mean it isn't running recon again and therefore won't pick up the change I made to the ancillary file?

comment:88 Changed 4 months ago by charlie

Okay, it didn't like that this time, giving me the following error more or less straightaway:

[FAIL] /home/d05/cwilliams/cylc-run/u-aw739/app/um/rose-app.conf(2531): expecting "[SECTION]" or "KEY=VALUE"
[FAIL] ='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_fossil.nc',
[FAIL]
2018-06-08T14:19:37Z CRITICAL - Task job script received signal EXIT

I have doublechecked the path, and it is correct.

comment:89 Changed 4 months ago by luke

Reconfiguration is used to make up the initial start dump for the atmos run - since this has already started, it won't be run again (it's like install_ancil).

I'm not familiar with how the vegetation fraction works - is this just initialised at the start or is it a timeseries/climatology/constant file that the UM reads-in as it runs? If the former, your change won't affect this run (as this is only done at the reconfiguration step), if the latter it should be picked up on a --restart or --reload if you've set it in the app/um/rose-app.conf file and NOT in the ancils file.

In terms of the error above, I can see that you've now changed the relevant section of the app/um/rose-app.conf file to be

ukca_em_dir=''
ukca_em_files='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_biofuel.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_BC_fossil.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_DMS.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_Monoterp.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_OC_biofuel.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_OC_fossil.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_SO2_high.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/ukca_emiss_SO2_low.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/aerocom/v1/ukca_emiss_SO2_nat.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/gfed3.1/clim_2002_2011/v2/ukca_emiss_BC_biomass.nc',
='/projects/um1/ancil/atmos/n96e/ukca_emiss/gfed3.1/clim_2002_2011/v2/ukca_emiss_OC_biomass.nc'
             
!!ukca_h1202mmr=0

I suspect that there are possibly a few things going on here:

  1. You only strictly needed to update the DMS, SO2HI, and SO2LOW files, although this won't be the cause of this problem.
  2. You have a blank line before !!ukca_h1202mmr=0, although this might be OK.
  3. You haven't lined-up the = signs to be indented (by spaces I believe) to match with the one after ukca_em_files. I think that this is where you have your problem.

Please correct for point 3 above and try again.
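
If it helps, the layout can be generated or checked with a few lines of Python. This is only a sketch under the assumption, suggested by the error above, that continuation lines of a multi-line value must start with whitespace; the helper names format_multiline_value and unindented_continuations are made up for illustration and are not part of Rose.

```python
def format_multiline_value(key, values):
    """Format a list of strings as a multi-line KEY=VALUE entry.

    Continuation lines are indented with spaces so that their '=' sits
    directly under the '=' that follows the key.
    """
    if not values:
        return f"{key}="
    quoted = [f"'{value}'" for value in values]
    indent = " " * len(key)
    lines = [f"{key}={quoted[0]}"]
    lines += [f"{indent}={item}" for item in quoted[1:]]
    # Every line except the last ends with a comma.
    return ",\n".join(lines)


def unindented_continuations(text):
    """Return 1-based numbers of lines that start with '=' in column 1."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if line.startswith("=")]


if __name__ == "__main__":
    base = "/projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2"
    files = [f"{base}/ukca_emiss_BC_biofuel.nc",
             f"{base}/ukca_emiss_BC_fossil.nc",
             f"{base}/ukca_emiss_DMS.nc"]
    block = format_multiline_value("ukca_em_files", files)
    print(block)
    # An empty list means no continuation line starts in column 1.
    print(unindented_continuations(block))
```

Running unindented_continuations over the text of app/um/rose-app.conf would list the lines that begin with = in the first column, which is the kind of line the [FAIL] message quotes.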

Thanks,
Luke

comment:90 Changed 4 months ago by luke

Also, correct for point 2 as well, sorry.

comment:91 Changed 4 months ago by charlie

Okay, done that and resubmitted…..

comment:92 Changed 4 months ago by charlie

Hi Luke,

Sorry, but more problems. The good news is it ran for one complete cycle (i.e. finished the cycle it was stuck on previously) but then failed a couple of hours into the next cycle, again at the atmos_main task. I have had a look at the various logs, and the only new error (which I haven't seen before) is:

?  Error message: North/South halos too small for advection.

?        See the following URL for more information:[248] exceptions: An non-exception application exit occured.

[248] exceptions: whilst in a serial region
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints[248] exceptions: Task had pid=65169 on host nid06043

[248] exceptions: Program is "/home/d05/cwilliams/cylc-run/u-aw739/share/fcm_make_um/build-atmos/bin/um-atmos.exe"
?  Error from processor: 243Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked

I have taken a look at the link above, and will admit that I don't entirely understand what the problem is or how to resolve it. I could try following instructions on how to investigate this, if you think this would help?

Many thanks,

Charlie

comment:93 Changed 4 months ago by luke

Hi Charlie,

This may or may not be to do with the UKCA files. How soon after the switch-over did the crash happen? If soon after the switch-over then you could try doing a new run with these included from the start.

I'm afraid that I haven't seen this error for a while, and I'm not sure how to progress with it further. The page pointed to, i.e.

https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints#NorthSouthhalosHalostoosmallforadvection

has some suggestions as to how to debug.

Thanks,
Luke

comment:94 Changed 4 months ago by charlie

Hi Luke,

I resubmitted the suite (with the new files) while it was halfway through its four-year cycle, and it finished that cycle and began the next, again getting about halfway through. So it didn't fail soon after the switchover at all.

Charlie

comment:95 Changed 4 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

Charlie

This ticket has become too unwieldy, addressing multiple issues - we'll close it now. We can refer to it if needed in new queries.

Grenville

comment:96 Changed 4 months ago by charlie

Many apologies.
