Opened 2 weeks ago

Last modified 8 hours ago

#2464 new help

Postprocessing for AMIP suite on NEXCS

Reported by: charlie Owned by: um_support
Priority: normal Component: UM Model
Keywords: Cc:
Platform: NEXCS UM Version: 10.7

Description

Hi again,

Right then, a totally different question this time and nothing to do with JASMIN, hence the new ticket.

I am almost ready to begin running some new atmosphere-only experiments (GA7.1), using the following suite: u-aw739. This is a modern AMIP suite, which has been modified by myself and others to contain Eocene (~50 Mya) ancillary files. It is currently on Monsoon (or is it Monsoon2, or NEXCS?), not PUMA.

I note that, at present, postprocessing is set to put everything on MASS under :crum; however, from my previous conversations with you, I gather I don't want to do this? The project account I am using is n02-nexcs, which is where we have allocated resources.

Therefore, please would you advise on what exactly I need to do to set up where my output (both immediate and post-processed) goes?

Many thanks,

Charlie

Change History (29)

comment:1 Changed 11 days ago by grenville

Charlie

The details are presented here:

http://cms.ncas.ac.uk/wiki/Docs/PostProcessingApp

I'd urge you to set this up and test it thoroughly before setting off any long runs.

Grenville

comment:2 Changed 9 days ago by charlie

Hi Grenville,

Many thanks, I will work through these instructions tomorrow. Before I do, though, a couple of questions:

Firstly, how do I know whether I need to upgrade to the ARCHER/NEXCS postprocessing release version? As I said, my suite is currently a GA7.1 suite and is archiving data to MASS, so do I still need to go through this upgrade process?

Secondly, the instructions say to change the archive_root_path to /projects/nexcs-n02/<username> but I'm not sure I have a username here yet? What do I need to do to get an account here?

Lastly, in terms of transferring the data to JASMIN, the instructions say to use one of the group workspaces on JASMIN. I have now been granted access to the nexcs workspace, so is this the one I use? If so, how much storage is available to me here? To begin with, I will just be running a couple of test runs (e.g. a year, just to see if everything works) but imminently I will be running some longer simulations - at least several, each of which will be 50 years or so in length. Do I need to worry about filling up the space?

Thanks,

Charlie

comment:3 Changed 8 days ago by ros

Hi Charlie,

The ARCHER/NEXCS postprocessing went in at postproc_2.2, which is not tied to any UM version/configuration. So if your suite is not using this version, you will need to upgrade. Look at the fcm_make_pp/postproc apps and you will see which version is being used by the path to the metadata. If that makes no sense, just run the upgrade command as instructed; if your suite is already at the correct version it will do nothing.
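For example, from the top of your suite directory, something like this will show the version in use (a sketch - the exact metadata paths vary between suites):

grep meta= app/fcm_make_pp/rose-app.conf app/postproc/rose-app.conf

A path ending in postproc_2.2 means the app is already at the right version.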

If you have authorisation to use the nexcs-n02 budget then you can create a projects directory /projects/nexcs-n02/<username>.
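For example:

mkdir -p /projects/nexcs-n02/<username>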

As to your last question, I don't know what space is available under JASMIN nexcs workspace. I'll let Grenville answer that one when he returns tomorrow.

Regards,
Ros.

comment:4 Changed 8 days ago by charlie

Thanks Ros.

I have now worked through the upgrade process (because previously it was set to version 2.0) and, much to my surprise, both apps have upgraded successfully. I have also successfully created my directories, on both NEXCS and the JASMIN NEXCS group workspace.

So, I have now followed the rest of the instructions at the above website, including filling in the JASMIN transfer section. One question here: what should the remote host be - jasmin-xfer1.ceda.ac.uk or jasmin-xfer2.ceda.ac.uk? I know that the first is open to everybody (but is slower) and the second requires registration, so is it already registered from NEXCS or do I need to do something here myself?

So, now that this is done, I am about to begin a 2-year test run. Before I do that, however, is there an ideal processor decomposition for a GA7.1 suite on NEXCS (I know you did lots of testing for ARCHER, but what about NEXCS)? Or is it just a case of trial and error? At the moment I have the following:

Reconfiguration
East-West = 4, North-South = 8, OpenMP = 1, Hyperthreads = 1

Atmosphere
I/O server = 0, East-West = 16, North-South = 28, OpenMP = 1, Hyperthreads = 1

This sounds about right to me (although it is less than for my GC3.1 suite), or do you think it should be higher?

Charlie

comment:5 Changed 8 days ago by charlie

Further to my last message, I have just read the page at http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppNexcsSetup - I haven't done these steps yet, but would you be able to talk me through step 3 here, as the example is for a coupled suite whereas my suite is atmosphere only? Also, at the bottom of this page it says that "some setup is required to enable non-interactive authentication from NEXCS to JASMIN" and that I should email you about this, so please would you advise?

comment:6 Changed 8 days ago by ros

Hi Charlie,

You can apply for access to the jasmin-xfer2 here:

http://www.jasmin.ac.uk/services/high-performance-data-transfer/

NEXCS is whitelisted so don't worry about supplying an IP address in the application.

ARCHER and XCS are similar machines so the same setup should be fine - just remember that there are 32 cores per node on XCS, so make sure the decomposition is divisible by 32 rather than 24.
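A quick sanity check from the command line, using your current atmosphere decomposition (16 x 28, 1 OpenMP thread, no I/O servers) as an example:

echo $(( (16 * 28 * 1) % 32 ))

If that prints 0, the processor count fills whole 32-core nodes.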

Please give me write access to your suite and I'll add the relevant stuff into the .rc files, as this suite is set up totally differently from the example. Add the following to the top of the rose-suite.info file:

access-list=charliewilliams rosalynhatcher

And then fcm commit the suite.
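That is, edit ~/roses/u-aw739/rose-suite.info to add the line above, then:

cd ~/roses/u-aw739
fcm commit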

I'll send you the instructions for setting up the authentication from NEXCS to JASMIN by email.

Cheers,
Ros.

comment:7 Changed 8 days ago by charlie

Hi Ros,

Excellent, very many thanks.

Firstly, I already have access to the high-performance data transfer, because I applied for it several months ago. However, this wasn't from NEXCS, but rather from our home server (um1), because at the time I was transferring data from our home servers to JASMIN for archiving. Under my services on the JASMIN website, I am still registered for this service (and will be until 2027!) so do I need to apply again but from a different machine?

Secondly, I have changed the processes as you advised, so that they are now

Atmosphere
I/O server = 6, East-West = 24, North-South = 28, OpenMP = 2, Hyperthreads = 1

I thought this was OK, because 24*28 = 672, and 672/32 = 21. Or do I need to multiply all of the processes together, in which case 6*24*28*2*1 = 8064, and 8064/32 = 252. So either way they are divisible by 32.

Lastly, I have now added that line into my rose-suite.info and have committed the suite, so you should be able to access it.

Very many thanks,

Charlie

comment:8 Changed 8 days ago by ros

Hi Charlie,

If you already have access to the high-performance data transfer node, you don't need to do anything else, as NEXCS is already registered with JASMIN.

Yes, you multiply all the processors together.

Yes, I can confirm I now have write access to your suite. I'll let you know when I'm done.

Cheers,
Ros.

comment:9 Changed 8 days ago by ros

Hi Charlie,

I've modified the suite to add pptransfer to the graph. It was simpler than it looked! Give that a whirl and we'll see what happens.

Cheers,
Ros.

comment:10 Changed 7 days ago by charlie

Hi Ros,

It was too good to be true to expect my suite to run straight away - it failed yesterday evening at the reconfiguration stage. I have taken a look at the error file and, as always, there are lots of warnings (including the best error I have ever seen: "info: Terminal type `dumb' is not smart enough to run Info"), but I'm not sure what the problem is. At the end of the file is:

lib-4611 : UNRECOVERABLE library error

Missing opening (left) parenthesis in format.

Encountered during a sequential formatted WRITE to an internal file (character variable)
Application 27382592 is crashing. ATP analysis proceeding…
atpFrontend.exe: main: retrieveRawMBT:: recv of BT_HERE_IS_BACKTRACE failed

atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 06081] [c3-2c2s0n1] [Thu May 17 18:03:35 2018] PE RANK 0 exit signal Aborted
[NID 06081] 2018-05-17 18:03:35 Apid 27382592: initiated application termination
[FAIL] um-recon # return-code=134
2018-05-17T18:03:36Z CRITICAL - Task job script received signal EXIT

But what does this mean?

Charlie

comment:11 Changed 7 days ago by ros

Hi Charlie,

I'm guessing this is caused by a badly formatted write statement somewhere in the UM where it's trying to write an error message about a missing ancillary file. Unfortunately ATP didn't give a backtrace, so I can't easily see where. :-(

However, if you look in the job.out file at the bottom you will see that the suite failed to find the ancillary file: /projects/ukesm/wilro/cylc-run/u-ar826/share/data/n96e_orca025_go6/orography/globe30/qrparm.orog. Hopefully fixing the path to this ancillary file will get past the error message above.

This suite is set up dangerously in my opinion, as it relies on files in someone else's cylc-run directory. If they rerun their suite, the files are likely to change. I would suggest talking to the owner of the suite that you copied, to (a) find the file that is missing and (b) check whether there is a better location you should be referencing.

Regards,
Ros.

comment:12 Changed 7 days ago by charlie

Thanks Ros. I was actually aware of this dodgy setup, because I would always prefer to have all of the non-standard ancillary files in my own directory, so I know exactly where they are and what's happening to them. I spoke to the original owner about this a while ago, about whether it was worth me copying all of his modified ancillaries (which are needed in the suite), and he thought that wasn't necessary because he wouldn't change any of them. However, perhaps I should, just in case.

Presumably it's just a case of looking in the ancillary versions file, seeing which ones are in his directory (i.e. which ones he has modified) and copying these? I don't need to move all the ancillaries, do I (i.e. the ones he hasn't touched, which are still in the standard directories)?

comment:13 Changed 7 days ago by ros

Presumably it's just a case of looking in the ancillary versions file, seeing which ones are in his directory (i.e. which ones he has modified) and copying these? I don't need to move all the ancillaries, do I (i.e. the ones he hasn't touched, which are still in the standard directories)?

Yes, that should be right.
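A quick way to find everything still pointing at his space is to search the whole suite, e.g.:

cd ~/roses/u-aw739
grep -rn "/projects/ukesm/wilro" .

Each match is a path you'll want to repoint at your own copy.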

Cheers
Ros.

comment:14 Changed 7 days ago by charlie

Okay, many thanks. I have now copied all of the non-standard ancillary files (i.e. the ones he modified) into my own directories, and have changed the locations in the ancillary versions file. The problem was that it was looking for qrparm.orog: although the file was present in his directory, its name had been changed slightly to qrparm.orog_1. So I have copied this one into my own directory, along with all the others, and have changed the name back to qrparm.orog, so hopefully it will find it okay.

I have now re-submitted my job, so will let you know what happens…

comment:15 Changed 6 days ago by charlie

Hi Ros,

New error this time, at the atmos_main stage. I don't seem to have an error file or a log file this time; however, within the activity log I have found the following problem:

(xcs-c) 2018-05-18T21:48:44Z [STDERR] qsub: Job violates queue and/or server resource limits
[(('event-mail', 'submission failed'), 7) ret_code] 0

Is this the reason it has failed, and if so what does this mean? I thought I was using the correct project code and the correct queue, but perhaps not.

Charlie

comment:16 Changed 5 days ago by grenville

Charlie

this is the problem:

#PBS -l walltime=18000

#PBS -q normal

the normal queue only allows jobs up to 14400 seconds - please see the Queues section of https://collab.metoffice.gov.uk/twiki/bin/view/Support/NEXCS
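i.e. the directives need to end up as something like:

#PBS -q normal
#PBS -l walltime=14400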

Grenville

comment:17 Changed 4 days ago by charlie

Many thanks, and sorry, I didn't realise that. I have now changed my wall time to 4 hours (14400 seconds) and resubmitted, so fingers crossed.

I changed this within the rose editor, but am I correct in thinking that it is also specified in ~/roses/<suiteID>/rose-suite.conf? If so, what are the other versions of this file for, e.g. rose-suite.conf_archer, rose-suite.conf_meto_cray, rose-suite.conf_monsoon, etc.? Or are these just included as standard, depending on which machine I am on? Given that I am running it on NEXCS, which one do I want?

comment:18 Changed 4 days ago by ros

Hi Charlie,

Yes they are the same thing. The rose editor is just a GUI for editing the various .conf files so you could have just changed the walltime in the rose-suite.conf file if you wanted.

The other versions of this file are just examples of the rose-suite.conf file for other platforms. So in theory, if you changed to run on ARCHER for example, you could copy the rose-suite.conf_archer file to rose-suite.conf to save manually changing some of the standard platform settings, but I wouldn't like to say how up-to-date they actually are! Anyway, you can ignore these for what you are doing.
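For example, if you ever did move the suite to ARCHER:

cd ~/roses/u-aw739
cp rose-suite.conf_archer rose-suite.conf

(and then review the settings rather than trusting them blindly).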

Cheers,
Ros.

comment:19 Changed 3 days ago by charlie

Hi, I left my suite running last night, but having checked today everything seems to be frozen. Is there something wrong with Monsoon today? I don't think I missed a maintenance email, but perhaps I did. Either way, the cylc window has frozen, and if I try to log out and log in again, to either xcs or exvmsrose, it just freezes. So I have no way of knowing if my suite ran, or indeed how far through its two-year test run it got.

What I can tell, however, is that nothing has been transferred to JASMIN, to my specified nexcs area.

Charlie

comment:20 Changed 3 days ago by ros

Hi Charlie,

Every Tuesday 9:00-11:00 is reserved as a maintenance window for updates and patches; there may be unadvertised disruption at these times.

Cheers,
Ros.

comment:21 Changed 3 days ago by charlie

Thanks Ros, and sorry, I didn't realise that. I will give it half an hour and then try again. But, as I said, something isn't quite right, because surely if my two-year run had worked, it would have transferred the output to JASMIN? It hasn't.

comment:22 Changed 3 days ago by charlie

  • Didn't realise that

comment:23 Changed 3 days ago by ros

Hopefully you've just seen the Yammer alert that's been posted:

"Issues with XCSC - Tuesday 22nd May 2018 11:34
There seems to be an issue with logging onto XCSC and ROSE.
This is being investigated."

comment:24 Changed 3 days ago by ros

Hi Charlie,

Sorry I only added the relevant bits to the *.rc files (points 3 & 4); I thought you had done points 1 & 2 - that would be why it hasn't worked!

Please do points 1 & 2 to add in the controlling PPTRANSFER variable:
http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppNexcsSetup

When you submit the suite you should see a separate pptransfer task in the GUI, which will run after postproc. If you don't, then the suite is still not set up correctly.
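As a rough sketch of what points 1 & 2 add (the wiki page above has the exact lines): point 1 puts the controlling variable in rose-suite.conf, and point 2 gates the extra graph edge on it, something like:

PPTRANSFER=true

{% if PPTRANSFER %}
    postproc => pptransfer
{% endif %}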

Cheers,
Ros.

comment:25 Changed 34 hours ago by charlie

Hi Ros,

Sorry for the delay, I was in Bristol all day yesterday.

I have now completed steps 1 & 2 as instructed, no problem.

Having looked at my output on NEXCS (at ~/cylc-run/u-aw739/share/data/History_Data), however, although there is some output, the model has clearly not run for 2 years. There is only one month's worth of data, and it is August. So, two questions:

Firstly, given that my model basis time is 19880901 i.e. September, why is my first output month August? I have tried looking at the actual start dump within ainitial, but this is pointing to the environment variable '$AINITIAL_N96' - where is this defined, so I can look at the actual start dump?

Secondly, what caused it to fall over? I don't seem to have any output or error log files (i.e. ~/cylc-run/u-aw739/log/job/<cycle>/<app>/NN/job.out or job.err); in fact I don't seem to have a ~/cylc-run/u-aw739/log directory at all.

Am I missing something?

Charlie

comment:26 Changed 34 hours ago by charlie

Sorry, further to my last message, I do have a ~/cylc-run/u-aw739/log/job/<cycle>/<app>/NN/job.out and job.err.

comment:27 Changed 28 hours ago by grenville

Much of the data has been post-processed - please look in /home/d05/cwilliams/cylc-run/u-aw739/log/job/19900301T0000Z/postproc/01/job.out - data has been converted and moved, for example:

Archiving /home/d05/cwilliams/cylc-run/u-aw739/share/data/History_Data/aw739a.da19900301_00 to /projects/nexcs-n02/cwilliams/sweet/ga71/u-aw739/19900301T0000Z

What does the cylc window show? It appears that the model has run successfully. There are 4 six-month cycles in /projects/nexcs-n02/cwilliams/sweet/ga71/u-aw739.

You can simply grep -r in the roses suite directory to find $AINITIAL_N96
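For example:

cd ~/roses/u-aw739
grep -r AINITIAL_N96 .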

I am not quite sure why there are only two cycles in the log directory (there are 4 in the work directory).

Grenville

comment:28 Changed 28 hours ago by charlie

Thanks Grenville, and sorry, I temporarily forgot the output went here. So do you think, given that I ran it for 2 years starting in September 1988, that it ran successfully? If so, presumably now that I have turned on PPTRANSFER, it should automatically send all of this output over to my specified directory on JASMIN? Does it automatically remove the output on NEXCS, or do I need to do this manually?

And yes - why are there 4 cycles in this directory (which is expected, all containing 6 months), but only 2 in the log directory? And why is there only one month in each, August for the first, for example?

Lastly, in my output work directory, there appear to be all the usual output files, e.g. daily etc., but there don't appear to be any monthly (*pm) or annual (*py) means. I thought I had these being output, based on the standard STASH table I am using, but perhaps not?

comment:29 Changed 8 hours ago by grenville

Charlie

Your climate meaning sequence is 3,3,3,4 with 30-day dumping; for 30-day dumping it should be 1,3,4,10 (monthly, seasonal, yearly, decadal).
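If it helps to find where that lives: the meaning sequence and dump frequency sit in the UM's nlstcgen namelist, so in the um app it should look something like this (variable names from the UM namelist; values illustrative, so check your own app):

[namelist:nlstcgen]
dumpfreqim=30
meanfreqim=1,3,4,10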

Please try that.

Grenville
