Opened 6 weeks ago

Last modified 10 days ago

#2703 accepted help

Submitting CMIP6 preindustrial control

Reported by: charlie
Owned by: ros
Priority: highest
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi,

I realise you are all very busy but this is quite urgent. I am trying to copy and run the CMIP6 preindustrial control, with a view to eventually modifying it to run the CMIP6 Eocene experiment.

Firstly, I have copied what I think is the correct suite: u-ar766/u-ar777@89348. There were several revision numbers associated with this suite, but this was the latest. Can you confirm that this is indeed the latest and most kosher version of the CMIP6 preindustrial control?

My suite is now u-be040.

Secondly, I am having problems getting it to run, even before I have started making any changes (other than the submitting machine). It is failing at the various fcm stages, which I suspect means I have forgotten something in terms of the host, or machine, or similar. I thought I had done everything correctly, matching another suite of mine which works fine, but clearly something is wrong. I am running this on nexcs-n02.

Please can you help?

Charlie

Change History (23)

comment:1 Changed 6 weeks ago by ros

Hi Charlie,

In answer to your first question, you will need to contact the owner of suite u-ar766, which is Till Kuhlbrodt.

Please confirm first with Till, and then we will take a look at the errors you have got once you have confirmed you are running the right suite.

Regards,
Ros.

comment:2 Changed 5 weeks ago by charlie

Hi Ros,

Sorry for the delay with this. I have now tracked down the actual PI control, which was a slightly different revision from the one mentioned above. My suite is now u-be195 (@100708), which is a direct copy of Till's PI control (u-ar766@86126).

To begin with, I am just trying to get it to run, before I start making changes. I have changed what I thought I needed to, i.e. the project code, the machine and the queue, but I am still getting the same problem as above - fcm_make_drivers is giving me a submit-failed error, and other tasks are just waiting (although certain tasks, e.g. fcm_make_um and install_ancil, have surprisingly succeeded). Please can you help?

Thanks,

Charlie

comment:3 Changed 5 weeks ago by ros

Hi Charlie,

This suite is not set up with Monsoon/NEXCS-specific settings, so it is trying to use Slurm, which is not available. Having done a quick diff against the Monsoon/Met Office settings in a similar suite, it should just be a simple edit of the site/meto_cray.rc file to change slurm to background in the [[EXTRACT_RESOURCE]] section, so that you have:

    [[EXTRACT_RESOURCE]]
        [[[job]]]
            batch system = background
            execution time limit = PT5M
        [[[directives]]]
            --mem=1G
            --ntasks=1

Give that a go - that will fix the problem with fcm_make_drivers. Once that's completed the other tasks will begin to run.

I'm now on holiday till the New Year. Another member of the team will pick this up in my absence if you are still having problems.

Regards,
Ros.

Last edited 5 weeks ago by ros

comment:4 Changed 5 weeks ago by charlie

Very many thanks Ros, I did wonder where 'slurm' was coming from, because I have never seen that before. I have now done as you suggested and have just resubmitted.

However, I don't understand why the suite was not set up with Monsoon/NEXCS specific settings? I copied it directly from the owner, Till (using the correct revision) so where would he have run this, if not Monsoon?

Anyway, I have now done as you suggested, but it has failed a little further on, during the fcm_make_um stage, and has given me the following error:

[FAIL] ftn-2136 crayftn: ERROR in command line
[FAIL]   Unable to obtain a Cray Compiling Environment License.
[FAIL] compile    0.0 ! pc2_hom_arcld.o      <- um/src/atmosphere/large_scale_cloud/pc2_hom_arcld.F90
[FAIL] link      ---- ! um-atmos.exe         <- um/src/control/top_level/um_main.F90
[FAIL] ! pc2_hom_arcld.o     : update task failed
[FAIL] ! um-atmos.exe        : depends on failed target: pc2_hom_arcld.o

[FAIL] fcm make -f /projects/nexcs-n02/cwilliams/cylc-run/u-be195/work/18500101T0000Z/fcm_make_um/fcm-make.cfg -C /var/spool/jtmp/2987518.xcs00.cBchiR/fcm_make_um.18500101T0000Z.u-be195zpspXX -j 6 --archive # return-code=2
2018-12-19T14:15:52Z CRITICAL - failed/EXIT

What does this mean? I realise you are away, but hopefully somebody else will pick this up. Hope you have a lovely holiday,

Charlie

comment:5 Changed 5 weeks ago by grenville

Charlie

Till must have run this on the internal MO machine. You may just have to try again to get a compiler license. If that problem persists, it's one for Monsoon to address.

Grenville

comment:6 Changed 5 weeks ago by charlie

I have already submitted my suite 3 times, because I recognised that error about the licence, having seen it before. Last time it worked on the 2nd attempt. This time, however, I have tried 3 times this morning and got the same error each time. Does that imply it's a problem at their end, not mine?

comment:7 Changed 3 weeks ago by charlie

Hi,

Happy New Year to you all.

I was very much hoping we could get back to this as a matter of urgency, as I really need to get this suite running. It would appear that the licence error above has now resolved itself, because the fcm_make_um stage succeeded this time. However, it is now falling over at fcm_make_ocean with:

WARNING:
This computer is provided for the processing of official information.
Unauthorised access described in Met Office SyOps may constitute a criminal offence.
All activity on the system is liable to monitoring.
[WARN] symlink ignored: svn://puma.nerc.ac.uk/nemo.xm_svn/trunk/NEMOGCM/CONFIG/SHARED/1_namelist_ref@5518
CPU time limit exceeded
2019-01-03T12:18:49Z CRITICAL - failed/XCPU

I'm guessing this means that the suite is currently not giving this task enough time, so where do I need to increase the time limit?

Many thanks,

Charlie

comment:8 Changed 3 weeks ago by ros

Hi Charlie,

Have you tried this more than once? It looks like a dodgy connection to PUMA rather than an execution timeout.

Happy New Year.

Regards,
Ros.

comment:9 Changed 3 weeks ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

P.S. If you do need to change the time limit for the extracts, it is set in the site/meto_cray.rc file under [[EXTRACT_RESOURCE]].
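
For example, a minimal sketch of that change in site/meto_cray.rc (the PT10M value is purely illustrative, not a recommended setting):

    [[EXTRACT_RESOURCE]]
        [[[job]]]
            batch system = background
            # ISO 8601 duration; PT10M shown here only as an example of a longer limit
            execution time limit = PT10M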

comment:10 Changed 3 weeks ago by charlie

Okay, I have just tried it a 2nd time, and this time it has succeeded. Do you know why this would have happened? It can't be a connection problem to PUMA, because I am not using it - I am doing all of this on NEXCS (exvmsrose).

Anyway, it has now run, but failed at the recon saying it can't find the atmosphere restart dump, /data/d01/ukcmip6/Restarts/u-aq853/aq853a.da25940101_00. This is indeed correct, because it isn't there - presumably this is because Till originally ran this on the internal MO machine? Do you know where the equivalent dump would be on NEXCS? I have had a hunt around, but can't see anything remotely similar to this directory.

Charlie

comment:11 Changed 3 weeks ago by ros

Hi Charlie,

The source code is currently extracted from the MOSRS mirrors on PUMA for logistical reasons, as per the path in the error message you posted.

I can't see an equivalent of that restart directory either. I'm just making a couple of enquiries, and if nothing comes of them I will copy the u-aq853 directory over for you.

Cheers,
Ros.

comment:12 Changed 3 weeks ago by charlie

Many thanks.

comment:13 Changed 2 weeks ago by ros

Hi Charlie,

I've copied the files over for you. They are in my directory /projects/nexcs-n02/rhatcher/for_charlie. Please copy them to your space.

Regards,
Ros.

comment:14 Changed 2 weeks ago by charlie

Thanks very much Ros, and sorry for the delay. I have now copied those files to my space.

I have now resubmitted, and although it got past the recon stage, it failed at the coupled stage and is giving me the error below:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab, omg is NaN
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 821
?  Error number: 11
????????????????????????????????????????????????????????????????????????????????

I have seen this error before many times, and it usually means that NaNs are being created because one of the ancillary files is not right. I don't entirely understand this, however, because I have already tried using my ancillary files (which have all been modified) in another suite, which ran fine.

Can you possibly advise on how I can find out which one is causing the blowup? If you don't know, I will contact my Met Office collaborator who helped me get my other suite (with the same files) working.

Many thanks,

Charlie

comment:15 Changed 13 days ago by charlie

Hi Grenville,

My goodness, that's so frustrating! The issue arose because, when it comes to setting the emissions, I do it directly within app/um/rose-suite.conf rather than within the GUI. I therefore didn't see the error, and hadn't spotted that the line ukca_em_dir='' was missing from this file, even though it was included in my other suite's version of the file. So I simply pasted my files into ukca_em_files without realising that this first line was missing.
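
For context, a rough sketch of the kind of fragment involved - the namelist section name and file names below are assumptions for illustration, not the real CMIP6 settings:

    # Sketch only: section name and file names are placeholders
    [namelist:run_ukca]
    ukca_em_dir=''
    ukca_em_files='my_emiss_file1.anc','my_emiss_file2.anc'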

So annoying - why can't suites be more consistent?!

I have now resubmitted the suite but using my new ancillaries (all modified to use a new land sea mask), so will let you know what happens… As I said in my last message above, I have already successfully run a test using all of these new ancillaries, and it ran fine.

Charlie

comment:16 Changed 13 days ago by charlie

Right, very disappointing - my new suite has failed again at the same point, roughly 6 minutes into the coupled stage, giving the same error (below)

????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab, omg is NaN
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 756
?  Error number: 11
????????????????????????????????????????????????????????????????????????????????

As I said in a previous message above, in my recent experience this usually means one of the ancillary files is wrong. However, as I said above, I have tested all of these ancillary files in another suite, and it ran perfectly well. It took me a long time to get them all to work, of course, but I finally did. The suite which works (u-bc205) should be exactly the same version/configuration etc; it just isn't the kosher CMIP6 preindustrial control. My current suite (u-be195), which doesn't work, is. The ancillary files should be identical. Please can you help?

Thanks,

Charlie

comment:17 Changed 13 days ago by grenville

Charlie

At this stage, I can only suggest switching on extra diagnostic messages in both u-bc205 and u-be195 to see where the models may differ; if that yields nothing, then make a comparison of the suites.

Grenville

comment:18 Changed 12 days ago by charlie

Hi Grenville,

Okay, I have now done this, i.e. I have resubmitted both suites (u-be195 and u-bc205) with extra diagnostics turned on. As before, u-be195 failed at the same point (5 minutes in), with a "retrying in PT3H" message but with the same error message as above in job.err. In contrast, u-bc205 is running fine and has been doing so for almost an hour now, so I'll kill it.

Where would I need to look in the extra diagnostics output to find out why one is failing when the other isn't?

As I said, they are both using identical ancillary files. I don't know exactly what the origins of u-bc205 are, because I copied it from the parent suite of Peter's LGM suite. It is starting in 1978, so I'm guessing it is some sort of modern coupled simulation. The reason I changed to u-be195 was that we decided we need to begin as close to the official CMIP6 preindustrial run (u-ar766) as possible, which I therefore copied to become u-be195. Differencing the suites would therefore be difficult, because they are likely to be very different in their setup.

Clearly, though, one of these differences is causing the blowup. As I said, in my recent experience this type of failure is normally due to something being wrong in one of the ancillaries, e.g. a mismatch of the mask or similar. It cannot be this, however, because the ancillaries are identical. The restart dump differs, of course, but why would this cause a blowup? If it were a mask issue with the dump, then u-bc205 would also fail, because although its dump is different to the one in u-be195, it is still using a modern mask whereas the ancillaries use an Eocene mask?

Charlie

comment:19 Changed 11 days ago by ros

Hi Charlie,

I have run u-be195 (before you made any changes) and that runs fine. I have also run it using your reconfigured start dump and that is ok too. So it is definitely pointing to a problem with one of the ancillary files. I think you are going to have to track down the offending file by a process of elimination.

Regards,
Ros.

comment:20 Changed 11 days ago by charlie

But then how is u-bc205, which is a very similar setup, working fine with exactly the same ancillaries? Surely there must be something in u-be195, which is not present in u-bc205, that is conflicting with the ancillaries?

comment:21 Changed 11 days ago by grenville

Charlie

Comparing the suites indicates a large number of differences - I daresay many of those differences are not problematic, but "very similar" clearly is not similar enough in this case. Ros's work is pointing to the new ancillary files - only by trial and error will the culprit be revealed. You could try simply changing one file at a time, or try a binary search for the bad file (there may be more than one, of course) - that might speed up the search. Reducing the run time you request should also help your jobs get through the queues more quickly.

Grenville

comment:22 Changed 11 days ago by charlie

Sorry Grenville, what is a "binary search"?

Yes, I see what you mean. The only way, clearly, is for me to replace each one in turn with the standard version, resubmitting in between. But, as you say, there might be more than one problematic file, so how would that be revealed this way?

At the moment, I am cleaning the suite each time it fails and restarting (i.e. building) from scratch, which is obviously adding an extra 20 minutes each time I submit. Given that I am just swapping ancillaries in and out, is it possible for me to skip this step? If so, do I just turn off Build UM, Build Ocean and Reconfiguration?

Charlie

comment:23 Changed 10 days ago by grenville

Charlie

No need to rebuild or reconfigure, so yes, just turn them off.
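
For reference, a hedged sketch of the sort of rose-suite.conf switches this usually corresponds to - the variable names below are assumptions based on typical coupled UM suites, not checked against u-be195, so use the names your suite actually defines:

    # Sketch only: switch names are assumed, not taken from u-be195
    [jinja2:suite.rc]
    BUILD_UM=false
    BUILD_OCEAN=false
    RECON=false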

Binary search would say: replace ½ of the working ancils with your versions - if that works, you've halved the problem - then replace ½ of the remaining original ancils with your versions & repeat. If the 1st step fails, you know that at least one of your ancils is the problem - in that case revert ½ of those files to the originals….

Hopefully by doing this, you'll narrow down the problem. If there are multiple bad files, you may need to iterate the procedure as problems are identified.

There's no silver bullet, though.

Grenville
