Opened 6 weeks ago

Last modified 3 weeks ago

#2244 accepted help

GC3.1 on Archer/Monsoon

Reported by: ajd Owned by: ros
Priority: normal Component: Coupled model
Keywords: Cc:
Platform: ARCHER UM Version: 10.7

Description

Hello,

I will be running a large ensemble with N96-ORCA1 at GC3.1, with EASY-AEROSOL instead of GLOMAP. We will be running the ensemble on Monsoon but at this stage I am trying to set up a test run on Archer (since we don't have access to Monsoon yet), using a suite from Till Kuhlbrodt which is set up to run at the Met Office (u-am927).

Is it fairly straightforward to get this to run on Archer? If so what steps do I need to follow? We will want to use the CMIP6 version of HadGEM3 for the large ensemble, but it would be useful for us to test the model in a technical sense beforehand using this suite.

Is this do-able?

Many thanks.

Kind regards,
Andrea

PS: this is related to Chris Smith's (Leeds) atmosphere-only runs at GC3.1 which will produce the easy-aerosol input files for our runs

Attachments (1)

Screenshot 2017-08-15 12.05.02.png (91.3 KB) - added by ajd 5 weeks ago.
error message

Change History (30)

comment:1 Changed 6 weeks ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

comment:2 Changed 6 weeks ago by ajd

Hi Ros,

Just to add some more detail, this is also related to Jane Mulcahy's email re GA7.1 on Archer for Chris Smith. We want to make sure the atmosphere-only and coupled simulations are run at the same model version.

Thanks,
Andrea

comment:3 Changed 6 weeks ago by ros

Hi Andrea,

Till's suite has the site independent setup in it so all the cylc definition changes for ARCHER can be flicked on with a switch, which makes things a lot easier. It would mainly just be picking up the correct Archer UM configuration files, moci modules, replacing central Met Office paths with Archer equivalent ones and copying over any non-standard input data. So, yes, it is certainly doable. I'm happy to advise on appropriate settings should you decide to go ahead with this.

Cheers,
Ros.

comment:4 Changed 6 weeks ago by ros

Sorry, forgot to ask how long you think it will be before you get Monsoon access? Are you just waiting for accounts to be created?

Cheers,
Ros.

comment:5 Changed 6 weeks ago by ajd

Hi Ros,

Excellent news, thanks. Yes, it would be great if we could try to get this to run on Archer. Would you have some time tomorrow to briefly discuss the changes needed in person, or is it better to do it through this ticket? Regarding Monsoon access, I applied for a user account yesterday, but my understanding is that we are still waiting for the project to be approved as well. I am not sure how long that is expected to take. Ed might know - I forgot to cc him in this ticket and it seems that I can't add him now. I'll try and find out.

Cheers,
Andrea

comment:6 Changed 6 weeks ago by ros

Hi Andrea,

Is that the SMURPHS project? If so it's going through the approval process now. The information I have here is that it is due to start in October.

Yes, I'm around most of tomorrow.

Cheers,
Ros.

comment:7 Changed 6 weeks ago by ajd

Hi Ros,

Yes it's the SMURPHS project.

Cheers,
Andrea

comment:8 Changed 6 weeks ago by ros

Hi Andrea,

As discussed the start files and some of the ancil files are now on ARCHER under /work/n02/n02/ros/andrea. Let me know once you've copied them to your space so I can delete.

Cheers,
Ros.

comment:9 Changed 5 weeks ago by ajd

Hi Ros,

Thanks again for your help! I have now copied the files over. I will try to set up the test run and see if any more files are missing.

Cheers,
Andrea

Changed 5 weeks ago by ajd

error message

comment:10 Changed 5 weeks ago by ajd

Hi Ros,

I have made all the changes we discussed last week, but I am getting an error when I try to run the model. I have attached a screenshot of the error message. Do you know what the problem could be?

Thanks,
Andrea

comment:11 Changed 5 weeks ago by ros

Hi Andrea,

You'll need to change that path to use the FCM keywords so it resolves ok on all platforms. Change it to:

fcm:um.xm-tr/rose-stem/ana/mule_cumf.py@vn10.7

It'll be set somewhere in the suite conf section of rose edit.

Cheers,
Ros.

comment:12 Changed 5 weeks ago by ajd

Thanks Ros, that solved it!

I am now having another issue: I get the following error in fcm_make_drivers: [FAIL] [Errno 2] No such file or directory: '/export/puma/data-01/training/ajd/fcm_make_drivers.22131201T0000Z.u-ap1649GAwnQ'
Received signal ERR

http://puma.nerc.ac.uk/rose-bush/view/ajd/u-ap164?&no_fuzzy_time=0&path=log/job/22131201T0000Z/fcm_make_drivers/01/job.err

It seems like this is another path issue but I can't work out where this path is set and what it should point to?

Cheers,
Andrea

comment:13 Changed 5 weeks ago by ajd

Hi Ros,

Just an update on this: I worked out this happens because fast-dest-root-orig in archer.rc points to $SCRATCH which in turn points to a directory that doesn't exist. So I just need to work out where SCRATCH is defined and what it should point to instead!

Thanks,
Andrea

comment:14 Changed 5 weeks ago by ros

Hi Andrea,

You need to set SCRATCH in your ~/.profile or ~/.kshrc on PUMA so that the extracts use temporary disk on PUMA to save space:

export SCRATCH=/export/puma/data-01/ajd

I just need to ask Andy to create that scratch directory for you and you should then be good to try again.

Cheers,
Ros.
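
Before resubmitting, it can help to confirm that SCRATCH resolves to a directory that actually exists and is writable, since a missing directory is what produced the "No such file or directory" failure in fcm_make_drivers. The following is a minimal sketch and not part of the suite; the helper name check_scratch is hypothetical.

```python
import os


def check_scratch(env=None):
    """Return (ok, message) saying whether SCRATCH points to a usable directory.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to try out without changing your shell.
    """
    env = os.environ if env is None else env
    scratch = env.get("SCRATCH")
    if not scratch:
        return False, "SCRATCH is not set"
    if not os.path.isdir(scratch):
        return False, "SCRATCH points to a missing directory: {0}".format(scratch)
    if not os.access(scratch, os.W_OK):
        return False, "SCRATCH is not writable: {0}".format(scratch)
    return True, "SCRATCH is usable: {0}".format(scratch)


if __name__ == "__main__":
    ok, message = check_scratch()
    print(message)
```

Run on PUMA after editing ~/.profile (and starting a new login shell), this should report whether the extract tasks will find their temporary disk.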

comment:15 Changed 5 weeks ago by ajd

Hi Ros,

Thanks! In the meantime I have set it to /home/ajd/tmp and that seems to work.

I now have another issue: app/validate_suite_info/rose-app.conf seems to require python2.7, but this doesn't seem to be installed. Is there a workaround for this?

Thanks again,
Andrea

comment:16 Changed 5 weeks ago by ros

Hi Andrea,

Andy has now created you the SCRATCH directory.

validate_suite_info is running on PUMA if I've understood correctly so you will need to change $PATH to add python2.7 to your environment as it's not yet standard on PUMA. Add it to the task setup in suite.rc file like so:

[[validate_suite_info]]
    pre-script = "export PATH=/home/andy/Enthought/Canopy_64bit/User/bin:$PATH"
        [[[environment]]]
            ROSE_TASK_APP = validate_suite_info

Cheers,
Ros.

comment:17 Changed 5 weeks ago by ajd

Hi Ros,

Thanks. I was able to add python2.7, but it seems that this wasn't the issue. I still get an error in validate_suite_info and I don't know what the problem is.

Do you know what could be happening here?

Cheers,
Andrea

Traceback (most recent call last):
  File "/home/ajd/cylc-run/u-ap164/bin/validate_suite_info.py", line 271, in <module>
    main()
  File "/home/ajd/cylc-run/u-ap164/bin/validate_suite_info.py", line 251, in main
    warnings, errors = check_experiment(suite_info, cv_experiment_id)
  File "/home/ajd/cylc-run/u-ap164/bin/validate_suite_info.py", line 155, in check_experiment
    '').format(key, cv_key)
ValueError: zero length field name in format
[FAIL] python $CYLC_SUITE_RUN_DIR/bin/validate_suite_info.py $CYLC_SUITE_RUN_DIR # return-code=1
Received signal ERR
Last edited 5 weeks ago by ros
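
A note on the traceback above: "ValueError: zero length field name in format" is the error Python 2.6 raises when str.format is given auto-numbered fields such as '{}', which were only added in Python 2.7 and 3.1. One possibility, therefore, is that the task is still picking up an interpreter older than 2.7 despite the PATH change. The sketch below shows the portable form with explicit field indices. The function names and the message text are hypothetical and are not taken from validate_suite_info.py; only the argument names key and cv_key come from the traceback.

```python
import sys


def describe_mismatch(key, cv_key):
    """Format a message using explicit field indices.

    '{0}' and '{1}' are accepted by Python 2.6, 2.7 and 3.x, whereas
    bare '{}' fields fail on 2.6 with
    "ValueError: zero length field name in format".
    """
    return "Suite key {0} does not match controlled vocabulary key {1}".format(
        key, cv_key
    )


def supports_auto_numbered_fields(version_info=sys.version_info):
    """Return True if this interpreter version accepts '{}'.format(...)."""
    version = tuple(version_info[:2])
    return version >= (3, 1) or (2, 7) <= version < (3, 0)


if __name__ == "__main__":
    # Printing the interpreter version from inside the task shows which
    # Python is really being used.
    print(sys.version)
    print(supports_auto_numbered_fields())
```

Printing sys.version from within the task's environment is a quick way to see whether the pre-script PATH change has taken effect.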

comment:18 Changed 5 weeks ago by ros

Hi Andrea,

The only obvious thing I can see is that the path near the top of the validate_suite_info.py script needs to be changed to reflect the location of the rose installation on PUMA, which is /home/fcm/rose-2017.05.0/lib/python.

If that doesn't make any difference, we'd need to do some more debugging. However, I'm wondering if you even really need to run the validate_suite_info task. It just looks like it is checking CMIP6 metadata against the CMIP6 controlled vocabulary. If this isn't important to you, I would suggest the easiest route would be to just remove the validate_suite_info task from the graph in the suite.rc file.

Cheers,
Ros.

comment:19 Changed 5 weeks ago by ajd

Hi Ros,

Thanks. I have commented out/removed the entire line in suite.rc under 'graph' that contains validate_suite_info, is that what you mean?
That seems to work, but I now get an error in fcm_make_um:

[FAIL] file:fcm-make.cfg=source: FCM_MAKE_FILE: unbound variable
Received signal ERR

This seems to suggest FCM_MAKE_FILE is undefined. Looking at line 190 of suite.rc, FCM_MAKE_UM will only be set to main if SITE is the same as SINGLE_FCMUM, which is not the case here: SITE is Archer and SINGLE_FCMUM is set to 'meto-cray' on line 42.

So, from what I understand it should go to fcm_make2_um but it still runs fcm_make_um with FCM_MAKE_FILE undefined.

What is the best way to fix this? Should fcm_make_um still be run on Archer?

Thanks for all your help!
Andrea

comment:20 Changed 5 weeks ago by ros

Hi Andrea,

There is a mistake in the suite.rc file; FCM_MAKE_FILE needs to be set for the [[fcm_make_um]] in both the if and else. So I think you just need to add:

     [[[environment]]]
          FCM_MAKE_FILE = slash

and this will work.

Cheers,
Ros.

Last edited 5 weeks ago by ros

comment:21 Changed 5 weeks ago by ajd

Hi Ros,

Thanks again! This worked.

I next had a conflict between these two branches:
branches/dev/benjohnson/vn10.7_easyaerosol_cmip6@36305
branches/dev/nicolasbellouin/vn10.7_easyaerosol_v3@42608

If I remove the first one completely and only keep the second, fcm_make_um completes successfully but the subsequent tasks all have the following status: submit-retrying (e.g. fcm_make2_um). So far I've had neither a task complete nor an error associated with a failure to investigate.

How should I go about continuing from here?

Cheers,
Andrea

comment:22 Changed 5 weeks ago by ros

Hi Andrea,

When you get a "submit-retrying" status you will usually find an error message in the "job-activity.log", either via the cylc GUI or in the cylc-run log directory (e.g. ~/cylc-run/u-ap164/log/job/22131201T0000Z/fcm_make2_um/01/job-activity.log). In this case the problem is an invalid budget code. The budgets are case sensitive, so changing the budget to n02-SMURPHS should allow the tasks to submit successfully.

Cheers,
Ros.
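
The log location described above follows a fixed layout (suite, cycle point, task name, submit number), so it can be built programmatically when several tasks are stuck in submit-retrying. This is a small sketch under that assumption; the helper names job_activity_log and tail are hypothetical and not part of cylc.

```python
import os


def job_activity_log(suite, cycle, task, submit="01", cylc_run_dir="~/cylc-run"):
    """Build the job-activity.log path for one task submission.

    The layout mirrors the example path quoted in the comment above:
    <cylc-run>/<suite>/log/job/<cycle>/<task>/<submit>/job-activity.log
    """
    return os.path.join(
        os.path.expanduser(cylc_run_dir),
        suite, "log", "job", cycle, task, submit, "job-activity.log",
    )


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    with open(path) as handle:
        return handle.readlines()[-n:]


if __name__ == "__main__":
    path = job_activity_log("u-ap164", "22131201T0000Z", "fcm_make2_um")
    if os.path.exists(path):
        # Submission errors (such as an invalid budget code) appear here.
        print("".join(tail(path)))
    else:
        print("No job-activity.log found at {0}".format(path))
```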

comment:23 Changed 5 weeks ago by ajd

Hi Ros,

Thanks!

I've had another error after this where the path to the module file was wrong. I've had to hardcode the path to the module file and it seems ok now (but I still don't know why $UMDIR was replaced by /home/um/ on Archer despite changing my .profile on both PUMA and Archer).

The next error I get is this:

[FAIL] file:/work/n02/n02/ajd/cylc-run/u-ap164/share/data/etc/um_ancils_gl=source=fcm:ancil_data.xm_tr/ancil_versions/n96e_orca1/GA7.1/v1/ancils@4110: bad or missing value
Received signal ERR
cylc (scheduler - 2017-08-16T15:11:24Z): CRITICAL Task job script received signal ERR at 2017-08-16T15:11:24Z
cylc (scheduler - 2017-08-16T15:11:24Z): CRITICAL failed at 2017-08-16T15:11:24Z

Can you tell if there is a file missing or is this a path issue again? This is the only task that seems to have failed now before the reconfiguration step can run.

Cheers,
Andrea

comment:24 Changed 5 weeks ago by ros

Hi Andrea,

ARCHER can't see the FCM repositories so you will need to manually extract the ancillary file from the repository:

fcm co fcm:ancil_data.xm_tr/ancil_versions/n96e_orca1/GA7.1/v1/ancils@4110

Then copy it over to ARCHER and change the path in the suite accordingly.

I think you will also have a problem with the STASHmaster file for similar reasons. There is a workaround I can give you to get it to extract the STASHmaster on PUMA and then copy it over to ARCHER, but doing it manually is probably easiest, at least for now.

Cheers,
Ros.

comment:25 Changed 5 weeks ago by ajd

Hi Ros,

Thanks, the build stage is now fine and the model has reached the reconfiguration stage.

I'm missing the following files:

app/um/rose-app.conf:SPECTRAL_FILE_DIR=$CMIP6_ANCILS/any/clim_1850-1873/Solar/v3.2_picontrol_cmip6_solar_244
app/um/rose-app.conf:ancilfilename='$CMIP6_ANCILS/n96e/timeslice_1850/OzoneConc/v1/mmro3_monthly_CMIP6_1850_N96_edited-ancil_2anc'
app/um/rose-app.conf:ancilfilename='$CMIP6_ANCILS/n96e/timeslice_1850/LandUse/v2/veg.frac.n96e.orca1.v2.2x.1850'

Can you tell me if they are on Archer or, if not, would you be able to copy them over?

Many thanks!
Andrea
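
A list like the one above can be produced by scanning the app configuration for $CMIP6_ANCILS paths and testing each one on disk. The sketch below assumes the simple key=value lines shown in this comment; the function name find_missing and the ancil_root argument are hypothetical, and the real value of $CMIP6_ANCILS on Archer has to be supplied by the user.

```python
import os
import re

# Matches key=$CMIP6_ANCILS/... with or without single quotes around the value.
PATH_PATTERN = re.compile(r"=\s*'?(\$CMIP6_ANCILS/[^'\s]+)'?")


def find_missing(conf_lines, ancil_root):
    """Return the expanded $CMIP6_ANCILS paths that do not exist on disk.

    conf_lines: iterable of lines, e.g. from app/um/rose-app.conf
    ancil_root: directory to substitute for $CMIP6_ANCILS (user supplied)
    """
    missing = []
    for line in conf_lines:
        match = PATH_PATTERN.search(line)
        if not match:
            continue
        path = match.group(1).replace("$CMIP6_ANCILS", ancil_root, 1)
        if not os.path.exists(path):
            missing.append(path)
    return missing


if __name__ == "__main__":
    conf = "app/um/rose-app.conf"
    root = os.environ.get("CMIP6_ANCILS", "")
    if os.path.exists(conf) and root:
        with open(conf) as handle:
            for path in find_missing(handle, root):
                print("missing: {0}".format(path))
```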

comment:26 Changed 5 weeks ago by ajd

Hi Ros,

Actually I think I've found them!

Cheers,
Andrea

comment:27 Changed 5 weeks ago by ajd

Hi Ros,

I've managed to build the model and run the reconfiguration, but I now get the following error when trying to run the model:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: OASIS_INITA2O
? Error message:
? 1 item(s) missing from STASH coupling fields.
? Please refer to standard output for full details.
? Error from processor: 142
? Error number: 85
????????????????????????????????????????????????????????????????????????????????

I think this is likely a conflict/issue arising from trying to modify Till's job to run with easy-aerosol (following instructions from Nicolas Bellouin). I have added a STASHmaster file following Nicolas' instructions; perhaps this is where the issue is coming from (possibly not coupled)?

What is the best way to go about resolving this?

Cheers,
Andrea

comment:28 Changed 4 weeks ago by ros

Hi Andrea,

Sorry I've not got back to you on this. Is this still a problem? If so I would first suggest running the model as is, before making any changes. That way it should be easy to determine if it is the new STASHmaster file causing the problem or not.

Regards,
Ros.

comment:29 Changed 3 weeks ago by ajd

Hi Ros,

Thanks for your reply. Yes it is still a problem but I will try running the model as is first and see if my changes are causing the problem or not as you suggest. I'll get back to you once I've done this.

Cheers,
Andrea
