Opened 4 weeks ago

Last modified 32 hours ago

#2739 assigned help

Reconfiguration failure

Reported by: anmcr Owned by: willie
Priority: high Component: UM Reconfiguration
Keywords: GRIB_API Cc:
Platform: ARCHER UM Version: 11.1

Description

Hello,

I had a job that was running successfully on Monsoon, and ported it to Archer. However, when I run it I get a reconfiguration failure because GRIB_API cannot be found. The attachment shows a screenshot of the error. This is similar to ticket #2655, but I'm afraid I'm not sure how to resolve the issue. The job id is u-bf120.

Thanks in advance for any suggestions.

Best wishes,

Andrew

Attachments (10)

for_ncas_ticket.JPG (89.2 KB) - added by anmcr 4 weeks ago.
for_willie_1.JPG (105.3 KB) - added by anmcr 8 days ago.
for_willie_2.JPG (107.6 KB) - added by anmcr 8 days ago.
for_willie_1.2.JPG (105.3 KB) - added by anmcr 8 days ago.
for_willie_3.JPG (281.2 KB) - added by anmcr 6 days ago.
for_willie_4.JPG (76.1 KB) - added by anmcr 6 days ago.
for_willie_5.JPG (114.6 KB) - added by anmcr 4 days ago.
for_willie_5.2.JPG (114.6 KB) - added by anmcr 4 days ago.
for_willie_configuration_file.JPG (95.8 KB) - added by anmcr 4 days ago.
for_willie_make_steps.JPG (97.5 KB) - added by anmcr 4 days ago.


Change History (41)

Changed 4 weeks ago by anmcr

comment:1 Changed 3 weeks ago by anmcr

Hello again,

I tried to make some progress on my own, as this work is quite high priority. Unfortunately I still haven't managed to make any further headway. I was wondering whether vn11.1 is actually running on Archer?

Many thanks,

Andrew

comment:2 Changed 3 weeks ago by ros

Hi Andrew,

Sorry for the delay. Yes, UM11.1 is available; however, we don't support GRIB_API on ARCHER and haven't done so for a long while. You say that you have run this suite on Monsoon; could you copy over the reconfigured files from there?

Regards,
Ros.

comment:3 Changed 3 weeks ago by anmcr

Hi Ros,

Thanks for replying.

The reconfiguration failure is for the global N320 model. I had a look on Monsoon, and these files are held centrally at e.g. /projects/um1/ancil/atmos/n320e/land_sea_mask/igbp/v2/qrparm.mask. I think that these are the actual final versions of the files. I could copy these over to Archer. However, I'm unsure where to copy them to. Would it be somewhere like: /work/n02/n02/anmcr/cylc-run/u-bf120/share/cycle/19800701T0000Z/glm/um?

Thanks,

Andrew

comment:4 Changed 3 weeks ago by ros

Hi Andrew,

All the files under /projects/um1/ancil/atmos/n320e on Monsoon should already be on ARCHER under /work/y07/y07/umshared/ancil/atmos/n320e.

Hope that helps.

Regards,
Ros.

comment:5 Changed 3 weeks ago by anmcr

Hi Ros,

Thanks for the information.

I'm unsure where I should copy the files under /work/y07/y07/umshared/ancil/atmos/n320e on ARCHER to. Can you please advise?

I was unable to find an analogous location on Monsoon that I could copy.

Andrew

comment:6 Changed 3 weeks ago by ros

Hi Andrew,

I think we're getting confused between start dumps and ancillary files. The files under /work/y07/y07/umshared/ancil are ancil files, and you don't need to copy them anywhere; just point your suite to them. When I asked whether you could copy over the reconfigured files from Monsoon, I meant the start dumps that were created when the reconfiguration ran on Monsoon. You can then point your ARCHER suite to these files and so avoid having to run the reconfiguration on GRIB files.

Regards,
Ros.

comment:7 Changed 3 weeks ago by anmcr

Hi Ros,

If it is start dumps you mean, then I'm afraid that copying them over from Monsoon is not going to work. This will be a 35-year 'free run', run in 6-hourly (CRUN) chunks, so I assume that the reconfiguration step will also be run every 6 hours.

I had intended to do this run on Monsoon, but do not have enough usage allocation. However, I do have available MAUs on ARCHER that I can use - hence why I am trying to copy the job over from Monsoon to ARCHER.

Best wishes,

Andrew

comment:8 Changed 3 weeks ago by anmcr

Ros,

At the moment my run is being forced by ERA-Interim atmospheric fields (suite conf > jinja.suite.rc > Driving model setup > dm_ic_file), as well as SST and sea ice fields (suite conf > glm_um > Reconfiguration and Ancillary Control > Config ancils and initialse dump fields), all of which are GRIB files. If I were to convert these files to NetCDF, would the model be able to read them?

The available format types (input_dump_type) seem to be 'UM', 'GRIB', 'GRIB2FF' - so maybe not. I am unsure how I would go about converting my GRIB files to UM format.

Thanks,

Andrew

comment:9 Changed 3 weeks ago by grenville

Hi Andrew

The model doesn't read netcdf (in this context) - what is the suite id of the Monsoon job which handles grib files?

Grenville

comment:10 Changed 3 weeks ago by anmcr

Hi Grenville,

The Monsoon suite is u-be146.

Thanks,

Andrew

comment:11 Changed 2 weeks ago by grenville

Andrew

We are working on getting grib working on ARCHER.

Grenville

comment:12 Changed 13 days ago by willie

Hi Andrew,

I am trying to install the GRIB API on ARCHER, but I have encountered some problems with linking the executable that are likely to take some time to resolve. It might be quicker to run your model on Monsoon/NEXCS.

I'll let you know when I get it working on ARCHER.

Willie

comment:13 Changed 13 days ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to assigned

comment:14 Changed 12 days ago by anmcr

Hi Willie,

Thanks for the update. I am part of the 'polar' group on Monsoon, but unfortunately we have used up our allocation and Monsoon is full. That is why I want to run on Archer, where I still have 20 MAUs.

Best wishes,

Andrew

comment:15 Changed 10 days ago by willie

Hi Andrew,

I've installed the GRIB_API but I need some of your ancillaries in order to test it. I can't find

/work/n02/n02/anmcr/start_files/Antarctic_Cordex/sic_sst/19791201_00-19801231_18_sic_um.grid

Could you reinstate or create this please?

Willie

comment:16 Changed 10 days ago by willie

Also the _sst_ version too, please

Willie

comment:17 Changed 10 days ago by anmcr

Hi Willie,

Many thanks for your help with this. I realise that getting this to work has been a considerable effort.

The pathname has changed slightly, and now includes 'glm':

/home/n02/n02/anmcr/work/start_files/Antarctic_Cordex/sic_sst/glm/19791201_00-19801231_18_sic_um_grid

/home/n02/n02/anmcr/work/start_files/Antarctic_Cordex/sic_sst/glm/19791201_00-19801231_18_sst_um_grid

Best wishes,

Andrew

comment:18 Changed 9 days ago by willie

Hi Andrew,

I now have a version of u-bf120 which reconfigures the GRIB file. You need to make the following changes to your model.

In the fcm_make app, go to the Configuration file and change the config_root_path to

fcm:um.x-br/dev/williammcginty/vn11.1_NCAS_GRIB_API

Delete the config_revision.

On the Sources page, add the um_source

fcm:um.x-br/dev/williammcginty/vn11.1_NCAS_GRIB_API

If you haven't already done so, correct the ancillary filenames as in the previous comment.

Then build the code again and run it.

Willie

comment:19 Changed 8 days ago by anmcr

Dear Willie,

Thanks for all your effort with this. I made the changes you suggested in u-bf120, which I think I did correctly (though I was a bit confused about whether I should add 'um_source' as you suggested or 'um_sources', which was already in the file ../u-bf120/app/fcm_make/rose-app.conf). See the attached screenshots. I've copied the error below. I think it refers to an issue with line 3 of /home/anmcr/cylc-run/u-bf120/work/19880101T0000Z/fcm_make/fcm-make.cfg, which is the line 'extract.location{diff}[um] = $um_sources'. I'm afraid that I was unable to solve this myself, so I would appreciate it if you could advise.

Best wishes,

Andrew


anmcr@puma:/home/anmcr/cylc-run/u-bf120/log/job/19880101T0000Z/fcm_make/01> more job.err
[FAIL] /home/anmcr/cylc-run/u-bf120/work/19880101T0000Z/fcm_make/fcm-make.cfg:3: reference to undefined variable
[FAIL] include =
[FAIL] undef($config_revision)

[FAIL] fcm make -f /home/anmcr/cylc-run/u-bf120/work/19880101T0000Z/fcm_make/fcm-make.cfg -C /home/anmcr/cylc-run/u-bf120/share/fcm_make -j 4 --ignore-lock mirror.target=login.archer.ac.uk:cylc-run/u-bf120/share/fcm_make mirror.prop{config-file.name}=2 # return-code=9
Received signal ERR
cylc (scheduler - 2019-02-08T19:57:52Z): CRITICAL Task job script received signal ERR at 2019-02-08T19:57:52Z
cylc (scheduler - 2019-02-08T19:57:52Z): CRITICAL failed at 2019-02-08T19:57:52Z

Changed 8 days ago by anmcr

Changed 8 days ago by anmcr

Changed 8 days ago by anmcr

comment:20 Changed 8 days ago by willie

Hi Andrew,
You need to add the config_revision back in, but leave the box empty.
Willie
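
The "reference to undefined variable" failure above can be caught before a build is submitted. The following is a minimal sketch, assuming the three variables named in this thread are stored as plain key=value lines in app/fcm_make/rose-app.conf; the function name check_fcm_make_vars is hypothetical. An empty value such as "config_revision=" counts as defined, which matches the advice to keep the variable but leave the box empty.

```shell
#!/bin/sh
# Hypothetical helper: report which fcm_make variables are not defined in a
# rose-app.conf file. A line such as "config_revision=" (empty value) counts
# as defined; a deleted or ignored ("!key=") line does not.
check_fcm_make_vars() {
    conf=$1
    missing=0
    for key in config_root_path config_revision um_sources; do
        if ! grep -Eq "^${key}=" "$conf"; then
            echo "missing: ${key}"
            missing=1
        fi
    done
    return "$missing"
}

# Example (path is an assumption about the suite layout):
# check_fcm_make_vars ~/roses/u-bf120/app/fcm_make/rose-app.conf
```

The function only inspects the file; it does not change the suite or contact the repository.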

comment:21 Changed 6 days ago by anmcr

Hi Willie,

Thanks again for looking at this. The run worked up until the 'archive' stage of the nested model. I'm afraid that I was unable to fix it. I have attached a screen shot of the error. Are you able to advise?

Could you also please advise me on how to compute the number of AUs the run will use. Presumably this is based on the number of processors and the wallclock time.

Thanks again,

Andrew

Changed 6 days ago by anmcr

comment:22 Changed 6 days ago by willie

Hi Andrew,

The convpp program hasn't been built:

[FAIL] /home/n02/n02/anmcr/cylc-run/u-bf832/app/archive/bin/umpp: line 11: /home/n02/n02/anmcr/cylc-run/u-bf832/share/fcm_make_um_utils/um-convpp: No such file or directory

You do this on the fcm_make → Make Steps page near the bottom.

Willie

Changed 6 days ago by anmcr

comment:23 Changed 6 days ago by anmcr

Hi Willie,

Thanks again for the reply and help.

Attached is a screenshot of my 'make steps' page. All the options are already set to 'yes', and there isn't an explicit mention of 'convpp'. Could you please advise?

Also, could you please advise how to work out the AU cost of a run (given a number of processors and wallclock time)?

Best wishes,

Andrew
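
On the AU question, a rough estimate can be sketched from the node count and the wallclock time. This is only a sketch under stated assumptions: that whole nodes are charged, and that the charging rate is supplied by the user. The function name estimate_kau is hypothetical, and the values used in the example (24 cores per node, 0.36 kAU per node-hour) are assumptions that should be checked against the current ARCHER documentation before relying on the result.

```shell
#!/bin/sh
# Hypothetical estimate of the charge for one job, in kAU.
# Arguments: cores requested, cores per node, wallclock hours,
#            charging rate in kAU per node-hour (supplied by the caller).
estimate_kau() {
    awk -v cores="$1" -v per_node="$2" -v hours="$3" -v rate="$4" 'BEGIN {
        # Assumption: partially used nodes are charged as whole nodes.
        nodes = int((cores + per_node - 1) / per_node)
        printf "%.2f\n", nodes * hours * rate
    }'
}

# Example with assumed values: 480 cores, 24 cores per node, 6 hours,
# 0.36 kAU per node-hour.
# estimate_kau 480 24 6 0.36    # prints 43.20
```

Dividing the result by 1000 converts kAU to MAU, which allows a direct comparison with a remaining allocation quoted in MAUs.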

comment:24 Changed 5 days ago by willie

Hi Andrew,

In the Rose GUI, select the View drop down menu and tick "View all ignored variables" and "View Latent variables".

Willie

comment:25 Changed 4 days ago by anmcr

Hi Willie,

Thanks for your help. I have included 'convpp' in the build as you suggested.

I re-ran the model, however the task 'install_glm_startdata' is not submitting, and is constantly 'retrying' - it has tried to submit 9 times, without success (since yesterday evening). I know that Archer is running fine.

Is there anything I can do about this?

Best wishes,

Andrew

comment:26 Changed 4 days ago by willie

Hi Andrew,
The job.out file has

Source /work/n02/n02/anmcr/start_files/Antarctic_Cordex/atmos/19970101_00.grib does not exist

Willie
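
Missing input files caused two separate delays in this ticket, so a pre-flight check before submitting the suite may save a retry cycle. The sketch below is a hypothetical helper (check_inputs is not part of Rose or cylc) that prints every path that does not exist and returns a non-zero status if any are missing.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print each listed input file that is
# missing and return non-zero if at least one is absent.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "Source $f does not exist"
            status=1
        fi
    done
    return "$status"
}

# Example, using a path quoted in this ticket:
# check_inputs /work/n02/n02/anmcr/start_files/Antarctic_Cordex/atmos/19970101_00.grib
```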

comment:27 Changed 4 days ago by anmcr

Hi Willie,
I solved the above problem.
I will let you know shortly whether the run has completed or not.
Many thanks,
Andrew

comment:28 Changed 4 days ago by anmcr

Hi Willie,

I made the change you suggested to compile um-convpp. See attachment. However, it's still not being compiled, and the 'no such file or directory' error persists: /home/n02/n02/anmcr/cylc-run/u-bf906/share/fcm_make_um_utils/um-convpp: No such file or directory.

Are you able to advise again, please?

Many thanks,

Andrew

Changed 4 days ago by anmcr

Changed 4 days ago by anmcr

comment:29 Changed 4 days ago by willie

Hi Andrew,

The GUI is complaining that selecting compile convpp is a problem - the little red triangle has appeared and the cog wheel is red. It seems that the convpp compile should not be switched on directly, as I originally thought. You need to go to the configuration file page and select 'serial utilities' as the config type. This automatically selects the appropriate compiles, including convpp. Then do a new build.

Willie

Changed 4 days ago by anmcr

Changed 4 days ago by anmcr

comment:30 Changed 4 days ago by anmcr

Hi Willie,

I did as you suggested. Attached are screenshots of the 'configuration file' and 'make steps' pages. Unfortunately it now fails on the build step, complaining that it can't find various files. (Note that I assumed that I would have to rebuild the executable, so I ran rose suite-new --restart to make sure that I had a clean suite.) Looking at the configuration file page, it seems that the atmospheric model and reconfiguration executables are now not being compiled. Can you please advise?

Thanks for all your help.

Best wishes,

Andrew

[FAIL] config-file=/home/anmcr/cylc-run/u-bf832/work/19970101T0000Z/fcm_make/fcm-make.cfg:3
[FAIL] config-file= - https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg
[FAIL] https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg: cannot load config file
[FAIL] https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg: not found
[FAIL] svn: E215004: Authentication failed and interactive prompting is disabled; see the --force-interactive option
[FAIL] svn: E215004: Unable to connect to a repository at URL 'https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg'
[FAIL] svn: E215004: No more credentials or we tried too many times.
[FAIL] Authentication failed

[FAIL] fcm make -f /home/anmcr/cylc-run/u-bf832/work/19970101T0000Z/fcm_make/fcm-make.cfg -C /home/anmcr/cylc-run/u-bf832/share/fcm_make -j 4 --ignore-lock mirror.target=login.archer.ac.uk:cylc-run/u-bf832/share/fcm_make mirror.prop{config-file.name}=2 # return-code=1
Received signal ERR
cylc (scheduler - 2019-02-13T11:55:06Z): CRITICAL Task job script received signal ERR at 2019-02-13T11:55:06Z
cylc (scheduler - 2019-02-13T11:55:06Z): CRITICAL failed at 2019-02-13T11:55:06Z

comment:31 Changed 32 hours ago by willie

Hi Andrew,

This suite is designed for use on Monsoon. It needs to be modified for ARCHER as follows.

  • Don't try to compile um-convpp - there is a central one, so just return to compiling the atmosphere/reconfiguration.
  • In the archive app umpp script, add the following line after the export:

    mkdir -p /nerc/n02/n02/$USER/$ROSE_SUITE_NAME/field.pp

    and change the um-convpp to the central one:

    $UMDIR/vn11.1/cce/utilities/um-convpp $1 $2

  • In the archive optional configurations (-cb, -ff, -ic), change moo put -F to scp and add the line

    target-prefix=/nerc/n02/n02/$USER/$ROSE_SUITE_NAME/

    to the end of each file. Note the slash at the end of the line.

The suite should then work.
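
The edit to the archive optional configurations could be scripted roughly as follows. This is a sketch, not a tested recipe: the function name patch_archive_opts is hypothetical, and it assumes the three optional configurations are files named rose-app-cb.conf, rose-app-ff.conf and rose-app-ic.conf in the archive app's opt directory. The target-prefix line is written with $USER and $ROSE_SUITE_NAME left unexpanded, exactly as given above.

```shell
#!/bin/sh
# Hypothetical helper: apply the two archive edits to each optional config.
# Assumes files named rose-app-cb.conf, rose-app-ff.conf, rose-app-ic.conf
# inside the directory passed as the first argument.
patch_archive_opts() {
    opt_dir=$1
    # Single quotes keep $USER and $ROSE_SUITE_NAME literal in the file.
    prefix_line='target-prefix=/nerc/n02/n02/$USER/$ROSE_SUITE_NAME/'
    for key in cb ff ic; do
        conf="$opt_dir/rose-app-$key.conf"
        [ -f "$conf" ] || { echo "skipping $conf (not found)"; continue; }
        # Replace "moo put -F" with "scp", writing through a temporary file.
        sed 's/moo put -F/scp/' "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
        # Append the target prefix only once; note the trailing slash.
        grep -qxF "$prefix_line" "$conf" || printf '%s\n' "$prefix_line" >> "$conf"
    done
}

# Example (directory is an assumption about the suite layout):
# patch_archive_opts ~/roses/u-bf120/app/archive/opt
```

Running the function twice leaves the files unchanged the second time, so it is safe to repeat after a partial edit. It is worth reviewing the resulting files in the Rose GUI before rebuilding.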

Willie
