Opened 3 months ago

Closed 4 weeks ago

#2739 closed help (fixed)

Reconfiguration failure

Reported by: anmcr Owned by: willie
Component: UM Reconfiguration Keywords: GRIB_API
Cc: Platform: ARCHER
UM Version: 11.1

Description

Hello,

I had a job that was successfully running on Monsoon, and ported it to ARCHER. However, when I run it, the reconfiguration fails because it cannot find GRIB_API. The attachment shows a screenshot of the error. This is similar to ticket #2655, but I'm afraid I'm not sure how to resolve the issue. The job id is u-bf120.

Thanks in advance for any suggestions.

Best wishes,

Andrew

Attachments (10)

for_ncas_ticket.JPG (89.2 KB) - added by anmcr 3 months ago.
for_willie_1.JPG (105.3 KB) - added by anmcr 2 months ago.
for_willie_2.JPG (107.6 KB) - added by anmcr 2 months ago.
for_willie_1.2.JPG (105.3 KB) - added by anmcr 2 months ago.
for_willie_3.JPG (281.2 KB) - added by anmcr 2 months ago.
for_willie_4.JPG (76.1 KB) - added by anmcr 2 months ago.
for_willie_5.JPG (114.6 KB) - added by anmcr 2 months ago.
for_willie_5.2.JPG (114.6 KB) - added by anmcr 2 months ago.
for_willie_configuration_file.JPG (95.8 KB) - added by anmcr 2 months ago.
for_willie_make_steps.JPG (97.5 KB) - added by anmcr 2 months ago.


Change History (52)

Changed 3 months ago by anmcr

comment:1 Changed 3 months ago by anmcr

Hello again,

I tried to make some progress on my own, as this work is quite high priority. Unfortunately I still haven't managed to make any headway. I was wondering whether vn11.1 is actually available on ARCHER?

Many thanks,

Andrew

comment:2 Changed 3 months ago by ros

Hi Andrew,

Sorry for the delay. Yes, UM11.1 is available; however, we don't support GRIB_API on ARCHER and haven't done so for a long while. You say that you have run this suite on Monsoon; could you copy over the reconfigured files from there?

Regards,
Ros.

comment:3 Changed 3 months ago by anmcr

Hi Ros,

Thanks for replying.

The reconfiguration failure is for the global N320 model. I had a look on Monsoon, and these files are held centrally at e.g. /projects/um1/ancil/atmos/n320e/land_sea_mask/igbp/v2/qrparm.mask. I think that these are the actual final versions of the files. I could copy these over to Archer. However, I'm unsure where to copy them to. Would it be somewhere like: /work/n02/n02/anmcr/cylc-run/u-bf120/share/cycle/19800701T0000Z/glm/um?

Thanks,

Andrew

comment:4 Changed 3 months ago by ros

Hi Andrew,

All the files under /projects/um1/ancil/atmos/n320e on Monsoon should already be on ARCHER under /work/y07/y07/umshared/ancil/atmos/n320e.

Hope that helps.

Regards,
Ros.

comment:5 Changed 3 months ago by anmcr

Hi Ros,

Thanks for the information.

I'm unsure where on ARCHER I should copy the files under /work/y07/y07/umshared/ancil/atmos/n320e to. Can you please advise?

I was unable to find an analogous location on Monsoon that I could copy.

Andrew

comment:6 Changed 3 months ago by ros

Hi Andrew,

I think we're getting confused between start dumps and ancillary files. The files under /work/y07/y07/umshared/ancil are ancil files; you don't need to copy them anywhere, just point your suite at them. When I said could you copy over the reconfigured files from Monsoon, I meant the start dumps that were created when the reconfiguration ran on Monsoon. Then point your ARCHER suite at these files and thus avoid having to run the reconfiguration on GRIB files.

Regards,
Ros.

comment:7 Changed 3 months ago by anmcr

Hi Ros,

If it is start dumps, then I'm afraid that copying them over from Monsoon is not going to work. This will be a 35-year 'free run', run in 6-hourly (CRUN) chunks, so I assume that the reconfiguration step will be run every 6 hours as well.

I had intended to do this run on Monsoon, but do not have enough usage allocation. However, I do have available MAUs on ARCHER that I can use - hence why I am trying to copy the job over from Monsoon to ARCHER.

Best wishes,

Andrew

comment:8 Changed 3 months ago by anmcr

Ros,

At the moment my run is being forced by ERA-Interim atmospheric fields (suite conf > jinja.suite.rc > Driving model setup > dm_ic_file), as well as SST and sea-ice fields (suite conf > glm_um > Reconfiguration and Ancillary Control > Config ancils and initialise dump fields), all of which are GRIB files. If I were to convert these files to NetCDF, would the model be able to read them?

The available format types (input_dump_type) seem to be 'UM', 'GRIB', 'GRIB2FF' - so maybe not. I am unsure how I would go about converting my GRIB files to UM format.

Thanks,

Andrew

comment:9 Changed 3 months ago by grenville

Hi Andrew

The model doesn't read netcdf (in this context) - what is the suite id of the Monsoon job which handles grib files?

Grenville

comment:10 Changed 3 months ago by anmcr

Hi Grenville,

The Monsoon suite is u-be146.

Thanks,

Andrew

comment:11 Changed 3 months ago by grenville

Andrew

We are working on getting grib working on ARCHER.

Grenville

comment:12 Changed 3 months ago by willie

Hi Andrew,

I am trying to install the GRIB API on ARCHER, but I have encountered some problems with linking the executable that are likely to take some time to resolve. It might be quicker to run your model on Monsoon/NEXCS.

I'll let you know when I get it working on ARCHER.

Willie

comment:13 Changed 3 months ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to assigned

comment:14 Changed 2 months ago by anmcr

Hi Willie,

Thanks for the update. I am part of the 'polar' group on Monsoon, but unfortunately we have used up our allocation and Monsoon is full, hence I want to run on ARCHER, as I still have 20 MAUs on that machine.

Best wishes,

Andrew

comment:15 Changed 2 months ago by willie

Hi Andrew,

I've installed the GRIB_API but I need some of your ancillaries in order to test it. I can't find

/work/n02/n02/anmcr/start_files/Antarctic_Cordex/sic_sst/19791201_00-19801231_18_sic_um.grid

Could you reinstate or create this please?

Willie

comment:16 Changed 2 months ago by willie

Also the _sst_ version, please.

Willie

comment:17 Changed 2 months ago by anmcr

Hi Willie,

Many thanks for your help with this. I realise that this has been a considerable effort to get this to work.

The pathname has changed slightly, and now includes 'glm':

/home/n02/n02/anmcr/work/start_files/Antarctic_Cordex/sic_sst/glm/19791201_00-19801231_18_sic_um_grid

/home/n02/n02/anmcr/work/start_files/Antarctic_Cordex/sic_sst/glm/19791201_00-19801231_18_sst_um_grid

Best wishes,

Andrew

comment:18 Changed 2 months ago by willie

Hi Andrew,

I now have a version of u-bf120 which reconfigures the GRIB file. You need to make the following changes to your model.

In the fcm_make app, go to the Configuration file and change the config_root_path to

fcm:um.x-br/dev/williammcginty/vn11.1_NCAS_GRIB_API

Delete the config_revision.

On the Sources page, add the um_source

fcm:um.x-br/dev/williammcginty/vn11.1_NCAS_GRIB_API

If you haven't already done so, correct the ancillary filenames as in the previous comment.

Then build the code again and run it.

Willie

comment:19 Changed 2 months ago by anmcr

Dear Willie,

Thanks for all your effort with this. I made the changes you suggested in u-bf120, which I think I did correctly (though I was a bit confused about whether I should add 'um_source' as you suggested or 'um_sources', which was already in the file ../u-bf120/app/fcm_make/rose-app.conf). See the attached screenshots. I've copied the error below. It refers, I think, to an issue on line 3 of /home/anmcr/cylc-run/u-bf120/work/19880101T0000Z/fcm_make/fcm-make.cfg, which is the line 'extract.location{diff}[um] = $um_sources'. I'm afraid I was unable to solve this myself, so I would appreciate it if you could advise.

Best wishes,

Andrew


anmcr@puma:/home/anmcr/cylc-run/u-bf120/log/job/19880101T0000Z/fcm_make/01> more job.err
[FAIL] /home/anmcr/cylc-run/u-bf120/work/19880101T0000Z/fcm_make/fcm-make.cfg:3: reference to undefined variable
[FAIL] include =
[FAIL] undef($config_revision)

[FAIL] fcm make -f /home/anmcr/cylc-run/u-bf120/work/19880101T0000Z/fcm_make/fcm-make.cfg -C /home/anmcr/cylc-run/u-bf120/share/fcm_make -j 4 --ignore-lock mirror.target=login.archer.ac.uk:cylc-run/u-bf120/share/fcm_make mirror.prop{config-file.name}=2 # return-code=9
Received signal ERR
cylc (scheduler - 2019-02-08T19:57:52Z): CRITICAL Task job script received signal ERR at 2019-02-08T19:57:52Z
cylc (scheduler - 2019-02-08T19:57:52Z): CRITICAL failed at 2019-02-08T19:57:52Z

Changed 2 months ago by anmcr

Changed 2 months ago by anmcr

Changed 2 months ago by anmcr

comment:20 Changed 2 months ago by willie

Hi Andrew,
You need to add the config_revision back in, but leave the box empty.
Willie
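
Taken together with comment:18, the fcm_make app then ends up with settings roughly like the following. This is a sketch only: the `[env]` section and variable names are assumed from a typical UM fcm_make rose-app.conf, and it uses `um_sources`, the name already present in the suite as noted in comment:19.

```
[env]
config_root_path=fcm:um.x-br/dev/williammcginty/vn11.1_NCAS_GRIB_API
config_revision=
um_sources=fcm:um.x-br/dev/williammcginty/vn11.1_NCAS_GRIB_API
```

The key detail is that `config_revision` stays declared but empty, so the fcm-make.cfg include line no longer references an undefined variable.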

comment:21 Changed 2 months ago by anmcr

Hi Willie,

Thanks again for looking at this. The run worked up until the 'archive' stage of the nested model. I'm afraid I was unable to fix it. I have attached a screenshot of the error. Are you able to advise?

Could you also please advise me on how to compute the number of AUs the run will use? Presumably this is based on the number of processors and the wallclock time.

Thanks again,

Andrew

Changed 2 months ago by anmcr

comment:22 Changed 2 months ago by willie

Hi Andrew,

The convpp program hasn't been built:

[FAIL] /home/n02/n02/anmcr/cylc-run/u-bf832/app/archive/bin/umpp: line 11: /home/n02/n02/anmcr/cylc-run/u-bf832/share/fcm_make_um_utils/um-convpp: No such file or directory

You do this on the fcm_make → Make Steps page near the bottom.

Willie

Changed 2 months ago by anmcr

comment:23 Changed 2 months ago by anmcr

Hi Willie,

Thanks again for the reply and help.

Attached is a screenshot of my 'make steps' page. All the options are already set to 'yes', and there isn't an explicit mention of 'convpp'. Could you please advise?

Also, could you please advise how to work out the AU cost of a run (given a number of processors and wallclock time)?

Best wishes,

Andrew
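
As an aside, the AU question doesn't get an explicit answer in this ticket. As the reporter guesses, the cost scales with resources reserved times wallclock time; on ARCHER the charge was per node-hour. A sketch with illustrative numbers only (NODES, HOURS and RATE below are all placeholders; the actual AU rate per node-hour was published in the ARCHER SAFE documentation):

```shell
# Rough AU estimate: nodes x wallclock hours x charge rate.
# All three values are illustrative, not real ARCHER figures.
NODES=8
HOURS=6
RATE=360   # assumed AUs per node-hour; check SAFE for the real rate
echo $(( NODES * HOURS * RATE ))   # 8 * 6 * 360 = 17280 AUs
```

The same arithmetic works per CRUN chunk, multiplied by the number of chunks in the 35-year run.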

comment:24 Changed 2 months ago by willie

Hi Andrew,

In the Rose GUI, select the View drop down menu and tick "View all ignored variables" and "View Latent variables".

Willie

comment:25 Changed 2 months ago by anmcr

Hi Willie,

Thanks for your help. I have included 'convpp' in the build as you suggested.

I re-ran the model, however the task 'install_glm_startdata' is not submitting, and is constantly 'retrying' - it has tried to submit 9 times, without success (since yesterday evening). I know that Archer is running fine.

Is there anything I can do about this?

Best wishes,

Andrew

comment:26 Changed 2 months ago by willie

Hi Andrew,
The job.out file has

Source /work/n02/n02/anmcr/start_files/Antarctic_Cordex/atmos/19970101_00.grib does not exist

Willie

comment:27 Changed 2 months ago by anmcr

Hi Willie,
I solved the above problem.
I will let you know shortly whether the run has completed or not.
Many thanks,
Andrew

comment:28 Changed 2 months ago by anmcr

Hi Willie,

I made the change you suggested to compile um-convpp. See attachment. However, it's still not being compiled, and the 'no such file or directory' error persists: /home/n02/n02/anmcr/cylc-run/u-bf906/share/fcm_make_um_utils/um-convpp: No such file or directory.

Are you able to advise again, please?

Many thanks,

Andrew

Changed 2 months ago by anmcr

Changed 2 months ago by anmcr

comment:29 Changed 2 months ago by willie

Hi Andrew,

The GUI is complaining that selecting compile convpp is a problem - the little red triangle has appeared and the cog wheel is red. It seems that the convpp compile should not be switched on directly, as I originally thought. You need to go to the configuration file page and select 'serial utilities' as the config type. This automatically selects the appropriate compiles, including convpp. Then do a new build.

Willie

Changed 2 months ago by anmcr

Changed 2 months ago by anmcr

comment:30 Changed 2 months ago by anmcr

Hi Willie,

I did as you suggested. Attached are screenshots of the 'configuration file' and 'make steps' pages. Unfortunately it now fails on the build step, complaining that it can't find various files. (Note that I assumed I would have to rebuild the executable, so I ran rose suite-new --restart to make sure that I had a clean suite.) Looking at the configuration file page, it seems that the atmosphere model and reconfiguration executables are now not being compiled. Can you please advise?

Thanks for all your help.

Best wishes,

Andrew

[FAIL] config-file=/home/anmcr/cylc-run/u-bf832/work/19970101T0000Z/fcm_make/fcm-make.cfg:3
[FAIL] config-file= - https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg
[FAIL] https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg: cannot load config file
[FAIL] https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg: not found
[FAIL] svn: E215004: Authentication failed and interactive prompting is disabled; see the --force-interactive option
[FAIL] svn: E215004: Unable to connect to a repository at URL 'https://code.metoffice.gov.uk/svn/um/main/branches/dev/williammcginty/vn11.1_NCAS_GRIB_API/fcm-make/ncas-xc30-cce/um-utils-serial-high.cfg'
[FAIL] svn: E215004: No more credentials or we tried too many times.
[FAIL] Authentication failed

[FAIL] fcm make -f /home/anmcr/cylc-run/u-bf832/work/19970101T0000Z/fcm_make/fcm-make.cfg -C /home/anmcr/cylc-run/u-bf832/share/fcm_make -j 4 --ignore-lock mirror.target=login.archer.ac.uk:cylc-run/u-bf832/share/fcm_make mirror.prop{config-file.name}=2 # return-code=1
Received signal ERR
cylc (scheduler - 2019-02-13T11:55:06Z): CRITICAL Task job script received signal ERR at 2019-02-13T11:55:06Z
cylc (scheduler - 2019-02-13T11:55:06Z): CRITICAL failed at 2019-02-13T11:55:06Z

comment:31 Changed 2 months ago by willie

Hi Andrew,

This suite is designed for use on Monsoon. It needs to be modified for ARCHER as follows.

  • Don't try to compile um-convpp - there is a central one, so just return to compiling the atmosphere/reconfiguration.
  • In the archive app umpp script add the following line after the export
mkdir -p /nerc/n02/n02/$USER/$ROSE_SUITE_NAME/field.pp

and change the um-convpp to the central one:

$UMDIR/vn11.1/cce/utilities/um-convpp $1 $2
  • In the archive optional configurations (-cb, -ff, -ic) change moo put -F to scp and add the line
target-prefix=/nerc/n02/n02/$USER/$ROSE_SUITE_NAME/

to the end of each file. Note the slash at the end of the line.

The suite should then work.

Willie
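
Pulling the umpp edits above into one place, the modified script body might look like the sketch below. This is a reconstruction, not the full script: the export line it refers to isn't shown in the ticket, and NERC_ROOT is a hypothetical variable standing in for /nerc/n02/n02 so the sketch can be exercised outside ARCHER.

```shell
#!/bin/sh
# Sketch of app/archive/bin/umpp for ARCHER (see the steps above).
# NERC_ROOT is hypothetical; on ARCHER it would be /nerc/n02/n02.
NERC_ROOT=${NERC_ROOT:-/nerc/n02/n02}

# Make sure the RDF target directory exists before anything is copied there.
mkdir -p "$NERC_ROOT/$USER/$ROSE_SUITE_NAME/field.pp"

# Use the centrally installed um-convpp rather than a locally built one.
"$UMDIR/vn11.1/cce/utilities/um-convpp" "$1" "$2"
```

The matching change in the archive optional configurations (-cb, -ff, -ic) is to replace moo put -F with scp and append the target-prefix line quoted above, so output lands under the same /nerc tree.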

comment:32 Changed 2 months ago by anmcr

Dear Willie,

It worked. Thanks for all your efforts to enable this.

However, when I ran a longer 'production' run I very quickly ran out of space on /nerc/n02/n02/anmcr. See below. Are you able to increase this? I have a number of long 40-year production runs, so I could do with as much space as possible.

Many thanks,

Andrew

cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverc006.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pvera012.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverb018.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverb012.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pa000.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverd012.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverd000.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverc012.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverd006.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverd018.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverc000.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pvera000.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pvera006.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverc018.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverb000.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pverb006.pp': Disk quota exceeded
[FAIL] cp: writing `/nerc/n02/n02/anmcr/u-bg080/field.pp/19980101T0000Z_AntarcticCORDEX_0p44deg_ga6_pvera018.pp': Disk quota exceeded

comment:33 Changed 2 months ago by willie

Hi Andrew,

The RDF has been topped up so you should be good to go.

Quick question: how did you create your ancillary file

/work/n02/n02/anmcr/start_files/Antarctic_Cordex/sic_sst/glm/19871201_00-19881231_12_sic_um_grid_glm

?
Is there a reason why these fields are not in the GRIB start dump? Someone else needs to do something similar.

Willie

comment:34 Changed 2 months ago by anmcr

Hi Willie,

Thanks again for all your help.

In answer to your question, I was under the impression that SST and sea-ice ancillaries for e.g. the global model are set at glm_um → reconfiguration and ancillary control → configure ancils and initialise dump fields - and therefore not part of the start dump / analysis file (driving model setup → dm_ic_file). This therefore required me to make a separate set of ancillary files using Xancil.

Note that I am running twice daily 24h forecasts, and keeping the T+12-24h portion, and then concatenating them together to make a long time series. For the forecast run I only needed to force the global model with daily SST/seaice ancillaries, as the model then took this input and updated the lower boundary condition of the nested suite (ie the SST/seaice evolved with time throughout the model run).

Note that I am using a forecast-run methodology because I couldn't get the 'free run' option to work reliably: the nested suite would occasionally fail with 'Error Message: Mid conv went to the top of the model at point', presumably due to a grid-point storm (see ticket #1639), which I didn't have time to investigate properly but which wasn't fixed by reducing the timestep. The 'free run' option also required me to update the SST/seaice ancils for the global and nested suites separately, as otherwise the nested suite's SST/seaice fields were stuck at the initial date and didn't evolve with time.

I hope this helps.

Best wishes,

Andrew

comment:35 Changed 2 months ago by willie

Thanks Andrew, that's very helpful.

Willie

comment:36 Changed 2 months ago by anmcr

Hi Willie,

I've got another issue, which should be straightforward to fix. The jobs are failing with 'submit-failed' errors, exactly as described in ticket #2447. See the end of this message for the series of 'ssh failed' warnings I get when I type 'rose host-select archer' on the command line. Ros suggests a solution in ticket #2447; although she refers to the suite.rc file, in my setup the relevant file is likely /site/ncas-cray-xc30/suite-adds.rc. However, looking at the file it wasn't clear to me what changes I should make (she suggests setting host = login.archer.ac.uk). I tried a few things, but the submit failures continued. Could you please advise?

Many thanks again.

Andrew

anmcr@puma:/home/anmcr/roses/u-bg289/site/ncas-cray-xc30> rose host-select archer
[WARN] login8.archer.ac.uk: (ssh failed)
[WARN] login4.archer.ac.uk: (ssh failed)
[WARN] login1.archer.ac.uk: (ssh failed)
[WARN] login6.archer.ac.uk: (ssh failed)
[WARN] login5.archer.ac.uk: (ssh failed)
[WARN] login2.archer.ac.uk: (ssh failed)
[WARN] login3.archer.ac.uk: (ssh failed)
[WARN] login7.archer.ac.uk: (ssh failed)
login.archer.ac.uk

comment:37 Changed 8 weeks ago by willie

Hi Andrew,

You just have to SSH into each login node directly from PUMA and then try rose host-select archer again.

Willie
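
For the record, the warnings in comment:36 name eight login nodes, so one way to work through this suggestion is to accept each node's host key once. A sketch that simply prints the commands to run interactively from PUMA (node names taken from the rose host-select warnings above):

```shell
# Print one ssh command per ARCHER login node listed in the
# rose host-select warnings. Run each interactively so the host key
# gets accepted, then retry 'rose host-select archer'.
for n in 1 2 3 4 5 6 7 8; do
  echo "ssh login$n.archer.ac.uk true"
done
```

Each command should only need to be run once per PUMA account, since the accepted keys persist in ~/.ssh/known_hosts.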

comment:38 Changed 8 weeks ago by anmcr

Hi Willie,

This was the advice given by Ros in ticket #2447. I'm pretty certain this is what I tried, but that the submit failures continued. I will check again later today and get back to you.

Many thanks for your help.

Andrew

comment:39 Changed 8 weeks ago by anmcr

Hi Willie,

I did as you suggested, but the 'submit failed' error persisted.

Best wishes,

Andrew

comment:40 Changed 4 weeks ago by willie

Hi Andrew, Is this still a problem?

Willie

comment:41 Changed 4 weeks ago by anmcr

Hi Willie,

It certainly was a problem at the time I raised it. Due to time pressure to get the simulations completed, I finished the runs on Monsoon. I just submitted a test run on ARCHER, and straight away there was a 'submit failed' error.

However, I have another two sets of nested suite simulations that I will need to run on ARCHER, beginning in the next week or so, using the modified model you developed in this ticket. Can I suggest that we therefore close this ticket, and I'll start a new one if this is still an issue (which it likely will be)?

Many thanks for all your help.

Andrew

comment:42 Changed 4 weeks ago by willie

  • Resolution set to fixed
  • Status changed from assigned to closed