#2610 closed help (fixed)

failed to retrieve data from Monsoon

Reported by: amenon
Owned by: um_support
Component: UM Model
Keywords: MASS down
Cc:
Platform: Monsoon2
UM Version: 10.9

Description

Hi,

My ensemble nesting suite is failing to retrieve the global forecasts archived at the Met Office. The install_engl_startdata is failing with the following error

UserWarning: Failed to retrieve data. Command 'moo get -f moose:/opfc/atm/global/rerun/201607.file/20160701T0000Z_glm_t+0 /home/d04/arame/cylc-run/u-bb030/share/cycle/20160701T0000Z/engl/ics/em0' returned non-zero exit status 1
2018-09-19T11:35:26Z CRITICAL - failed/EXIT
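
The "returned non-zero exit status 1" wording above is what Python's subprocess module reports when a command fails. As a minimal sketch only, and not what install_engl_startdata actually does, a retrieval step could be made tolerant of a short outage by retrying the command a few times. The function name, the number of attempts and the delay are all hypothetical.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_s=0.0):
    """Run cmd (a list of arguments), retrying on a non-zero exit status.

    Returns the number of attempts used. If every attempt fails, the last
    subprocess.CalledProcessError is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            # check=True raises CalledProcessError on a non-zero exit status
            subprocess.run(cmd, check=True)
            return attempt
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            # Wait before trying again (hypothetical delay)
            time.sleep(delay_s)
```

The moo get command from the error message could be passed as an argument list, for example ["moo", "get", "-f", source, destination]. Retrying only helps with an outage that ends within the retry window; it would not have helped while the moose client itself was unavailable.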

This job succeeded last week without any error. Today I was re-running the suite using rose suite-run --new.

There was a planned MASS outage from 10-12 BST today. Does this error have something to do with that, or is it something else? Please help.

Cheers

Change History (10)

comment:1 Changed 13 months ago by amenon

Just to add, I cannot run moo ls from the terminal either. This is what it gives:

arame@xcslc0:~> moo ls moose:/devfc/u-ar473/field.pp/
/opt/ukmo/mass/moose-monsoon-client-latest/bin/mooLaunch_Linux_x86_64: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /opt/ukmo/mass/moose-monsoon-client-latest/bin/mooLaunch_Linux_x86_64)
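
The message above says the moose launcher binary requires GLIBC_2.14 or newer. As a rough sketch, the version requirement can be compared against the C library on the node you are logged in to. The helper names are hypothetical, and the glibc version actually installed on xcslc0 is not stated in this ticket.

```python
import os


def parse_version(text):
    """Turn '2.14' or 'glibc 2.12' into a tuple of ints, e.g. (2, 14)."""
    number = text.strip().split()[-1]
    return tuple(int(part) for part in number.split("."))


def glibc_satisfies(installed, required):
    """True if the installed version is at least the required one.

    Tuples are compared, so 2.9 correctly sorts before 2.14.
    """
    return parse_version(installed) >= parse_version(required)


def installed_glibc():
    """Return a string such as 'glibc 2.17', or None if it cannot be determined."""
    try:
        return os.confstr("CS_GNU_LIBC_VERSION")
    except (ValueError, OSError, AttributeError):
        # Not a glibc system, or os.confstr is unavailable on this platform
        return None
```

For example, glibc_satisfies(installed_glibc(), "2.14") on a glibc-based Linux node reports whether that node could load a binary built against GLIBC_2.14.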

comment:2 Changed 13 months ago by willie

  • Keywords MASS down added; Failed to retrieve data removed

Hi Arathy,

That indicates MASS is still not up. I've informed the Monsoon help desk.

The install start data app is essential to your suite.

Regards
Willie

comment:3 Changed 13 months ago by amenon

Thanks Willie. The suite is still stuck with this issue. Please let me know if you hear from the Monsoon help desk.

Regards,
Arathy

comment:4 Changed 13 months ago by willie

Hi Arathy,

Monsoon have posted this on the "Monsoon Collaboration Service" Yammer group (worth signing up for):

The moose client is now available and working on the XCS-C login nodes. Please allow approx. 40 minutes for the deployment to the rest of the machine to take place. So should be available by 11:30 BST.

Regards
Willie

comment:5 Changed 13 months ago by amenon

Hi Willie,

The suite succeeded in retrieving the global forecasts once MASS was up. Currently, the LAM forecasts for two ensembles are taking a very long time to run. I doubled the number of processors for the LAM forecast job from 12X12 to 24X24, but it has still been running for the past 4 days without succeeding. Output files are created in the cylc-run directory, but they are empty. The suite id is u-bb030. Could you please have a look at why it is taking so long?

Regards,
Arathy

comment:6 Changed 13 months ago by willie

Hi Arathy,

If you look at the job.out for INCOMPASS_km4p4_ra1t_inc4p4_um_fcst_em1_cr0 it says it has done 117 time steps and then run out of time at 1200 sec. This is only 20 minutes. So you just need to re-scale your job to allow it to complete the full run. Make sure that it does not exceed the queue limit - the normal queue has a 4 hour limit (use qstat -q to see this). If it does, you will need to modify the CRUN length.
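
As a rough worked example of that re-scaling, the figures above (117 time steps in 1200 sec, 4 hour queue limit) give a crude cost per time step. The sketch below assumes the cost per step stays constant, which ignores start-up time and any change in processor count. The total number of time steps in the run is not given in this ticket, so total_steps and the safety margin are hypothetical inputs.

```python
import math

STEPS_DONE = 117          # time steps completed, from job.out
SECONDS_USED = 1200       # wallclock limit that was hit (20 minutes)
QUEUE_LIMIT_S = 4 * 3600  # 4 hour limit of the normal queue


def seconds_per_step(steps_done=STEPS_DONE, seconds_used=SECONDS_USED):
    """Average wallclock seconds per time step (about 10.26 here)."""
    return seconds_used / steps_done


def wallclock_needed(total_steps, margin=1.2):
    """Estimated wallclock in whole seconds, padded by a safety margin."""
    return math.ceil(total_steps * seconds_per_step() * margin)


def max_steps_within(limit_s=QUEUE_LIMIT_S,
                     steps_done=STEPS_DONE, seconds_used=SECONDS_USED):
    """Largest number of steps that fits in limit_s at the observed rate."""
    # Integer arithmetic avoids rounding surprises at exact boundaries
    return (limit_s * steps_done) // seconds_used
```

At the observed rate, max_steps_within() gives 1404 time steps in 4 hours with no margin. If wallclock_needed(total_steps) comes out above QUEUE_LIMIT_S for your run, that is the case where the CRUN length needs to be shortened.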

Regards
Willie

comment:7 Changed 13 months ago by amenon

Hi Willie,

I am not able to figure out where the LAM forecast job is getting the wallclock limit of 20 minutes from. I tried adding "-l walltime = 04:00:00" and "execution time limit = PT4H" under the LAM forecast directives in the suite-adds.rc file. It originally had "-l walltime = {{WALL_CLOCK_LIMIT}}". I also tried inserting "-q = {{SERIALQ}}" under the LAM directives. Even with all these changes, the LAM forecasts run out of time at 1200 sec.
I also tried increasing the wall_clock_limit in the general options window in the rose GUI, still with no luck. When I make changes in the general options window, do I need to start the suite afresh for those changes to be incorporated? I was restarting the suite after making this change.

Regards,
Arathy

comment:8 Changed 13 months ago by willie

Hi Arathy,

In a couple of places in suite-adds.rc for Monsoon we have

            -l walltime = {{WALL_CLOCK_LIMIT}}

and WALL_CLOCK_LIMIT is set to 1200 in the suite file bin/setup_metadata. I'm not sure what's going on here.
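
For reference, the value 1200 matches the 1200 sec at which the job ran out of time, i.e. 20 minutes. The small sketch below converts a limit in seconds into the two forms quoted earlier in this ticket: the HH:MM:SS form used with "-l walltime" and the ISO 8601 duration form used with "execution time limit". The function names are hypothetical, and whether the suite expects seconds or HH:MM:SS for WALL_CLOCK_LIMIT is an assumption to check with the suite owner.

```python
def split_hms(seconds):
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


def to_hms(seconds):
    """Format seconds as HH:MM:SS, e.g. for a '-l walltime' directive."""
    hours, minutes, secs = split_hms(seconds)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def to_iso8601_duration(seconds):
    """Format seconds as an ISO 8601 duration such as PT4H or PT20M."""
    hours, minutes, secs = split_hms(seconds)
    parts = [f"{value}{unit}"
             for value, unit in ((hours, "H"), (minutes, "M"), (secs, "S"))
             if value]
    return "PT" + ("".join(parts) or "0S")
```

With these helpers, 1200 becomes 00:20:00 and PT20M, while the 4 hour queue limit (14400) becomes 04:00:00 and PT4H.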

You should read the ensemble suite notes at https://code.metoffice.gov.uk/trac/rmed/wiki/suites/ensemble/worked_eg_2017 and seek advice from the suite owner.

Regards
Willie

comment:9 Changed 13 months ago by amenon

Hi Willie,
The LAM forecast job has now succeeded just by increasing the number of processors, without changing the wall clock time. I didn't change anything in bin/setup_metadata. Earlier, when I restarted/reloaded the suite after increasing the processors from 12X12 to 24X24, the suite kept failing because it was not actually picking up those changes, even though I was reloading the suite. Later I logged out of Monsoon, logged back in and reloaded the suite. This time it picked up the change in the number of processors and the LAM forecast job succeeded. I don't know why "rose suite-run --reload" didn't work in an already logged-in session. We can now close this ticket. Thanks.

Cheers

comment:10 Changed 13 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed