#2890 closed error (fixed)

11.3 job hanging on MONSooN2

Reported by: scottan Owned by: um_support
Component: UM Model Keywords: UKCA, SC0138
Cc: Platform: Monsoon2
UM Version: <select version>

Description

Hello,

I'm having the same problem again of tasks getting stuck on submit-retrying. This is on version 11.3, using a freshly copied suite from u-bg779. The MONSooN.rc file is different from previous versions, so I was not sure what to change.

http://www.ukca.ac.uk/wiki/index.php/GA7.1_StratTrop_suites#TS2000_free-running_suites

Note the most recent version I could select in the dropdown menu below was 11.2.

Many thanks,
Scott

Change History (14)

comment:1 Changed 10 months ago by luke

Hi Scott,

If you reset the state of the tasks to "waiting", does it then proceed but then hang again at the next cycle point?

I have been seeing this behaviour in one of my AerChemMIP AMIP experiments at vn11.1.

Due to mule changes the um_tools module needs to be loaded, and the scitools module was fixed at scitools/production-os41-1 - I wonder if this has anything to do with it. Which task is the one getting stuck?

Thanks,
Luke

comment:2 Changed 10 months ago by scottan

Hi Luke,

Sorry, I should have said that all of the tasks are getting stuck, and they get stuck again when I reset them to "waiting". In that sense, it does look different from when I had a similar problem before.

Cheers,
Scott

comment:3 Changed 10 months ago by scottan

Those um_tools and scitools changes seem to already be in the MONSooN.rc file:

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL, RETRIES
        pre-script = """
                     module load moose-client-wrapper python/v2.7.9 scitools/production-os41-1 um_tools
                     """

comment:4 Changed 10 months ago by luke

I think the scitools module is a red herring. I've just posted the following to the Monsoon2 Yammer group:

I'm having problems with suite u-bi080 where tasks (postproc and atmos_main) fall into submit-failed, but if I reset them to waiting they run and then the next task in the next cycle falls into submit-failed again. Looking through the site/MONSooN.rc file this suite has 

host = $(rose host-select {{EXTRACT_HOST}}) under [[EXTRACT_RESOURCE]]

host = $(rose host-select xcs-c) under [[HPC]]

host = localhost under [[HOUSEKEEP_RESOURCE]]

with nothing under [[POSTPROC_RESOURCE]] explicitly, but this inherits HPC_SERIAL which inherits HPC so I assume uses the 2nd of the 3 host specifications above.

In a previous thread it was suggested to replace all host lines with 

host = localhost

Could the fact that this hasn’t been done in this suite be the cause of these issues?

Could you check your site/MONSooN.rc file and see how the host is specified? The Monsoon2 advice here

https://collab.metoffice.gov.uk/twiki/bin/view/Support/RetirementOfRoseCylcVMs

says to replace the host line to use localhost, or to remove it entirely.
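
To make that check quicker, here is a minimal sketch (not part of the suite, and not an official tool) that lists every host setting in suite.rc-style text together with the task namespace it sits under. The function name find_host_settings and the example text are hypothetical; the example only mirrors the three settings quoted above and assumes each one sits in a [[[remote]]] subsection. Inherited settings are not resolved, so a family such as POSTPROC_RESOURCE that only inherits a host will not be listed.

```python
import re

# A section header such as [[HPC]] or [[[remote]]]; group 1 holds the opening brackets.
SECTION_RE = re.compile(r"^\s*(\[+)\s*([^\[\]]+?)\s*\]+\s*$")
# A "host = ..." setting; group 1 holds the value.
HOST_RE = re.compile(r"^\s*host\s*=\s*(.+?)\s*$")


def find_host_settings(rc_text):
    """Return (namespace, host value) pairs found in suite.rc-style text."""
    results = []
    namespace = None
    for line in rc_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        section = SECTION_RE.match(line)
        if section:
            # Two brackets mark a task namespace such as [[HPC]];
            # three mark a subsection such as [[[remote]]].
            if len(section.group(1)) == 2:
                namespace = section.group(2)
            continue
        host = HOST_RE.match(line)
        if host:
            results.append((namespace, host.group(1)))
    return results


# Illustrative text only, not a copy of any real site/MONSooN.rc file.
example = """
    [[EXTRACT_RESOURCE]]
        [[[remote]]]
            host = $(rose host-select {{EXTRACT_HOST}})
    [[HPC]]
        [[[remote]]]
            host = $(rose host-select xcs-c)
    [[HOUSEKEEP_RESOURCE]]
        [[[remote]]]
            host = localhost
"""

for namespace, host in find_host_settings(example):
    print(f"{namespace}: {host}")
```

To run it on a real suite, read the contents of site/MONSooN.rc into a string and pass that in place of the example text.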

comment:5 Changed 10 months ago by luke

I have made this change in my suite (host = localhost) but it will be several hours before it gets back to a new postproc task.

comment:6 Changed 10 months ago by scottan

Hi Luke,

I just tried a run with every case of host = xxx changed to host = localhost. I get an immediate "failed" for fcm_make_pp, and fcm_make_um and install_ancil get stuck at "submit-retrying".

cheers,
Scott

comment:7 Changed 10 months ago by luke

Is there any information from any of the log files of these tasks?

comment:8 Changed 10 months ago by scottan

fcm_make_pp (failed) job.err:

/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc1 has no subjectAltName, falling back to check for a commonName for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)

SubjectAltNameWarning

[FAIL] mirror.target = : incorrect value in declaration
[FAIL] config-file=/working/d04/sanic/cylc-run/u-bi356/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg:4
[FAIL] config-file= - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/postproc.cfg@2381:12
[FAIL] config-file= - - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/inc/remote.cfg@2381:6

[FAIL] fcm make -f /working/d04/sanic/cylc-run/u-bi356/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg -C /home/d04/sanic/cylc-run/u-bi356/share/fcm_make_pp -j 4 # return-code=9
2019-04-30T11:15:46Z CRITICAL - failed/EXIT
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc1 has no subjectAltName, falling back to check for a commonName for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)

SubjectAltNameWarning

No error files were made for the other jobs, as I can't submit them in the first place.

Cheers,
Scott

comment:9 Changed 10 months ago by luke

Can you commit your changes to your suite, please?

comment:10 Changed 10 months ago by luke

Actually, after playing around with my vn11.1 suite, I have got the following to work:

  1. Delete all instances of host within the site/MONSooN.rc file (although setting them to localhost should also be equivalent). You'll also need to remove the [[[remote]]] on the line above if there is nothing else in that section.
  2. Under the [[PPBUILD_RESOURCE]] section add the following (in line with the general indentation of text):
            [[[remote]]]
                host = $(rose host-select xcs-c)

Then fcm commit and rose suite-run [--restart|--reload] your suite.

Please try this and let me know how you get on.

Thanks,
Luke

comment:11 Changed 10 months ago by scottan

Hi Luke,

That's still not working. fcm_make_pp succeeded this time, but everything else is hanging. I've fcm committed my changes, so you can check whether I have done it correctly if you want:

u-bi356

Cheers,
Scott

comment:12 Changed 10 months ago by luke

Hi Scott,

Looking at the job-activity.log file of your fcm_make_um task you can see

[jobs-submit cmd] cylc jobs-submit --utc-mode -- /home/d04/sanic/cylc-run/u-bi356/log/job 19880901T0000Z/fcm_make_um/03
[jobs-submit ret_code] 32
[jobs-submit out] 2019-04-30T15:25:39Z|19880901T0000Z/fcm_make_um/03|32|None
2019-04-30T15:25:39Z [STDERR] qsub: error: [PBSInvalidProject] 'ukca-meto' is not valid for collaboration trustzone on XCS
[(('event-mail', 'submission retry'), 3) ret_code] 0

i.e. you haven't changed your Monsoon2 project to ukca-cam. This is set under the Account_Monsoon variable found at
suite conf -> Host Machine -> Met Office / Monsoon.

Please see changeset 115052

https://code.metoffice.gov.uk/trac/roses-u/changeset/115052

Thanks,
Luke
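
For anyone hitting the same submit-retrying loop, here is a minimal sketch of pulling the submission return codes and [STDERR] messages out of job-activity.log text, so that a rejection from qsub stands out. The function name summarise_submission is hypothetical, the sample text is the excerpt quoted above, and the line formats are assumed from that excerpt alone rather than from any cylc specification.

```python
import re

# Lines such as "[jobs-submit ret_code] 32".
RET_CODE_RE = re.compile(r"^\[jobs-submit ret_code\]\s*(\d+)\s*$")
# Lines such as "<timestamp> [STDERR] <message>".
STDERR_RE = re.compile(r"\[STDERR\]\s*(.+)$")


def summarise_submission(log_text):
    """Return (return codes, stderr messages) found in job-activity.log text."""
    ret_codes = []
    errors = []
    for line in log_text.splitlines():
        ret = RET_CODE_RE.match(line)
        if ret:
            ret_codes.append(int(ret.group(1)))
        err = STDERR_RE.search(line)
        if err:
            errors.append(err.group(1))
    return ret_codes, errors


# Excerpt from the fcm_make_um job-activity.log quoted in this ticket.
sample = """\
[jobs-submit cmd] cylc jobs-submit --utc-mode -- /home/d04/sanic/cylc-run/u-bi356/log/job 19880901T0000Z/fcm_make_um/03
[jobs-submit ret_code] 32
[jobs-submit out] 2019-04-30T15:25:39Z|19880901T0000Z/fcm_make_um/03|32|None
2019-04-30T15:25:39Z [STDERR] qsub: error: [PBSInvalidProject] 'ukca-meto' is not valid for collaboration trustzone on XCS
[(('event-mail', 'submission retry'), 3) ret_code] 0
"""

ret_codes, errors = summarise_submission(sample)
print("jobs-submit return codes:", ret_codes)
for message in errors:
    print("stderr:", message)
```

A non-zero jobs-submit return code alongside a qsub error like the one above points at the batch system rejecting the job (here, the project name) rather than at the host settings.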

comment:13 Changed 10 months ago by scottan

Thanks Luke, that's done it.

cheers,
Scott

comment:14 Changed 10 months ago by luke

  • Resolution set to fixed
  • Status changed from new to closed