Opened 8 months ago

Closed 8 months ago

#2890 closed error (fixed)

11.3 job hanging on MONSooN2

Reported by: scottan Owned by: um_support
Component: UM Model Keywords: UKCA, SC0138
Cc: Platform: Monsoon2
UM Version: <select version>

Description

Hello,

I'm having the same problem again, with tasks getting stuck in submit-retrying. This is on version 11.3, using a freshly copied suite from u-bg779. The MONSooN.rc file is different from previous versions, so I was not sure what to change.

http://www.ukca.ac.uk/wiki/index.php/GA7.1_StratTrop_suites#TS2000_free-running_suites

Note the most recent version I could select in the dropdown menu below was 11.2.

Many thanks,
Scott

Change History (14)

comment:1 Changed 8 months ago by luke

Hi Scott,

If you reset the state of the tasks to "waiting", does it then proceed but then hang again at the next cycle point?

I have been seeing this behaviour in one of my AerChemMIP AMIP experiments at vn11.1.

Due to mule changes the um_tools module needs to be loaded, and the scitools module was fixed at scitools/production-os41-1 - I wonder if this has anything to do with it. Which task is the one getting stuck?

Thanks,
Luke

comment:2 Changed 8 months ago by scottan

Hi Luke,

Sorry, I should have said that all of the tasks are getting stuck, and they get stuck again when I reset them to "waiting". In that sense, it does look different to when I had a similar problem before.

Cheers,
Scott

comment:3 Changed 8 months ago by scottan

Those um_tools and scitools changes seem to already be in the MONSooN.rc file:

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL, RETRIES
        pre-script = """
                     module load moose-client-wrapper python/v2.7.9 scitools/production-os41-1 um_tools
                     """

comment:4 Changed 8 months ago by luke

I think scitools is a red herring. I've just posted the following to the Monsoon2 Yammer group:

I'm having problems with suite u-bi080 where tasks (postproc and atmos_main) fall into submit-failed, but if I reset them to waiting they run and then the next task in the next cycle falls into submit-failed again. Looking through the site/MONSooN.rc file this suite has 

host = $(rose host-select {{EXTRACT_HOST}}) under [[EXTRACT_RESOURCE]]

host = $(rose host-select xcs-c) under [[HPC]]

host = localhost under [[HOUSEKEEP_RESOURCE]]

with nothing under [[POSTPROC_RESOURCE]] explicitly, but this inherits HPC_SERIAL which inherits HPC so I assume uses the 2nd of the 3 host specifications above.

In a previous thread it was suggested to replace all host lines with 

host = localhost

Could the fact that this hasn’t been done in this suite be the cause of these issues?

Could you check your site/MONSooN.rc file and see how the host is specified? The Monsoon2 advice here

https://collab.metoffice.gov.uk/twiki/bin/view/Support/RetirementOfRoseCylcVMs

says to change the host line to use localhost, or to remove it entirely.
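
As a quick way to do that check, here is a minimal Python sketch that lists every host setting in suite.rc-style text together with the section it sits in. The function name find_host_settings and the example layout (host lines sitting under [[[remote]]] subsections) are assumptions for illustration; the real site/MONSooN.rc may be laid out differently.

```python
import re

# [[NAME]] or [[[name]]] headers; the number of brackets gives the nesting depth.
SECTION_RE = re.compile(r"^\s*(\[+)\s*([^\[\]]+?)\s*\]+\s*$")
# "host = <anything>" settings.
HOST_RE = re.compile(r"^\s*host\s*=\s*(.+?)\s*$")


def find_host_settings(rc_text):
    """Return (section_path, host_value) pairs found in suite.rc-style text."""
    levels = {}  # nesting depth -> section name currently open at that depth
    found = []
    for line in rc_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        section = SECTION_RE.match(line)
        if section:
            depth = len(section.group(1))
            # A new header closes every section at the same or deeper level.
            levels = {d: n for d, n in levels.items() if d < depth}
            levels[depth] = section.group(2)
            continue
        host = HOST_RE.match(line)
        if host:
            path = " > ".join(levels[d] for d in sorted(levels))
            found.append((path, host.group(1)))
    return found


# Assumed layout, built from the three host lines quoted above.
example = """
    [[EXTRACT_RESOURCE]]
        [[[remote]]]
            host = $(rose host-select {{EXTRACT_HOST}})
    [[HPC]]
        [[[remote]]]
            host = $(rose host-select xcs-c)
    [[HOUSEKEEP_RESOURCE]]
        [[[remote]]]
            host = localhost
"""

for section, host in find_host_settings(example):
    print(section, "->", host)
```

To run it against a real file, pass open("site/MONSooN.rc").read() instead of the example string. The sketch does not follow inherit = lines, so a section such as [[POSTPROC_RESOURCE]] that only inherits its host will not appear in the output.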

comment:5 Changed 8 months ago by luke

I have made this change in my suite (host = localhost) but it will be several hours before it gets back to a new postproc task.

comment:6 Changed 8 months ago by scottan

Hi Luke,

Just tried a run with every case of host = xxx changed to host = localhost. I get an immediate "failed" for fcm_make_pp, and fcm_make_um and install_ancil get stuck in "submit-retrying".

cheers,
Scott

comment:7 Changed 8 months ago by luke

Is there any information from any of the log files of these tasks?

comment:8 Changed 8 months ago by scottan

fcm_make_pp (failed) job.err:

/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc1 has no subjectAltName, falling back to check for a commonName for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning

[FAIL] mirror.target = : incorrect value in declaration
[FAIL] config-file=/working/d04/sanic/cylc-run/u-bi356/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg:4
[FAIL] config-file= - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/postproc.cfg@2381:12
[FAIL] config-file= - - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/inc/remote.cfg@2381:6

[FAIL] fcm make -f /working/d04/sanic/cylc-run/u-bi356/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg -C /home/d04/sanic/cylc-run/u-bi356/share/fcm_make_pp -j 4 # return-code=9
2019-04-30T11:15:46Z CRITICAL - failed/EXIT
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc1 has no subjectAltName, falling back to check for a commonName for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning

No error files were made for the other jobs, as I can't submit them in the first place.

Cheers,
Scott

comment:9 Changed 8 months ago by luke

Can you commit your changes to your suite, please?

comment:10 Changed 8 months ago by luke

Actually, after playing around with my vn11.1 suite, I have got the following to work:

  1. Delete all instances of host within the site/MONSooN.rc file (although setting them to localhost should also be equivalent). You'll also need to remove the [[[remote]]] on the line above if there is nothing else in that section.
  2. Under the [[PPBUILD_RESOURCE]] section add the following (in line with the general indentation of text):
            [[[remote]]]
                host = $(rose host-select xcs-c)

Then fcm commit and rose suite-run [--restart|--reload] your suite.
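
Step 1 above can be sketched in Python as follows. This is only an illustration of the edit, not a tested tool: the function name strip_hosts is hypothetical, and the [[[job]]] and owner settings in the example are placeholders rather than lines taken from the real file. Step 2 (adding the host back under [[PPBUILD_RESOURCE]]) is left as a manual edit.

```python
import re

HOST_LINE = re.compile(r"^\s*host\s*=")
REMOTE_HEADER = re.compile(r"^\s*\[\[\[\s*remote\s*\]\]\]\s*$")
ANY_HEADER = re.compile(r"^\s*\[")


def strip_hosts(rc_text):
    """Drop every 'host = ...' line, then any [[[remote]]] header left empty."""
    lines = [l for l in rc_text.splitlines() if not HOST_LINE.match(l)]
    kept = []
    for i, line in enumerate(lines):
        if REMOTE_HEADER.match(line):
            # Look ahead to the next non-blank line. If it is another header,
            # or there is nothing left, this [[[remote]]] section is empty.
            rest = [l for l in lines[i + 1:] if l.strip()]
            if not rest or ANY_HEADER.match(rest[0]):
                continue
        kept.append(line)
    return "\n".join(kept)


# Illustrative input: the [[[job]]] block is a placeholder, not from the ticket.
before = """[[HPC]]
    [[[remote]]]
        host = $(rose host-select xcs-c)
    [[[job]]]
        batch system = pbs
"""

print(strip_hosts(before))
```

A [[[remote]]] section that still holds another setting, or even just a comment, is deliberately kept, so review the result by hand before committing.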

Please try this and let me know how you get on.

Thanks,
Luke

comment:11 Changed 8 months ago by scottan

Hi Luke,

That's still not working. fcm_make_pp succeeded this time, but everything else is hanging. I've fcm committed my changes, so you can check whether I have done it correctly if you want:

u-bi356

Cheers,
Scott

comment:12 Changed 8 months ago by luke

Hi Scott,

Looking at the job-activity.log file of your fcm_make_um task you can see

[jobs-submit cmd] cylc jobs-submit --utc-mode -- /home/d04/sanic/cylc-run/u-bi356/log/job 19880901T0000Z/fcm_make_um/03
[jobs-submit ret_code] 32
[jobs-submit out] 2019-04-30T15:25:39Z|19880901T0000Z/fcm_make_um/03|32|None
2019-04-30T15:25:39Z [STDERR] qsub: error: [PBSInvalidProject] 'ukca-meto' is not valid for collaboration trustzone on XCS
[(('event-mail', 'submission retry'), 3) ret_code] 0

i.e. you haven't changed your Monsoon2 project to ukca-cam. This is set under the Account_Monsoon variable found at
suite conf -> Host Machine -> Met Office / Monsoon.
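
For anyone hitting a similar submit-retrying loop, the check above can be sketched as a small Python helper that pulls the jobs-submit return codes and [STDERR] messages out of a job-activity.log. The function name submission_errors is hypothetical, and the patterns are based only on the log lines quoted in this comment, so other Cylc versions may format the file differently.

```python
import re

# Lines such as "2019-04-30T15:25:39Z [STDERR] qsub: error: ..."
STDERR_RE = re.compile(r"^(\S+)\s+\[STDERR\]\s+(.*)$")
# Lines such as "[jobs-submit ret_code] 32"
RET_CODE_RE = re.compile(r"^\[jobs-submit ret_code\]\s+(\d+)$")


def submission_errors(log_text):
    """Return (return_codes, stderr_messages) from job-activity.log text."""
    codes, messages = [], []
    for line in log_text.splitlines():
        line = line.strip()
        match = RET_CODE_RE.match(line)
        if match:
            codes.append(int(match.group(1)))
            continue
        match = STDERR_RE.match(line)
        if match:
            messages.append(match.group(2))
    return codes, messages


# The log excerpt quoted above.
log = """[jobs-submit cmd] cylc jobs-submit --utc-mode -- /home/d04/sanic/cylc-run/u-bi356/log/job 19880901T0000Z/fcm_make_um/03
[jobs-submit ret_code] 32
[jobs-submit out] 2019-04-30T15:25:39Z|19880901T0000Z/fcm_make_um/03|32|None
2019-04-30T15:25:39Z [STDERR] qsub: error: [PBSInvalidProject] 'ukca-meto' is not valid for collaboration trustzone on XCS
[(('event-mail', 'submission retry'), 3) ret_code] 0
"""

codes, messages = submission_errors(log)
print(codes)
for message in messages:
    print(message)
```

A non-zero jobs-submit return code together with a qsub message like the one here points at the batch system rejecting the job, rather than at the host settings.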

Please see changeset 115052

https://code.metoffice.gov.uk/trac/roses-u/changeset/115052

Thanks,
Luke

comment:13 Changed 8 months ago by scottan

Thanks Luke, that's done it.

cheers,
Scott

comment:14 Changed 8 months ago by luke

  • Resolution set to fixed
  • Status changed from new to closed