Opened 6 months ago

Closed 5 months ago

Last modified 5 months ago

#2799 closed help (fixed)

Job keeps getting stuck "submit-retrying"

Reported by: scottan Owned by: um_support
Component: UM Model Keywords: UKCA, SC0138
Cc: Platform: Monsoon2
UM Version: 10.9

Description

Dear help desk,

My job keeps getting stuck at the "submit-retrying" stage, usually when running postproc. Each month I have to restart it manually by setting the state to "waiting", which is not practical as I have upcoming deadlines that I need these simulations ready for. I am running suite be592 on MONSOON2.

Kind regards,
Scott

Attachments (1)

Screen Shot 2019-03-05 at 13.30.30.png (76.4 KB) - added by scottan 6 months ago.
Screenshot of Cylc when stuck retrying job submitting

Change History (14)

Changed 6 months ago by scottan

Screenshot of Cylc when stuck retrying job submitting

comment:1 Changed 6 months ago by ros

  • Status changed from new to pending

Hi Scott,

The Met Office are looking for a solution to the mktemp error your suite is exhibiting and which has been seen in some other suites.

Regards,
Ros.

P.S. If you contact both the CMS and Monsoon helpdesks with the same query, please indicate that in the ticket so that we don't waste time looking into problems that are being dealt with elsewhere. Thanks.

comment:2 Changed 6 months ago by scottan

Dear Ros,

Thank you for looking into this. Please let me know if any solution is found to this issue, as it is getting quite urgent that these runs are finished and it is not running as fast as it should be due to this error.

Apologies for not informing you that I had sent questions to both services; this was just what I was advised to do by coworkers. I will make sure to inform you if I do this in the future. The latest advice from the MONSooN helpdesk did not work:

I have been advised that you are using "host = $(rose host-select xcs-c)", can you replace it with "host = localhost".

This caused fcm_make_pp to fail, a problem I was having before when MONSOON switched over to the xcs servers.

Kind regards,
Scott

comment:3 Changed 6 months ago by ros

Hi Scott,

I or the Met Office will be in touch as soon as there is any news, but it will take a little time as they are currently looking at making a change to the software.

You could try running the first cycle with the rose host-select xcs-c and then, once all the fcm_make tasks have finished, change the host to localhost and do a rose suite-run --reload so all subsequent tasks pick up the change. It's a bit of a hack but might work and get you going quicker whilst we're waiting for a fix.
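
As a rough sketch of the host switch described above (not an official procedure), the helper below rewrites the host line in a suite's site file. The function name use_localhost is hypothetical, and it assumes the host line is written exactly as quoted earlier in this ticket.

```shell
# Hypothetical helper: point the [[[remote]]] host at localhost in a site file.
# A backup of the original is kept alongside it with a .bak suffix.
use_localhost() {
    sed -i.bak 's|host = \$(rose host-select xcs-c)|host = localhost|' "$1"
}
```

Once the fcm_make tasks have finished, something like the following, run from the suite's working copy (assumed here to contain site/MONSooN.rc), would apply the change and reload:

    use_localhost site/MONSooN.rc
    rose suite-run --reload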

Cheers,
Ros.

comment:4 Changed 6 months ago by scottan

Hi Ros,

Thanks for the advice. It just got to the end of a month and managed to pass OK (without me making any changes), which is the first time it's done that for a while! I will keep an eye on it and if it fails I will try restarting it without running any of the fcm_make jobs and with host = localhost to see if that helps. Please let me know if you hear of a long-term fix.

Thanks,
Scott

comment:5 Changed 6 months ago by ros

Hi Scott,

I have just been informed that a "fix" has now been deployed into cylc-7.8.1 on Monsoon. Hopefully this will fix your problem. Please make sure you are using cylc-7.8.1. You can check this by running cylc --version on the xcslc0 command line.
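
If you want to script the version check, a minimal sketch is below. The function name version_at_least is hypothetical, and it assumes cylc --version prints a bare version string such as 7.8.1. Since the fix was deployed into the site's 7.8.1 installation, the version number alone may not prove the patch is present.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example:

    version_at_least "$(cylc --version)" 7.8.1 && echo "cylc is new enough"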

Regards,
Ros.

comment:6 Changed 6 months ago by scottan

Dear Ros,

Thank you for informing me. I have just checked and I am running 7.8.1. Do I need to restart runs that are already going or can I just let them carry on?

Thanks,
Scott

comment:7 Changed 6 months ago by ros

Hi Scott,

I think you'll be OK. I suspect you've already picked up the change, which would be why your suite got past the end of the month earlier. The fix was deployed early this afternoon.

If you do encounter a problem then yes do stop and restart the suite.

Regards,
Ros.

comment:8 Changed 6 months ago by scottan

Hi Ros,

My run just suddenly failed with no discernible error, and I have been unable to get it running again; I keep getting submit_failed errors. I've tried various combinations of switching host to localhost and turning off the compile steps, but it makes no difference: none of the cylc steps are running. Do you know of any reason why this might have started to happen?

Kind regards,
Scott

comment:9 Changed 6 months ago by ros

Hi Scott,

See the job-activity.log for any of the tasks:

2019-03-07T10:20:14Z [STDERR] qsub: cannot connect to server xcs00 (errno=113)
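
To scan every task's job-activity.log for messages like this in one go, a sketch along these lines may help. The function name find_submit_errors is hypothetical, and the path in the usage note assumes the default cylc 7 run directory layout.

```shell
# Hypothetical helper: list [STDERR] lines from every job-activity.log under a
# directory, each prefixed with the file it came from.
find_submit_errors() {
    grep -rH --include='job-activity.log' -F '[STDERR]' "$1"
}
```

For example:

    find_submit_errors ~/cylc-run/be592/log/job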

It looks like you started the run from scratch again rather than just doing a restart, so all the logs from the original failure have been deleted and I won't be able to tell you what originally went wrong.

I and others are getting the same error. I've emailed the Monsoon team.

Regards,
Ros.

comment:10 Changed 6 months ago by ros

Hi Scott,

Should be all up and running; try submitting again now.

Regards,
Ros.

comment:11 Changed 6 months ago by scottan

Hi Ros,

Yes, just resubmitted it and seems to have gotten started again ok. For the record, I suspect the previous error was connected (i.e. it was a problem with the server/machine, not the model). It only crashed when it got to the end of the atmos_main job, then when I tried to look at the error files it could not find them through the cylc window. I tried doing a soft reset first and that didn't work, then did a hard reset.

Will let you know if I have any further problems, but fingers crossed will be able to run ok now.

Thanks again,
Scott

comment:12 Changed 5 months ago by ros

  • Resolution set to fixed
  • Status changed from pending to closed

I will close this ticket now. If you find the problem hasn't been fixed then please reopen this ticket.

Cheers,
Ros.

comment:13 Changed 5 months ago by scottan

FYI, I had this problem again but found a solution that seems to work: set the HPC tasks to use xcs-c, but set postproc specifically to run on localhost. In site/MONSooN.rc:

    [[HPC]]
        pre-script = module load cray-netcdf
        [[[directives]]]
            -W umask=0022
            -P={{ACCOUNT_MONSOON}}
        [[[environment]]]
            PLATFORM = xc40
            UMDIR = /projects/um1
        [[[job]]]
            batch system = pbs
            submission retry delays = 2*PT30S,PT5M,PT15M,PT30M,PT1H
        [[[remote]]]
            host = $(rose host-select xcs-c)

...

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL, RETRIES
        pre-script = """
            module load um_tools
            module load moose-client-wrapper python/v2.7.9
            PYTHONPATH=${PYTHONPATH}:/projects/um1/lib/python2.7/
        """
        {% if mooproject is defined %}
        script = {{TASK_RUN_COMMAND}} --define="[namelist:suitegen]mooproject={{MOOPROJECT}}"
        {% endif %}
        [[[job]]]
            execution time limit = PT1H
        [[[remote]]]
            host = localhost
