Opened 3 years ago

Closed 3 years ago

#1925 closed help (answered)

FW: UM submit fails - timeouts during submit?

Reported by: ros Owned by: um_support
Component: Rose Keywords:
Cc: Platform: MONSooN
UM Version: 10.3

Description (last modified by ros)

Forwarded from MONSooN


Hi,

I'm trying to run a ROSE suite on MONSooN which is almost identical to one I've just run successfully. The suite fails at a random job because the job cannot be submitted. Since the job hasn't run there's no output, and all I can see in the log directory is a "job-activity.log" file. The contents of this file for the latest failure are:

2016-07-25T14:24:06Z [job-submit cmd] (prepare job file)
2016-07-25T14:24:06Z [job-submit ret_code] 1
2016-07-25T14:24:06Z [job-submit err]

Host selection by $(rose host-select xcm) failed:
  ERROR: command timed out (>10s), terminated by signal 15

rose host-select xcm

2016-07-25T14:24:19Z [(('event-handler-00', 'submission failed'), 1) cmd] rose suite-hook --mail 'submission failed' 'u-af015' 'glm_um_fcst_000.20150612T1200Z' 'job submission failed'
2016-07-25T14:24:19Z [(('event-handler-00', 'submission failed'), 1) ret_code] 0

This is with a clean start of the ROSE suite. I can't see anywhere I can up the timeout limit from 10s. My suite is u-af015.

Any assistance you could provide in fixing my problem would be greatly appreciated.

best wishes,
John

Change History (3)

comment:1 Changed 3 years ago by ros

Hi,

I'm writing again about the problem I'm having with my ROSE suite (following up from an email I sent on Monday). I have now found a way to keep the suite running. I'm still getting submit-failed error messages, but if I edit the latest state file from the suite, find the job that has failed and change it back from "submit-failed" to "waiting", then "rose suite-run --restart" is able to restart it (if I don't do this, the restart doesn't work). It does mean I have to keep a close eye on the suite, but at least I should be able to get the simulations completed.


cheers,

John

comment:2 Changed 3 years ago by ros

  • Description modified (diff)

comment:3 Changed 3 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed

Matt's response:

If rose host-select xcm fails with "command timed out", it means that exvmscylc was unable to connect to xcml01 via SSH within 10 seconds. Has there been a network issue?
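
One way to check whether the connection really is slow is to time it from the Cylc host. The following is only a rough sketch: the helper name time_command is made up for illustration, and the ssh invocation in the comment is an assumed way of testing the connection, not what rose host-select itself runs.

```python
import subprocess
import time


def time_command(cmd, timeout=10.0):
    """Run cmd and return (elapsed_seconds, timed_out).

    Hypothetical helper for illustration only; it is not part of Rose or Cylc.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # The child process is killed once the timeout is exceeded.
        timed_out = True
    return time.monotonic() - start, timed_out


# Assumed usage, run by hand on the Cylc host (not executed here):
#   elapsed, timed_out = time_command(
#       ["ssh", "-o", "BatchMode=yes", "xcml01", "true"])
#   print(f"{elapsed:.1f}s, timed out: {timed_out}")
```

If the elapsed time is regularly close to or over 10 seconds, that would point towards a network or login-node problem rather than anything in the suite.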

I can't see why the suite needs to be shut down and restarted to deal with a task submission failure. All you should have to do is retrigger the submit-failed tasks, e.g. with cylc trigger SUITE '*:submit-failed'.

In order to improve the robustness of the suite, we recommend that you add some submission retries to your tasks, e.g.:

[runtime]
    [[root]]
        [[[job submission]]]
            retry delays = 3*PT5M,3*PT15M,3*PT30M
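
To see what that setting adds up to, here is a small sketch that expands the list, assuming the N*PTnM form simply repeats the delay N times. The function expand_retry_delays is made up for this illustration and is not part of Cylc; it only handles hour and minute durations.

```python
import re


def expand_retry_delays(spec):
    """Expand e.g. '3*PT5M,PT1H' into a list of delays in minutes.

    Hypothetical helper: handles only the 'N*PTnHnM' forms, not full ISO 8601.
    """
    delays = []
    for item in spec.split(","):
        item = item.strip()
        # Split an optional 'N*' multiplier from the duration itself.
        count, _, duration = item.rpartition("*")
        repeats = int(count) if count else 1
        match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
        if match is None or not any(match.groups()):
            raise ValueError(f"unsupported duration: {duration!r}")
        hours, minutes = (int(g) if g else 0 for g in match.groups())
        delays.extend([hours * 60 + minutes] * repeats)
    return delays


delays = expand_retry_delays("3*PT5M,3*PT15M,3*PT30M")
print(len(delays), "retries, waiting up to", sum(delays), "minutes in total")
```

Under that assumption the setting gives nine resubmission attempts spread over 150 minutes of waiting, which should ride out a short network interruption without manual retriggering.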

