Opened 5 years ago
Closed 5 years ago
#1925 closed help (answered)
FW: UM submit fails - timeouts during submit?
Reported by: | ros | Owned by: | um_support |
---|---|---|---|
Component: | Rose | Keywords: | |
Cc: | | Platform: | MONSooN |
UM Version: | 10.3 | | |
Description (last modified by ros)
Forwarded from MONSooN
Hi,
I'm trying to run a ROSE suite on MONSooN which is almost identical to one I've just run successfully. The suite fails at a random job each time, and the failure is always a failure to submit the job. Since the job hasn't run there's no output, and all I can see in the log directory is a "job-activity.log" file. The contents of this file for the latest failure are:
2016-07-25T14:24:06Z [job-submit cmd] (prepare job file)
2016-07-25T14:24:06Z [job-submit ret_code] 1
2016-07-25T14:24:06Z [job-submit err] Host selection by $(rose host-select xcm) failed: ERROR: command timed out (>10s), terminated by signal 15 rose host-select xcm
2016-07-25T14:24:19Z [(('event-handler-00', 'submission failed'), 1) cmd] rose suite-hook --mail 'submission failed' 'u-af015' 'glm_um_fcst_000.20150612T1200Z' 'job submission failed'
2016-07-25T14:24:19Z [(('event-handler-00', 'submission failed'), 1) ret_code] 0
This is with a clean start of the ROSE suite. I can't see anywhere I can up the timeout limit from 10s. My suite is u-af015.
Any assistance you could provide in fixing my problem would be greatly appreciated.
best wishes,
John
Change History (3)
comment:1 Changed 5 years ago by ros
comment:2 Changed 5 years ago by ros
- Description modified (diff)
comment:3 Changed 5 years ago by ros
- Resolution set to answered
- Status changed from new to closed
Matt's response:
If rose host-select xcm fails with "command timed out", it means that exvmscylc was unable to connect to xcml01 via SSH within 10 seconds. Has there been a network issue?
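One way to check whether the connection is slow is to time a command against the same 10 second limit. The following is only a minimal sketch: the helper name time_command and the harmless self-test command are assumptions made for illustration, not part of Rose or cylc.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout=10.0):
    """Run cmd and return (ok, seconds).

    ok is False if the command exits non-zero, cannot be started,
    or is still running after `timeout` seconds.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the executable could not be launched at all.
        ok = False
    return ok, time.monotonic() - start


# Harmless self-test so the sketch runs anywhere: a no-op Python process.
SELF_TEST = [sys.executable, "-c", "pass"]
ok, seconds = time_command(SELF_TEST)
print("ok" if ok else "failed", f"after {seconds:.2f}s")
```

On exvmscylc the same helper could be pointed at ["rose", "host-select", "xcm"] or at ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", "xcml01", "true"]. Repeated runs that come close to or exceed 10 seconds would point to the network or the login node rather than the suite.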
I can't see why the suite needs to be shut down and restarted to deal with a task submission failure. All you should have to do is retrigger the submit-failed tasks, e.g. with cylc trigger SUITE '*:submit-failed'.
In order to improve the robustness of the suite, we recommend that you add some submission retries to your tasks, e.g.:
[runtime]
    [[root]]
        [[[job submission]]]
            retry delays = 3*PT5M,3*PT15M,3*PT30M
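To see what that retry setting amounts to, the list can be expanded by hand: each N*PTxM entry means N retries, each waiting x minutes. The sketch below is illustrative only. The function expand_retry_delays is a hypothetical helper, not a cylc API, and it handles only whole-minute durations of the form used above.

```python
import re


def expand_retry_delays(spec):
    """Expand a retry delay list such as '3*PT5M,PT15M' into minutes per retry.

    Only the 'N*PTxM' and 'PTxM' forms are handled; this is a
    simplification, not a full ISO 8601 duration parser.
    """
    delays = []
    for item in spec.split(","):
        item = item.strip()
        match = re.fullmatch(r"(?:(\d+)\*)?PT(\d+)M", item)
        if match is None:
            raise ValueError(f"unsupported delay entry: {item!r}")
        count = int(match.group(1) or 1)  # no multiplier means one retry
        delays.extend([int(match.group(2))] * count)
    return delays


delays = expand_retry_delays("3*PT5M,3*PT15M,3*PT30M")
print(len(delays), "retries spanning", sum(delays), "minutes of waiting")
# -> 9 retries spanning 150 minutes of waiting
```

With this setting a task that fails to submit is retried up to nine times over two and a half hours, which should ride out a short network interruption without any manual retriggering.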
Hi,
I'm writing again about the problem I'm having with my ROSE suite (following up from an email I sent on Monday). I have now got a way to keep the suite running. I'm still getting submit-failed error messages, but if I edit the latest state file from the suite, find the job that has failed and change its state from "submit-failed" back to "waiting", then "rose suite-run --restart" is able to restart it (if I don't do this, the restart doesn't work). It does mean I have to keep a close eye on the suite, but at least I should be able to get the simulations completed.
cheers,
John