Opened 16 months ago
Closed 16 months ago
#3080 closed help (fixed)
Jobs suddenly stuck with submit-retrying
| Reported by: | taubry | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | ARCHER, postproc, submit-retrying |
| Cc: | | Platform: | ARCHER |
| UM Version: | 11.2 | | |
Description
Dear helpdesk team,
I am running a few jobs (u-bo887, 888, 889, 890, 892) on ARCHER. I have already run these jobs multiple times without any problem (I'm just changing the parameters of a volcanic injection between runs). However, since yesterday, the jobs fail to submit properly and go to submit-retrying. atmos_main generally ends up going through, but all my postproc tasks fail definitively after a couple of submit-retries. If I manually set them to 'waiting' then they generally proceed, but that's of course a bit annoying…
My /work space on ARCHER has now reached its quota but it's not the origin of the problem, just a consequence as the atmos_main tasks kept running while postproc failed.
I suspect that the problem is related to ARCHER as these jobs were running fine before?
Thanks for any clue on how to solve this!
Thomas
Change History (12)
comment:1 Changed 16 months ago by grenville
comment:2 Changed 16 months ago by taubry
Hi Grenville,
Thanks! When the problem first occurred yesterday, my /work quota was completely fine and I could manually re-trigger postproc tasks stopped at submit-fail.
Obviously I could not do this overnight which is why my /work is now full (atmos_main tasks kept running but postproc tasks got stuck).
That's why I think that my /work being full is not the origin of the problem? I re-triggered failed postproc tasks so hopefully I should free some space on my /work in the coming hours.
Thomas
comment:3 Changed 16 months ago by taubry
I realize now that I indeed have to make some space on my /work to get the postproc tasks to run. Do I have to delete data produced by these jobs, or would it be possible to get a temporary increase of my /work quota just to get these postproc tasks running?
I have run these 5 jobs in parallel previously and my /work never got more than 50% full. So really the reason it is full now is because of postproc tasks getting stuck at submit-retrying since yesterday.
comment:4 Changed 16 months ago by grenville
I have increased your quota to 2TB (it may take a short time to be usable).
comment:5 Changed 16 months ago by taubry
Hi Grenville,
Thanks! I could now retrigger all postproc tasks and they are running or successfully submitted. However, I got a 'submit-failed' for some of them multiple times again. I will see if things change at the next cycle.
Thomas
comment:6 Changed 16 months ago by grenville
What do you get when you run the following from pumatest?
rose host-select archer
comment:7 Changed 16 months ago by taubry
I get:
login.archer.ac.uk
comment:8 Changed 16 months ago by taubry
Some of the tasks now go through without problem, but I still have quite a few ending up with 'submit-fail'. For example, the postproc task of my suite u-bo892 submit-failed. I did not re-trigger it in case you want to have a look. There are no associated job.err, job.out or job-activity (I guess because it did not even submit?) so I'm not sure where to look for what caused the submission failure.
comment:9 Changed 16 months ago by grenville
see http://cms.ncas.ac.uk/ticket/3082 for a possible fix
comment:10 Changed 16 months ago by taubry
- Resolution set to fixed
- Status changed from new to closed
Thanks! It's been running fine all afternoon now so I will close the ticket. Fingers crossed that the problem doesn't come back.
comment:11 Changed 16 months ago by taubry
- Resolution fixed deleted
- Status changed from closed to reopened
Not exactly the same problem this morning, but similar. All my atmos_main tasks have been stuck at status 'submitted' since yesterday evening, which is way more than any queuing time I've had with these jobs so I'm a bit suspicious. I don't see anything concerning on the ARCHER status page. I am way below all my quotas on pumatest, and /home and /nerc on ARCHER, and there are ca. 2 MAUs left on n02-chem on which I am running.
The job-activity.log file contains the following lines, repeated every 5 min:
[jobs-poll out] 2019-11-19T08:44:52Z|20570101T0000Z/atmos_main/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "6653784.sdb", "batch_sys_exit_polled": 0, "time_submit_exit": "2019-11-18T18:03:55Z"}
[jobs-poll ret_code] 0
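For anyone reading these poll lines later, here is a minimal sketch of how one could be unpacked in Python. It assumes the layout seen in the line above (poll time, task ID and a JSON record, separated by `|`); the helper name `parse_poll_line` is hypothetical and not part of Cylc or Rose.

```python
import json
from datetime import datetime

def parse_poll_line(line):
    """Split a '[jobs-poll out]' line into poll time, task ID and JSON record.

    Assumes the pipe-separated layout seen in the log excerpt above.
    """
    prefix = "[jobs-poll out] "
    if not line.startswith(prefix):
        raise ValueError("not a jobs-poll out line")
    # Split on the first two pipes only, so the JSON payload stays intact.
    poll_time, task_id, payload = line[len(prefix):].split("|", 2)
    return poll_time, task_id, json.loads(payload)

line = ('[jobs-poll out] 2019-11-19T08:44:52Z|20570101T0000Z/atmos_main/01|'
        '{"batch_sys_name": "pbs", "batch_sys_job_id": "6653784.sdb", '
        '"batch_sys_exit_polled": 0, "time_submit_exit": "2019-11-18T18:03:55Z"}')

poll_time, task_id, info = parse_poll_line(line)

# Time elapsed between the end of submission and this poll.
fmt = "%Y-%m-%dT%H:%M:%SZ"
waited = datetime.strptime(poll_time, fmt) - datetime.strptime(info["time_submit_exit"], fmt)

print(task_id, info["batch_sys_job_id"], waited)
```

For the line quoted here, the gap between `time_submit_exit` and the poll time is a little over 14.5 hours, which matches the description of tasks sitting at 'submitted' since the previous evening. The batch job ID (`6653784.sdb`) is what one would look up in the PBS queue on ARCHER.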
I will try to retrigger my jobs u-bo887-890 but leave u-bo892 as is.
Thanks,
Thomas
comment:12 Changed 16 months ago by taubry
- Resolution set to fixed
- Status changed from reopened to closed
Thomas
The post-proc log says:
[FAIL] disk I/O error
Your /work quota is exhausted.
Grenville