Opened 4 weeks ago

Closed 3 weeks ago

#3080 closed help (fixed)

Jobs suddenly stuck with submit-retrying

Reported by: taubry
Owned by: um_support
Component: UM Model
Keywords: ARCHER, postproc, submit-retrying
Cc:
Platform: ARCHER
UM Version: 11.2

Description

Dear helpdesk team,

I am running a few jobs (u-bo887, u-bo888, u-bo889, u-bo890 and u-bo892) on ARCHER. I have already run these jobs multiple times without any problem (I'm just changing the parameters of a volcanic injection between runs). However, since yesterday, the jobs fail to submit properly and go to submit-retrying. atmos_main generally ends up going through, but all my postproc tasks fail definitively after a couple of submit retries. If I manually set them to 'waiting' they generally proceed, but that's of course a bit annoying…
My /work space on ARCHER has now reached its quota, but that is not the origin of the problem, just a consequence: the atmos_main tasks kept running while postproc failed.

I suspect that the problem is related to ARCHER as these jobs were running fine before?

Thanks for any clue on how to solve this!

Thomas

Change History (12)

comment:1 Changed 4 weeks ago by grenville

Thomas

The post-proc log says:

[FAIL] disk I/O error

Your /work quota is exhausted.

Grenville

comment:2 Changed 4 weeks ago by taubry

Hi Grenville,

Thanks! When the problem first occurred yesterday, my /work quota was completely fine and I could manually re-trigger postproc tasks that had stopped at submit-failed.
Obviously I could not do this overnight, which is why my /work is now full (atmos_main tasks kept running but postproc tasks got stuck).
That's why I think that my /work being full is not the origin of the problem. I have re-triggered the failed postproc tasks, so hopefully some space on my /work will be freed in the coming hours.

Thomas

comment:3 Changed 4 weeks ago by taubry

I realize now that I indeed have to make some space on my /work to get the postproc tasks to run. Do I have to delete data produced by these jobs, or would it be possible to get a temporary increase of my /work quota just to get these postproc tasks running?

I have run these 5 jobs in parallel previously and my /work never got more than 50% full. So really the reason it is full now is because of postproc tasks getting stuck at submit-retrying since yesterday.

comment:4 Changed 4 weeks ago by grenville

I have increased your quota to 2TB (it may take a short time to be usable).

comment:5 Changed 4 weeks ago by taubry

Hi Grenville,

Thanks! I could now retrigger all postproc tasks and they are running or successfully submitted. However, I got a 'submit-failed' for some of them multiple times again. I will see if things change at the next cycle.

Thomas

comment:6 Changed 4 weeks ago by grenville

What do you get when you run the following (from pumatest)?

rose host-select archer

comment:7 Changed 4 weeks ago by taubry

I get:
login.archer.ac.uk

comment:8 Changed 4 weeks ago by taubry

Some of the tasks now go through without problem, but I still have quite a few ending up with 'submit-failed'. For example, the postproc task of my suite u-bo892 submit-failed. I did not re-trigger it in case you want to have a look. There are no associated job.err, job.out or job-activity.log files (I guess because it did not even submit?), so I'm not sure where to look for what caused the submission failure.

comment:9 Changed 4 weeks ago by grenville

See http://cms.ncas.ac.uk/ticket/3082 for a possible fix.

comment:10 Changed 4 weeks ago by taubry

  • Resolution set to fixed
  • Status changed from new to closed

Thanks! It's been running fine all afternoon now so I will close the ticket. Fingers crossed that the problem doesn't come back.

comment:11 Changed 4 weeks ago by taubry

  • Resolution fixed deleted
  • Status changed from closed to reopened

Not exactly the same problem this morning, but similar. All my atmos_main tasks have been stuck at status 'submitted' since yesterday evening, which is way more than any queuing time I've had with these jobs so I'm a bit suspicious. I don't see anything concerning on the ARCHER status page. I am way below all my quotas on pumatest, and /home and /nerc on ARCHER, and there are ca. 2 MAUs left on n02-chem on which I am running.

The job-activity.log contains the following lines, repeated every 5 min:
[jobs-poll out] 2019-11-19T08:44:52Z|20570101T0000Z/atmos_main/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "6653784.sdb", "batch_sys_exit_polled": 0, "time_submit_exit": "2019-11-18T18:03:55Z"}
[jobs-poll ret_code] 0
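
For anyone reading a poll record like the one above, the following is a minimal Python sketch of how it could be unpacked to see how long the job has been sitting in the batch system. It is not part of Rose or Cylc: the function names `parse_utc` and `parse_jobs_poll_line` are hypothetical, and the line layout (timestamp, task identifier and a JSON payload separated by `|`) is assumed from the single line quoted in this ticket. Reading `batch_sys_exit_polled: 0` as "the batch system has not yet reported the job as exited" is likewise an assumption.

```python
import json
from datetime import datetime, timezone

# Timestamp layout assumed from the log line quoted in this ticket.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(stamp):
    """Turn a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware datetime."""
    return datetime.strptime(stamp, TIME_FORMAT).replace(tzinfo=timezone.utc)


def parse_jobs_poll_line(line):
    """Split one '[jobs-poll out]' line into its fields (hypothetical helper)."""
    prefix = "[jobs-poll out] "
    if not line.startswith(prefix):
        raise ValueError("not a jobs-poll out line")
    # Split on the first two '|' only, so the JSON payload stays intact.
    polled_at, task_id, payload = line[len(prefix):].split("|", 2)
    info = json.loads(payload)
    return {
        "polled_at": parse_utc(polled_at),
        "task_id": task_id,
        "job_id": info["batch_sys_job_id"],
        "exit_polled": info["batch_sys_exit_polled"],
        "submitted_at": parse_utc(info["time_submit_exit"]),
    }


# The line reported in this ticket.
line = (
    '[jobs-poll out] 2019-11-19T08:44:52Z|20570101T0000Z/atmos_main/01|'
    '{"batch_sys_name": "pbs", "batch_sys_job_id": "6653784.sdb", '
    '"batch_sys_exit_polled": 0, "time_submit_exit": "2019-11-18T18:03:55Z"}'
)

record = parse_jobs_poll_line(line)
# Time between submission and this poll.
waiting = record["polled_at"] - record["submitted_at"]
print(record["job_id"], record["exit_polled"], waiting)
```

For the line quoted here, the gap between submission and the poll is 14 hours, 40 minutes and 57 seconds, which matches the report that the task had been at 'submitted' since the previous evening.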

I will try to retrigger my jobs u-bo887-890 but leave u-bo892 as is.

Thanks,

Thomas

comment:12 Changed 3 weeks ago by taubry

  • Resolution set to fixed
  • Status changed from reopened to closed