#2985 closed help (wontfix)

Job held in ARCHER short queue

Reported by: taubry Owned by: um_support
Component: UM Model Keywords: ARCHER, queue
Cc: Platform: ARCHER
UM Version: 11.2

Description

Hi,

I am running a trial job with my AMIP suite u-bk166 on ARCHER. I submitted the job to the short queue, which had never given me trouble until today. The suite reaches the atmos_main stage without problem, but then the job gets held. The cylc-run gui shows it as "submitted", but it stayed in that state for a suspiciously long time, and qstat -u ta460 returns:

Job ID       Username  Queue     Jobname     SessID  NDS  TSK  Memory  Time   S  Time
6400060.sdb  ta460     R6385803  atmos_main  --      8    192  --      00:02  H  --

so it looks like the status is held. I checked that the n02-ncas budget on which I'm running has not run out. I also tried to kill and restart atmos_main with no success.
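As a rough illustration of reading that qstat row, here is a minimal Python sketch that splits the line into fields and checks the state column. The field names, the function names and the assumption that every column is whitespace-separated with no embedded spaces are mine, inferred only from the output quoted above; real qstat output may differ between PBS versions.

```python
# Hypothetical field names for the eleven columns in the quoted header:
# Job ID, Username, Queue, Jobname, SessID, NDS, TSK, Memory, Time, S, Time
FIELDS = [
    "job_id", "username", "queue", "jobname", "sess_id",
    "nds", "tsk", "memory", "time_requested", "state", "time_elapsed",
]

EXAMPLE_ROW = "6400060.sdb ta460 R6385803 atmos_main -- 8 192 -- 00:02 H --"


def parse_qstat_row(line):
    """Split one job row into a dict keyed by the assumed column names."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def is_held(line):
    """Return True when the state column is 'H' (held)."""
    return parse_qstat_row(line)["state"] == "H"


if __name__ == "__main__":
    print(is_held(EXAMPLE_ROW))  # True
```

This only confirms what the row already shows, namely that the job is held; it says nothing about why the scheduler held it.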

checkQueue returns

6400060.sdb atmos_main: job held, too many failed attempts to run

but it's the first time I've seen this error.

Thanks for any advice!

Thomas

Change History (7)

comment:1 follow-up: Changed 14 months ago by dcase

I think that queue has a maximum walltime of 20 minutes. Does that affect you?

comment:2 in reply to: ↑ 1 Changed 14 months ago by taubry

No, my wallclock time request is 2 minutes. The number of processors also meets the short queue requirements, and I had no problem running in the short queue yesterday or even this morning.

comment:3 follow-up: Changed 14 months ago by dcase

Tom,

could you check the format of your walltime in the submission script? If you put walltime=00:02:00 that may be it?
If not, could you give me read access to your files, and I'll look at them? It's:

chmod -R g+rX /home/n02/n02/<your-username>
chmod -R g+rX /work/n02/n02/<your-username>

Dave

comment:4 in reply to: ↑ 3 Changed 14 months ago by taubry

Hi Dave,

Thanks! I set my wallclock time in rose to PT2M. It was previously PT19M, but I changed it to 2 minutes today because my total run length is 1H and I thought the short queue might have been particularly busy this afternoon, causing my problem.
I just gave you read access to my home (and it's still executing chmod for my work).
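
To make the relationship between the two notations in this exchange concrete (PT2M in rose versus walltime=00:02:00 in a submission script), here is a small Python sketch. It is only an illustration, not how Rose or Cylc perform the conversion internally. The function names are hypothetical, the parser handles only the hour, minute and second parts of an ISO 8601 duration, and the 20-minute limit is taken from comment:1 rather than from ARCHER documentation.

```python
import re

# Matches durations such as PT2M, PT19M, PT1H or PT1H30M (time part only).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")

# Assumption: short queue limit of 20 minutes, as suggested in comment:1.
SHORT_QUEUE_LIMIT_SECONDS = 20 * 60


def duration_to_seconds(text):
    """Convert a simple ISO 8601 duration such as 'PT2M' to seconds."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def to_walltime(text):
    """Format a duration as HH:MM:SS, the form used in walltime=00:02:00."""
    hours, remainder = divmod(duration_to_seconds(text), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def fits_short_queue(text):
    """Check a duration against the assumed 20-minute limit."""
    return duration_to_seconds(text) <= SHORT_QUEUE_LIMIT_SECONDS


if __name__ == "__main__":
    for value in ("PT2M", "PT19M", "PT1H"):
        print(value, to_walltime(value), fits_short_queue(value))
```

Both PT2M and PT19M fall under the assumed limit, which is consistent with the walltime request not being the cause of the hold.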

Thomas

comment:5 follow-up: Changed 14 months ago by dcase

Tom,

I've looked at your job, and things seem ok. I can only recommend the obvious things like killing the job and retrying (this does have a small chance of working as the short queue reservation is different today than yesterday). You could also try running on the standard queue as a sanity check.

In order to really solve the problem, someone would have to look at the detailed logs, but only the ARCHER team can do this. If you can't hack your way to things running, as per above or otherwise, give the job IDs to support@… and they can delve more deeply.

I'm sorry I can't see anything beyond this at the moment,
Dave

comment:6 in reply to: ↑ 5 Changed 14 months ago by taubry

Hi Dave,

OK, thanks! Lauren Marshall (user id eelrm) is having exactly the same problem this morning on a completely different suite (u-bk467) that was also running fine on the short queue before, so it really sounds like an ARCHER problem.

Thomas

comment:7 Changed 14 months ago by taubry

  • Resolution set to wontfix
  • Status changed from new to closed