Opened 5 years ago

Closed 5 years ago

#1542 closed help (fixed)

Jobs timing out on ARCHER

Reported by: dcw32 Owned by: um_support
Component: UM Model Keywords:
Cc: luke Platform: ARCHER
UM Version: 8.4

Description

Hi,

I've been having some trouble with my jobs running overnight on ARCHER - xlgad ran 2 months 22 days and xlgac ran 1 month 7 days before timing out after 8 hours, e.g.:

mkdir:: File exists
=>> PBS: job killed: walltime 28826 exceeded limit 28800
aprun: Apid 13602459: Caught signal Terminated, sending to application
-ksh: line 1: 16523: Terminated
/home/n02/n02/dcw32/umui_runs/xlgad-105072837/umuisubmit_run[347]: .: line 265: 16696: Terminated
Application 13602459 is crashing. ATP analysis proceeding...

see xlgac000.xlgac.d15105.t120833.leave and xlgad000.xlgad.d15105.t072849.leave on /home/n02/n02/dcw32/output. The output is in /work/n02/n02/dcw32/xlgac/ and /work/n02/n02/dcw32/xlgad/.
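For anyone reading the log above, the PBS message reports both the walltime used and the limit in seconds, so 28800 is the 8-hour limit. A minimal sketch of pulling those numbers out of a .leave file line follows; the helper names `parse_walltime_kill` and `hms` are hypothetical and not part of PBS or the UM scripts.

```python
import re

# Hypothetical helper: matches the PBS kill message as it appears in the log above.
PBS_KILL = re.compile(r"walltime (\d+) exceeded limit (\d+)")

def parse_walltime_kill(line):
    """Return (used_seconds, limit_seconds), or None if the line does not match."""
    match = PBS_KILL.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def hms(seconds):
    """Format a number of seconds as H:MM:SS."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"

line = "=>> PBS: job killed: walltime 28826 exceeded limit 28800"
used, limit = parse_walltime_kill(line)
print(hms(used), hms(limit))  # prints "8:00:26 8:00:00"
```
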

Both had been completing 3 months comfortably within this time for about a year, and both jobs are based on the vn8.4 UM-UKCA base job. I was wondering if anyone knows what might be causing this, or whether it's likely to be an ARCHER issue?

Many thanks,

David

Change History (14)

comment:1 Changed 5 years ago by karthee

We have similar issues in our high-resolution runs. Please turn on the "Subroutine Timer diagnostics" for your subsequent runs so that we can debug the issue.

Also do your runs do a lot of IO? If possible please test a trial run with no STASH.

comment:2 Changed 5 years ago by dcw32

There's a reasonable amount of monthly mean data, but no more daily data than the base job. I've sent 1 month jobs to the queue with the subroutine timer diagnostics, xlgad with STASH and xlgac with no STASH and I'll let you know how I get on.

Thanks,

David

comment:3 Changed 5 years ago by dcw32

So I don't think STASH is the issue: the run with no STASH actually ran slower, taking 5hr20 for a month vs 4hr41 with STASH.

The output files are /home/n02/n02/dcw32/output/xlgad000.xlgad.d15107.t163224.leave with STASH and /home/n02/n02/dcw32/output/xlgac000.xlgac.d15107.t162910.leave without STASH.
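As a rough sketch of what these timings mean for the 8-hour queue limit, the snippet below estimates how many model months fit in one job, assuming the per-month cost stays constant over the run. The function names are hypothetical, and the 8-hour default is taken from the walltime limit in the original log.

```python
def hm_to_hours(hours, minutes):
    """Convert an hours/minutes pair to fractional hours."""
    return hours + minutes / 60.0

def months_in_walltime(hours_per_month, walltime_hours=8.0):
    """Estimate model months completed within the walltime.

    Assumption: every model month costs the same wall-clock time.
    """
    return walltime_hours / hours_per_month

with_stash = months_in_walltime(hm_to_hours(4, 41))  # xlgad, with STASH
no_stash = months_in_walltime(hm_to_hours(5, 20))    # xlgac, without STASH
print(f"with STASH: {with_stash:.2f} months, no STASH: {no_stash:.2f} months")
```

At these rates neither job would get through 2 months in 8 hours, against the 3 months they used to manage.
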

comment:4 Changed 5 years ago by pliojop

Afternoon,

Apologies to jump in on this ticket, but I expect I am experiencing the same issues.

I have also noted the same problem with UM-UKCA at 8.4: two jobs that were running 3 months 1 week in about 8 hours up to and including April 15th have since managed only between 1 and 2.5 months in the same time, on every day since. No changes were made to these jobs over the course of last week.

I know of another user running UM 8.4 without UKCA who is also having issues with jobs not running for as long as they were, again starting the middle of last week.

Thanks

James

comment:5 Changed 5 years ago by ros

Hi James, David,

Thanks for the update. I did contact ARCHER at the end of last week but have yet to hear anything back from them. We will chase this up.

Regards,
Ros.

comment:6 Changed 5 years ago by lsim

Hi Ros,
Version 4.5 is also running at about one third of the usual rate.
Thanks,
Louise

comment:7 Changed 5 years ago by pliojop

Hi Ros,

I had hoped that the problem would have been resolved after the maintenance this week, but at present it appears not to have been. Has ARCHER been able to offer any insight on this matter?

Cheers

James

comment:8 Changed 5 years ago by jscreen

Hi all

I'm having similar problems running version 6.3.

Jobs that used to take 10-15 minutes per model month are now taking 4-5 hours per model month.

Hence jobs keep timing out, despite me upping the time allowance way above what I would normally use.

James

comment:9 Changed 5 years ago by michmcr

Hi Ros,

I too am finding that UM8.4 is running at about half its usual rate: normally it would take 18 hours to run 3 years, but it is now taking 24 hours to run 1 and a half years. Has there been any update from ARCHER on this matter?

Thanks
Michelle

comment:10 Changed 5 years ago by grenville

Hi all

These quotes from ARCHER:

"The Cray team informed us that there had been some disc failures in fs2 area (where n02 files are located) this morning and a parallel rebuild is ongoing, which will slow down the filesystem."

"I have heard back from the Cray team and they expect that this may be ongoing through Monday."

Grenville

comment:11 Changed 5 years ago by grenville

Hi all

I have run some tests on ARCHER which seem to indicate that the IO problems which have plagued us recently may have been fixed with the file system rebuild.

Please let us know if you still see problems.

Grenville

comment:12 Changed 5 years ago by dcw32

My job now completes within a time comparable to before the slowdown. It looks as though (at least for 8.4) everything is back to normal.

Many thanks to Grenville and Ros for your help with this!

David

comment:13 Changed 5 years ago by pliojop

Hi Grenville & Ros,

Overnight two of my jobs ran and resubmitted twice as well. Thanks for your assistance in this matter.

James

comment:14 Changed 5 years ago by ros

  • Resolution set to fixed
  • Status changed from new to closed