#2260 closed help (fixed)

Run exceeding time on ARCHER

Reported by: vanniere
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 10.3

Description

u-ap799 is a copy of an ncas-training simulation (u-ag137) in which a few STASH requests were changed, together with the number of nodes.
It was supposed to be a 16-month run. It stops after around 6 months by exceeding the 1-hour time limit, whereas all the previous cycles completed in 20 min.
It does not write any output during this last cycle. I tried restarting a couple of times, but it didn’t change anything.

On ARCHER:
/home/n02/n02/vanniere/work/cylc-run/u-ap799

u-ap936 is a copy of u-ap799 that was recompiled. Now the same problem as reported above occurs from the very first cycle.

On ARCHER:
/home/n02/n02/vanniere/work/cylc-run/u-ap936

Let me know if you need any additional information.
Thank you for your help.

Best wishes,
Benoit

Change History (3)

comment:1 Changed 22 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Benoit,

We are investigating this and are aware of another user experiencing slow-running jobs on ARCHER. There have been some weird things going on with ARCHER in the last day or so and we are in conversation with them.

At the moment I can only suggest increasing the walltime to allow you to continue with the run in the meantime.

Cheers,
Ros.

comment:2 Changed 22 months ago by vanniere

Hi Ros,

Grenville told me that the Cray team identified “a possible user running jobs which is causing a lot of i/o contention on fs2 (/work for n02)”, resulting in the poor performance on ARCHER.

It all seems to be sorted now, as my job is running OK.

Thank you for your help on that.
Cheers,
Benoit

comment:3 Changed 22 months ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed