Opened 4 years ago

Closed 4 years ago

#2159 closed help (fixed)

Failed jobs under CYLC on Archer

Reported by: apm Owned by: annette
Component: NEMO/CICE Keywords:
Cc: Platform: ARCHER


I have had a couple of NEMO jobs fail on Archer, but can't work out why they failed. One of them is suite u-ak984, and the log files are here:


I don't know how to interpret the job.err file; although the job.out seems to indicate that the model run has finished, I don't know why the resubmit failed.

The other is u_al390, and has failed in the same way (even though both jobs have completed more than a dozen automatic resubmissions).

Can you tell from the job.err why the job failed, and would you be able to advise on how to restart these jobs? One thing that may be relevant is that the failed restart happened after Archer came back online following the maintenance shutdown on Wednesday 19 April, and I had a previous job fail to restart correctly under the same circumstances when the system shut down two weeks earlier.



Change History (3)

comment:1 Changed 4 years ago by annette

  • Component changed from UM Model to NEMO/CICE
  • Owner changed from um_support to annette
  • Platform set to ARCHER
  • Status changed from new to assigned

Hi Alex,

It is strange that both suites failed at the same time. Have you run out of disk space? You should be able to check this in SAFE.

A core file was written but I can't see it - can you change the permissions please?

chmod g+r /work/n01/n01/alexm/cylc-run/u-ak984/work/19630401T0000Z/nemo_cice/core


comment:2 Changed 4 years ago by apm

Hi Annette,

I have changed the permissions on the core files for both runs.



comment:3 Changed 4 years ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed


I think you got past this (although I can't remember what the issue was in the end), so I am closing this ticket.

