Opened 3 years ago

Closed 2 years ago

#2030 closed help (answered)

ARCHER job got stuck in compilation and is running indefinitely

Reported by: Emre
Owned by: ros
Component: UKCA
Keywords:
Cc:
Platform:
UM Version: 8.4

Description

Dear Helpdesk

I wanted to bring to your attention a situation I have never encountered before. I submitted a job to ARCHER (6-month run length). It is a copy of an earlier job that ran successfully.

4082635.sdb emre serial xmufr_buil 36093 1 1 -- 02:00 R 13:21

The job seems to be taking too long to compile. Although it is well beyond the set upper limit, it is still running. I think part of the code got stuck somewhere and it is going to keep running indefinitely.

I ran "qdel" on the job, but it is still running and wasting resources. Could you please tell me what is wrong and how to terminate the compilation?

Is there any reason why the job just kept compiling? Oddly, a copy of the same job (xmufs) died immediately at the compilation stage, leaving the record
5166 Nov 30 06:22 xmufp000.xmufp.d16335.t062000.comp.leave
I don't understand this either.
Please advise.
Thanks,
Emre

Change History (4)

comment:1 Changed 3 years ago by Emre

  • Status changed from new to assigned

comment:2 Changed 3 years ago by grenville

Emre

We think this is the problem (if you didn't get the message below, please check with ARCHER that you are subscribed to the appropriate mailing lists):

ARCHER Serial PP Nodes issue



We are currently working on an issue with the ARCHER Serial post-processing nodes which is causing jobs running on them to fail.

In order to resolve this issue our team will be re-starting the PP nodes later today. Any jobs running when the nodes are re-started will fail (but it is likely that they would fail anyway due to the ongoing issue).

We apologise for the inconvenience caused and will let you know once the restart is completed and the nodes are returned to service.

The ARCHER Helpdesk Team
support@…

Grenville

comment:3 Changed 3 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from assigned to accepted

Hi Emre,

Now that the serial nodes have been rebooted, has this problem been resolved?

Cheers,
Ros.

comment:4 Changed 2 years ago by ros

  • Resolution set to answered
  • Status changed from accepted to closed

I assume this problem has now been resolved; please reopen the ticket if this is not the case.
Ticket being closed due to lack of activity.
