Opened 4 years ago

Closed 3 years ago

#2268 closed help (answered)

Job stalled. Still in queue. Unable to delete.

Reported by: toddj
Owned by: um_support
Component: MONC
Keywords:
Cc:
Platform: NEXCS


I submitted two MONC jobs to NEXCS last Friday, with requested times of 02:30 and 00:25. Both are still in the queue right now, listed as running. The model's status files, which have previously always contained some additional information when a job stops, simply stopped being written, and there are none of the typical model crash errors. I have also been unable to delete these jobs from the queue. I have not tried to force delete them, in case someone would like to look at them.

Delete attempt with Job IDs:

Mon Sep 11 09:48:01 2017
                                                                             Req'd    Req'd   Elapsed
Job ID          Username   Queue        Jobname         SessID    NDS  TSK   Memory   Time  S Time          Comment
--------------- ----------- ----------- --------------- --------- ---- ----- -------- ----- - --------      -------
5997097.xcs00   tojon      normal       MONCcont     42087    2    72   1536mb 02:30 R 00:52:59      Job run at Fri Sep 08 at 15:11 on (mom16:mem=1572864kb:ncpus=1)+(xcs_385:ncpus=36)+(xcs_394:ncpus=36)
5997109.xcs00   tojon      normal       MONCcont     --       2    72   1536mb 02:30 H    --
5998414.xcs00   tojon      normal       MONCcont     35576    4   144   1536mb 00:25 R 00:04:58      Job run at Fri Sep 08 at 15:58 on (mom16:mem=1572864kb:ncpus=1)+(xcs_1515:ncpus=36)+(xcs_1516:ncpus=36)+(xcs_1517:ncpus=36)+(xcs_1518:ncpus=36)
5999628.xcs00   tojon      normal       MONCcont     --       4   144   1536mb 00:25 H    --
tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5999628.xcs00
tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5997109.xcs00
tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5998414.xcs00
qdel: Unknown error 18446744073709551614 5998414.xcs00
tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5997097.xcs00
qdel: Unknown error 18446744073709551614 5997097.xcs00
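
A note on the number in the qdel message: 18446744073709551614 equals 2^64 - 2, which is what a return code of -2 looks like when printed as an unsigned 64-bit integer. The short Python sketch below only shows that arithmetic. The helper name `as_signed_64` is made up for illustration, and what a code of -2 actually means to the scheduler is not something this ticket establishes.

```python
# Reinterpret an unsigned 64-bit value as a two's-complement signed integer.
# as_signed_64 is a hypothetical helper, not part of any PBS tooling.
def as_signed_64(value: int) -> int:
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    # Values with the top bit set represent negative numbers.
    return value - 2**64 if value >= 2**63 else value


reported = 18446744073709551614  # copied from the qdel output above
print(as_signed_64(reported))
```

This suggests qdel received a small negative status rather than a meaningful error number, which fits the "Unknown error" wording, but it does not by itself say why the delete failed.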

Looking at all jobs, it appears that this has also happened to others around the same time. I'm simply at a loss for what occurred. I can't find any output errors. Please advise.

Change History (3)

comment:1 Changed 4 years ago by annette

Hi Todd,

This looks like an issue with the Monsoon system, so I will forward it to their helpdesk.


comment:2 Changed 4 years ago by willie

  • Status changed from new to pending

comment:3 Changed 3 years ago by willie

  • Resolution set to answered
  • Status changed from pending to closed

No further response from reporter.
