Opened 3 years ago
Closed 3 years ago
#2268 closed help (answered)
Job stalled. Still in queue. Unable to delete.
Reported by: | toddj | Owned by: | um_support |
---|---|---|---|
Component: | MONC | Keywords: | |
Cc: | | Platform: | NEXCS |
UM Version: | <select version> | | |
Description
I submitted two MONC jobs to NEXCS last Friday, with requested run times of 02:30 and 00:25. They are still in the queue right now, listed as running. The model's status files, which have previously always contained some additional information when a job stops, simply stopped being written, and there are none of the typical model crash errors. Further, I have been unable to delete these jobs from the queue. I have not tried a forced delete, in case someone would like to look at the jobs first.
Delete attempt with Job IDs:
Mon Sep 11 09:48:01 2017
                                                                             Req'd    Req'd   Elapsed
Job ID          Username    Queue       Jobname         SessID    NDS  TSK   Memory   Time  S Time     Comment
--------------- ----------- ----------- --------------- --------- ---- ----- -------- ----- - -------- -------
5997097.xcs00   tojon       normal      MONCcont        42087     2    72    1536mb   02:30 R 00:52:59
   Job run at Fri Sep 08 at 15:11 on (mom16:mem=1572864kb:ncpus=1)+(xcs_385:ncpus=36)+(xcs_394:ncpus=36)
5997109.xcs00   tojon       normal      MONCcont        --        2    72    1536mb   02:30 H --
5998414.xcs00   tojon       normal      MONCcont        35576     4    144   1536mb   00:25 R 00:04:58
   Job run at Fri Sep 08 at 15:58 on (mom16:mem=1572864kb:ncpus=1)+(xcs_1515:ncpus=36)+(xcs_1516:ncpus=36)+(xcs_1517:ncpus=36)+(xcs_1518:ncpus=36)
5999628.xcs00   tojon       normal      MONCcont        --        4    144   1536mb   00:25 H --

tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5999628.xcs00
tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5997109.xcs00
tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5998414.xcs00
qdel: Unknown error 18446744073709551614 5998414.xcs00
tojon@xcslc0:/projects/nexcs-n02/tojon/MONC/vn0.8_rce/diagnostic_files> qdel 5997097.xcs00
qdel: Unknown error 18446744073709551614 5997097.xcs00
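A side note on the number in the qdel message: 18446744073709551614 is exactly 2^64 - 2, so it may simply be a return code of -2 printed as an unsigned 64-bit integer. That reading is an assumption; nothing in this ticket says what the code means to the scheduler. A minimal Python sketch of the arithmetic (the variable names are illustrative only):

```python
# Number printed by qdel in the output above.
reported = 18446744073709551614

# Reinterpret it as a signed 64-bit (two's complement) integer.
# Assumption: the value is a negative return code shown as unsigned.
signed = reported - 2**64 if reported >= 2**63 else reported

print(signed)
```

If that reading is right, the large number carries no information beyond a small negative status, which fits the "Unknown error" wording.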
Looking at all jobs, it appears that this has also happened to others around the same time. I'm simply at a loss as to what occurred. I can't find any output errors. Please advise.
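For reference, the figures in the listing can be checked with a little arithmetic: job 5997097.xcs00 started on Fri Sep 08 at 15:11 with 02:30 requested, yet it is still in state R on Mon Sep 11 at 09:48. The sketch below is only an illustration using values copied from the listing; `is_overdue` is a hypothetical helper, not part of any scheduler tool, and treating "running well past the requested time" as a sign of a stalled job is an assumption.

```python
from datetime import datetime, timedelta

def is_overdue(start, now, requested):
    """Hypothetical check: has a job been 'running' longer than it asked for?"""
    return (now - start) > requested

# Values copied from the qstat listing for job 5997097.xcs00.
listing_time = datetime(2017, 9, 11, 9, 48, 1)  # "Mon Sep 11 09:48:01 2017"
job_start = datetime(2017, 9, 8, 15, 11)        # "Job run at Fri Sep 08 at 15:11"
requested = timedelta(hours=2, minutes=30)      # Req'd Time 02:30

wall_clock = listing_time - job_start
print(wall_clock)                                      # 2 days, 18:37:01
print(is_overdue(job_start, listing_time, requested))  # True
```

The job has been listed as running for more than two and a half days against a 02:30 request, which is consistent with the job having stalled rather than still doing work.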
Change History (3)
comment:1 Changed 3 years ago by annette
Hi Todd,
This looks like an issue with the Monsoon system, so I will forward to their helpdesk.
Annette
comment:2 Changed 3 years ago by willie
- Status changed from new to pending
comment:3 Changed 3 years ago by willie
- Resolution set to answered
- Status changed from pending to closed
No further response from reporter.