Opened 2 months ago

Closed 4 weeks ago

#3228 closed help (fixed)

Unending dehalo_and_move

Reported by: m.couldrey
Owned by: um_support
Component: Rose/Cylc
Keywords: dehalo, stuck job
Cc:
Platform: JASMIN
UM Version:

Description

Hi CMS

I set the suite u-bo201 running on the output from my GC3.1 experiment (suite u-bm978), and the dehalo_and_move task for one of the cycles (97) seems to have got stuck and is running endlessly. I haven't seen this before and I'm not sure what caused it. From the job.out, it looks as though the job hung and was killed, but not completely:
TERM_REMOVE_HUNG_JOB: hung job removed from the LSF system.
Exited

What would be the safest way to restart the job or check if anything is broken?

Thanks!
Matt

Change History (4)

comment:1 Changed 2 months ago by m.couldrey

Hi CMS

I just wanted to flag that I still haven't figured out the right way to fix this. I'm sure you're all busy given the upset state of things, and I recognise that this isn't a high priority item.

Cheers!
Matt

comment:2 Changed 2 months ago by jjabram

Hi Matt,

Sorry for the delayed response. I can't quite work out why the task may have hung.

There are a few steps to take to ensure a safe restart:

1) Make sure the cdds-holding directory for the offending cycle is empty. It is found at ~/cylc-run/<suite-name>/work/<cycle-no.>/cdds-holding. Move anything currently inside it elsewhere, so that content already dehaloed in that cycle is not overwritten.

2) Move anything already processed out of HALO_START_DIR and to one side. The suite shouldn't dehalo a file twice, but it is best to avoid the possibility.

3) Stop the suite and then restart it using $ rose suite-restart
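Step 1 above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name move_aside and the ".aside.<timestamp>" backup directory it creates are hypothetical and not part of the suite, and the placeholders in the path still have to be filled in by hand. It moves everything, so it is not suitable for step 2 as written, where only the already-processed files in HALO_START_DIR should be moved and need to be picked out manually.

```shell
#!/usr/bin/env bash
# Sketch only: move_aside and the ".aside.<timestamp>" naming are
# hypothetical helpers for illustration, not part of the suite.

# Move everything inside directory $1 into a new sibling directory
# "<dir>.aside.<timestamp>" and print that directory's path.
# If $1 is already empty, do nothing and print nothing.
move_aside() {
    local src="${1%/}"
    local dest
    if [ ! -d "$src" ]; then
        echo "move_aside: not a directory: $src" >&2
        return 1
    fi
    # Nothing to do if the directory is empty (hidden files included).
    if [ -z "$(ls -A "$src")" ]; then
        return 0
    fi
    dest="${src}.aside.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$dest" || return 1
    # -mindepth/-maxdepth 1 selects only the direct children of $src.
    find "$src" -mindepth 1 -maxdepth 1 -exec mv -- {} "$dest"/ \; || return 1
    echo "$dest"
}

# Example usage (fill in the <...> placeholders first):
#   move_aside ~/cylc-run/<suite-name>/work/<cycle-no.>/cdds-holding
#   ... stop the suite, then:
#   rose suite-restart
```

Moving the files to a sibling directory rather than deleting them keeps the partially dehaloed output available for inspection if anything turns out to be missing after the restart.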

I hope this helps, let me know if the problem persists.

Cheers,
Joe

comment:3 Changed 8 weeks ago by m.couldrey

Hey Joe

Thanks for getting in touch with the tips! My cylc gui looks like a glorious garden now! (I.e. full of green, which means everything is running OK!)

Cheers!
Matt

comment:4 Changed 4 weeks ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed