Opened 4 months ago

Closed 4 months ago

#3052 closed help (fixed)

dehalo failure

Reported by: m.couldrey
Owned by: um_support
Component: Rose/Cylc
Keywords: dehalo
Cc:
Platform: JASMIN
UM Version:

Description

Hi CMS

I've set up suite u-bo201 on jasmin-cylc to dehalo my suite u-bi909. The suite seems to have successfully operated on 9 cycles of output, but has failed on the 10th.
I had a look at the job.err and it looked like this:

Traceback (most recent call last):
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 439, in <module>
    main()
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 435, in main
    tmpfile_edit(out, options.geditor)
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 268, in tmpfile_edit
    proc = Popen(cmd, stderr=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1238, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

I then tried retriggering the task, hoping it was just one of those things, but no luck there.

The subsequent cycles are now all failing in the same way too. Any help would be much appreciated! Thanks!
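
For context, the traceback above ends inside cylc-cat-log's tmpfile_edit, at a Popen call. In Python, Popen raises OSError with errno 2 when the program it is asked to launch cannot be found, so this particular error may come from the log viewer failing to start an editor rather than from the dehalo task itself. A minimal sketch of that behaviour follows; the command name is made up for illustration and the helper try_launch is hypothetical, not part of Cylc:

```python
import errno
import subprocess


def try_launch(cmd):
    """Try to start cmd; return the errno if the launch itself fails, else None."""
    try:
        proc = subprocess.Popen(cmd, stderr=subprocess.PIPE)
    except OSError as exc:
        # Raised before any child process exists, e.g. executable not found.
        return exc.errno
    proc.communicate()
    return None


# Hypothetical editor name that should not exist on any system.
launch_error = try_launch(["hypothetical-editor-that-does-not-exist", "job.err"])
print(launch_error == errno.ENOENT)
```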

Change History (9)

comment:1 follow-up: Changed 4 months ago by jjabram

Hi Matt,

I'm just looking into this for you now. Is it the dehalo_and_move task that is failing?

Regards,

Joe

comment:2 in reply to: ↑ 1 Changed 4 months ago by m.couldrey

Thanks for looking! Yes, it's the dehalo_and_move task.

comment:3 Changed 4 months ago by jjabram

Hi Matt,

I've found the problem: the cycles aren't picking up the directories as intended.

I'm working on a fix for this now. I'll let you know as soon as I've got it sorted.

Regards

Joe

comment:4 Changed 4 months ago by jjabram

In the meantime, I would pause the suite if you haven't already. Don't stop it, though, as we should be able to restart it from cycle 10 once I've got it sorted.

Joe

comment:5 Changed 4 months ago by m.couldrey

Ah ok, thanks! I've held the suite for now.

comment:6 Changed 4 months ago by jjabram

Hi Matt,

In the dehalo suite, open the Python files for both the file_check script and the dehalo_and_move script:

$ vi u-bo201/app/dehalo_and_move/bin/remove_nemo_halo.py
and
$ vi u-bo201/app/file_check/bin/file_check.py

Both scripts have a line near the top:

CYCLE_NUM=int(os.environ['CYLC_TASK_JOB'][0])

which needs to be replaced with:

CYCLE_NUM=int(os.environ['CYLC_TASK_JOB'].split('/')[0])

Once you've done this, you should be able to run $ rose suite-run --reload and trigger each cycle again. Alternatively, if you move (mv) the 10 directories that have already been dehaloed out of u-bi909 into a side folder, you should be able to stop the suite completely and re-run it.
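
To illustrate why the original line only breaks from cycle 10 onwards, here is a minimal sketch. It assumes CYLC_TASK_JOB has the form cycle/task_name/submit_number (for example 10/dehalo_and_move/01); the function names and sample values are made up for the illustration and are not taken from the suite:

```python
def cycle_num_original(task_job):
    # Indexing a string with [0] returns only its first character.
    return int(task_job[0])


def cycle_num_fixed(task_job):
    # Splitting on '/' returns the whole cycle field, however many digits it has.
    return int(task_job.split('/')[0])


# Assumed CYLC_TASK_JOB values for a single-digit and two multi-digit cycles.
for cycle in (9, 10, 25):
    task_job = "%d/dehalo_and_move/01" % cycle
    print(task_job, cycle_num_original(task_job), cycle_num_fixed(task_job))
```

Under that assumption, cycles 1 to 9 give the same answer either way, while cycle 10 is read as 1 by the original line, which matches the suite working for 9 cycles and then failing.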

Regards,

Joe

comment:7 Changed 4 months ago by m.couldrey

Hi Joe

The job is still running as of 10:00 today, and it looks like it's still ticking over happily.

Interestingly, on re-reading your message, I see you say that both remove_nemo_halo.py and file_check.py needed editing. We only changed remove_nemo_halo.py yesterday, yet the job seems to be running all right. Do you think it's worth making the same change to file_check.py as well?

Thanks for finding this one so quickly!

Matt

comment:8 Changed 4 months ago by jjabram

Hi Matt,

That's good to hear, I've been keeping an eye on it myself as well.

The file_check task is used to stop new cycles opening once there are no more directories to dehalo, but it isn't otherwise linked to the dehalo task. So the suite should perform the dehalo process as intended, but the unedited file_check task will not stop new cycles from opening.

You will probably need to manually stop the suite once the last directory is complete, but it shouldn't cause an issue.

Regards,

Joe

comment:9 Changed 4 months ago by jjabram

  • Resolution set to fixed
  • Status changed from new to closed