Opened 7 months ago

Closed 7 months ago

Last modified 7 months ago

#2747 closed help (fixed)

Jobs stuck at 'submit-failed' u-bc158, u-be496, u-be497

Reported by: cwc46 Owned by: ros
Component: Rose/Cylc Keywords:
Cc: Platform: ARCHER
UM Version: 11.0

Description

Hi there,

I have 3 running jobs, u-bc158, u-be496 and u-be497.

They failed over the weekend and this was due to the ssh-agent resetting/n02-chem running out of resources. However even after fixing that, I am unable to restart the jobs, and they are stuck at 'submit-failed' or 'submit-retrying'.

Thank you!

Best wishes,
Glen

Attachments (1)

be497.png (28.6 KB) - added by cwc46 7 months ago.
be497

Change History (23)

comment:1 Changed 7 months ago by ros

Hi Glen,

Have you tried re-triggering the stuck tasks, i.e. right-click on the task → Trigger (run now)?

rose suite-run --restart will just restart the suite from the state it stopped in; i.e. any task that had failed or got stuck will still show up as stuck until you manually tell it what you want to do.

Cheers,
Ros.
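
For reference, the GUI action above also has a command-line counterpart. The following is a minimal sketch, assuming Cylc 7 syntax (cylc trigger SUITE TASK.CYCLE). The helper names run and retrigger are hypothetical, and the suite, task and cycle are supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: assumes Cylc 7-style task IDs of the form NAME.CYCLE.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Re-trigger one stuck task in a suite that is already running.
# Arguments: suite name, task name, cycle point.
retrigger() {
    local suite=$1 task=$2 cycle=$3
    run cylc trigger "$suite" "${task}.${cycle}"
}
```

Calling it with DRY_RUN=1 first, e.g. DRY_RUN=1 retrigger u-be497 postproc 20000501T0000Z, shows the command without sending anything to the suite.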

comment:2 Changed 7 months ago by grenville

Glen
n02-chem has just been topped up.

Grenville

comment:3 Changed 7 months ago by cwc46

Dear Ros, Grenville,

Thanks for your help!

comment:4 Changed 7 months ago by cwc46

Dear Ros,

Somehow my jobs are still stuck at submit-retrying. I'm trying out the suggestions in #2447, but it's still not working.

comment:5 Changed 7 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Glen,

We'll just concentrate on one suite to start with: u-be496

Looking at the log files for suite u-be496, cycle 1990901T0000Z: atmos_main, housekeeping and postproc all say they have completed successfully. In log/suite/err you will see that it couldn't submit the next tasks due to problems with rose host-select archer. So the instructions in the ticket you mention should indeed fix it, namely:

Try running rose host-select archer on the PUMA command line. If it lists failed logins please try "ssh"ing to the failed nodes (e.g. ssh <username>@login3.archer.ac.uk) and follow any instructions.

If the suite still fails to submit with the same error message, then the easiest thing to do is replace

host = $(rose host-select {{ archer }})

with

host = login.archer.ac.uk

in the site/archer.rc file

Have you done all of this for u-be496?

Regards,
Ros.

comment:6 Changed 7 months ago by cwc46

Dear Ros,

Thanks for the advice.

u-be496 is running now, and u-bc158 is finished; u-be497 is still not running and I have followed the instructions as per your comment.

I have changed the host to host = login.archer.ac.uk in the site/archer.rc file, but now I get this error message when I run rose suite-run --restart:

[FAIL] bash -ec H=$(rose\ host-select\ archer);\ echo\ $H # return-code=1, stderr=
[FAIL] [WARN] login5.archer.ac.uk: (ssh failed)
[FAIL] [WARN] login3.archer.ac.uk: (timed out)
[FAIL] [WARN] login4.archer.ac.uk: (timed out)
[FAIL] [WARN] login6.archer.ac.uk: (ssh failed)
[FAIL] [WARN] login8.archer.ac.uk: (ssh failed)
[FAIL] [WARN] login.archer.ac.uk: (timed out)
[FAIL] [WARN] login2.archer.ac.uk: (ssh failed)
[FAIL] [WARN] login1.archer.ac.uk: (ssh failed)
[FAIL] [WARN] login7.archer.ac.uk: (ssh failed)
[FAIL] [FAIL] No hosts selected.

Thank you,
Glen
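
The warning lines above can be reduced to a list of hosts to try by hand. This is a minimal sketch, assuming the warnings keep the "[WARN] host: (reason)" shape shown in the error output. The function names failed_hosts and suggest_ssh are hypothetical, and the ARCHER username is passed in by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: parses text of the form "[WARN] <host>: (<reason>)".

# Read host-select output on stdin and print one failed host per line.
failed_hosts() {
    sed -n 's/.*\[WARN] \([^: ]*\): (.*)$/\1/p'
}

# Read host names on stdin and print an ssh command for each,
# so every login node can be tried manually.
suggest_ssh() {
    local user=$1 host
    while read -r host; do
        echo "ssh ${user}@${host}"
    done
}
```

A possible use on PUMA, assuming the warnings are written to stderr, would be: rose host-select archer 2>&1 | failed_hosts | suggest_ssh <username>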

comment:7 Changed 7 months ago by ros

Hi Glen,

For u-be497 in site/archer.rc you have unfortunately changed the wrong line.

You will need to revert the [[LINUX]] section back to:

   [[LINUX]]
        [[[environment]]]
            PLATFORM = linux
            UMDIR = ~um
        [[[job]]]
            batch system = background
        [[[remote]]]
            host = {{ROSE_ORIG_HOST}}

Then change the host = $(rose host-select archer) line in the [[HPC]] section.

Cheers,
Ros.
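
One way to avoid editing the wrong host line is to script the change so that only a line containing the rose host-select call is rewritten. This is a minimal sketch, assuming the line has the form "host = $(rose host-select ...)". The function name pin_hpc_host is hypothetical, and a .bak copy of the file is kept.

```shell
#!/usr/bin/env bash
# Sketch only: rewrites "host = $(rose host-select ...)" lines and
# leaves every other host line (e.g. {{ROSE_ORIG_HOST}}) untouched.

# Argument: path to the rc file, e.g. site/archer.rc.
pin_hpc_host() {
    local file=$1
    # -i.bak edits in place and keeps the original as <file>.bak
    sed -i.bak \
        's|^\([[:space:]]*host[[:space:]]*=[[:space:]]*\)\$(rose host-select .*)[[:space:]]*$|\1login.archer.ac.uk|' \
        "$file"
}
```

Comparing the file with its .bak copy (diff site/archer.rc.bak site/archer.rc) before restarting the suite confirms that only the [[HPC]] host line changed.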

comment:8 Changed 7 months ago by cwc46

Dear Ros,

Ah, thank you, it is running now.

Best wishes,
Glen

comment:9 Changed 7 months ago by cwc46

Dear Ros,

Sorry for another note. Both u-be496 and u-be497 are failing at postproc now, and the error message is this:

tail: cannot open `/home/cwc46/cylc-run/u-be497/log/job/20000501T0000Z/postproc/01/job-activity.log' for reading: No such file or directory

Thank you for your help!

Best wishes,
Glen

comment:10 Changed 7 months ago by ros

Hi Glen,

You have exceeded your PUMA disk quota. There are error messages in the suite/err file for u-be497.

Log files in cylc-run/u-bc158/log/job/19880901T0000Z and 19881001T0000Z are taking up over a third of your PUMA disk space. You can safely remove the log files (particularly job.out) from under the atmos_main directory in both of these. They are also available on ARCHER anyway.

Log files can eat away large chunks of quota so you will need to keep on top of them.

Regards,
Ros.
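
To keep on top of log usage, the largest per-cycle log directories can be listed before deciding what to delete. This is a minimal sketch, assuming the cylc-run/SUITE/log/job/CYCLE/TASK/NN layout seen in the paths in this ticket. The function names largest_job_logs and job_out_files are hypothetical, and nothing is deleted.

```shell
#!/usr/bin/env bash
# Sketch only: reports disk usage, it does not remove anything.

# List the biggest per-cycle job log directories (size in KB, largest first).
# Arguments: cylc-run root (default ~/cylc-run), number of lines (default 10).
largest_job_logs() {
    local root=${1:-$HOME/cylc-run} count=${2:-10}
    du -sk "$root"/*/log/job/* 2>/dev/null | sort -rn | head -n "$count"
}

# List the job.out files under atmos_main for one cycle directory,
# so they can be reviewed before being removed by hand.
job_out_files() {
    find "$1" -path '*/atmos_main/*' -name job.out -type f
}
```

The output of job_out_files can be checked first and then passed to rm once it is clear that the same logs are still available on ARCHER.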

comment:11 Changed 7 months ago by cwc46

Dear Ros,

Ah I see, thanks for the advice. I have deleted these log files but somehow the postproc is still not working; it is stuck at 'submit-failed'.

Best wishes,
Glen

comment:12 Changed 7 months ago by ros

Hi Glen,

To save me hunting through loads of log files, can you please tell me which suite this is and which cycle is the one with the stuck postproc?

Cheers,
Ros.

comment:13 Changed 7 months ago by cwc46

Dear Ros,

Thank you - the suites and cycle are:

u-be497, at 20000501T0000Z - RUN_MAIN - postproc

u-be496, at 20000701T0000Z - RUN_MAIN - postproc

u-bf405, at 19880901T0000Z - RUN_MAIN - fcm_make2_pp

Best wishes,
Glen

comment:14 Changed 7 months ago by ros

Thanks.

The log files for u-be497 indicate that postproc for 20000501 hasn't even attempted to run yet, as the suite still thinks atmos_main is running. Please make sure that the atmos_main task for this cycle is showing as succeeded. If not, manually poll that task, and if that doesn't work, manually set its status to succeeded. Then the postproc will run for this cycle, or you can trigger it to run if it doesn't start automatically.

u-bf405 hasn't been run at all since I told you about the disk space issue so you need to retrigger the stuck tasks on that one.

u-be496 looks to be in a bit of a pickle so I'll need to look at that some more.

Cheers,
Ros.
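
The poll and reset steps described above can also be done from the command line. This is a minimal sketch, assuming Cylc 7 syntax (cylc poll and cylc reset --state=STATE, both taking SUITE TASK.CYCLE). The helper names run, poll_task and mark_succeeded are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes Cylc 7-style commands and NAME.CYCLE task IDs.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ask the suite to poll the real status of a task.
poll_task() {
    local suite=$1 task=$2 cycle=$3
    run cylc poll "$suite" "${task}.${cycle}"
}

# Force a task's state to succeeded; only use this once the job logs
# confirm that the task really finished.
mark_succeeded() {
    local suite=$1 task=$2 cycle=$3
    run cylc reset --state=succeeded "$suite" "${task}.${cycle}"
}
```

Polling first is the safer order: the reset is only needed if the poll does not bring the task to succeeded.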

Changed 7 months ago by cwc46

be497

comment:15 Changed 7 months ago by cwc46

Dear Ros,

Thanks for your help - I have attached a screenshot of what I'm seeing for u-be497.

atmos_main is 'succeeded' but I can't seem to do anything for postproc at the moment.

Best wishes,
Glen

comment:16 Changed 7 months ago by ros

Hi Glen,

The u-be497 GUI hasn't updated at all after the attempt to retrigger the task, and the lack of new information in the log file indicates the suite is not responding at all. So all I can suggest is to stop the suite, restart it and then try retriggering the postproc task again. See if that kicks it into action.

Cheers,
Ros

comment:17 Changed 7 months ago by cwc46

Dear Ros,

Ah, thanks, yes u-be497 is running now. Thanks!

Best wishes,
Glen

comment:18 Changed 7 months ago by cwc46

Dear Ros,

Sorry to bother again - I was wondering if you had time to look at u-be496?

Also, I am running into this error in my other job, u-bf405:
[FAIL] /home/cwc46/cylc-run/u-bf405/share/fcm_make_um/fcm-make.lock: lock exists at the destination

This is failing at fcm_make_um.

Thank you,
Glen

comment:19 Changed 7 months ago by ros

Hi Glen,

Not had time to look at u-be496 yet. You could try doing the same as you did for u-be497. Stop, restart and re-trigger the stuck/failed tasks.

u-bf405 - Remove the lock directory named in the error message (on PUMA) and retrigger the task.

Cheers,
Ros.
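
The lock removal can be scripted with a couple of safety checks. This is a minimal sketch, assuming the lock is an empty directory named fcm-make.lock and that no fcm make is still running for the suite. The function names lock_path_from_error and remove_stale_lock are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: rmdir is used so that a non-empty or unexpected
# path is left alone rather than deleted.

# Read the "[FAIL] <path>: lock exists at the destination" message
# on stdin and print the lock path.
lock_path_from_error() {
    sed -n 's/^\[FAIL] \(.*fcm-make\.lock\): lock exists at the destination$/\1/p'
}

# Remove a stale lock directory. Argument: path ending in fcm-make.lock.
remove_stale_lock() {
    local lock=$1
    case "$lock" in
        */fcm-make.lock) ;;
        *) echo "refusing to remove unexpected path: $lock" >&2; return 1 ;;
    esac
    if [ -d "$lock" ]; then
        rmdir "$lock"
    else
        echo "no lock directory at $lock" >&2
        return 1
    fi
}
```

After the lock is gone, the failed task still needs to be retriggered as described above.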

comment:20 Changed 7 months ago by cwc46

Dear Ros,

Thanks, u-be496 is working now too, and u-bf405 is unlocked after deleting the lock directory. Thank you for your help!

Best wishes,
Glen

comment:21 Changed 7 months ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Thanks for letting me know. I shall close this query now.

Cheers,
Ros

comment:22 Changed 7 months ago by ros

  • Component changed from UM Model to Rose/Cylc