#2447 closed help (answered)

problem of "submit-failed"

Reported by: jfgu Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: PUMA
UM Version: 11.0

Description

Dear CMS helpdesk,

Since last Friday I have had a problem submitting my suite to ARCHER. It stops when the suite starts to compile the model. The GUI shows "submit-failed", but there is no error information.

I have stopped the suite and tried to resubmit it several times. Most of the time it stops at the compile step. Sometimes it succeeds in compiling the model, but then fails to submit the next task without giving any useful information.

Does anyone know what the problem is? I need to get it running today. Thank you very much!

By the way, my suite is u-ax173, which I upgraded from a UM10.9 suite to run UM11.0.

Best regards

Jian-Feng

Change History (4)

comment:1 Changed 17 months ago by ros

Hi Jian-Feng,

The suite error file (log/suite/err) shows that the suite is having problems connecting to ARCHER with rose host-select. Sometimes it is not able to find an ok login node for some reason.

Please can you try running rose host-select on the PUMA command line. If it lists failed logins please try "ssh"ing to the failed nodes (e.g. ssh <username>@login3.archer.ac.uk) and follow any instructions.

If the suite still fails to submit with the same error message, then the easiest thing to do is replace

host = $(rose host-select {{ HPC_HOST }})

with

host = login.archer.ac.uk

in the suite.rc file
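
If you would rather script the change than edit by hand, here is a minimal sketch using sed. It works on a scratch file holding just the line in question so it is safe to try; pointing it at your real suite.rc instead is up to you, and the indentation of the line in your suite may differ.

```shell
# Scratch copy containing the line to be replaced (stand-in for suite.rc).
tmp=$(mktemp)
printf '%s\n' '    host = $(rose host-select {{ HPC_HOST }})' > "$tmp"

# Swap the rose host-select call for the fixed login host.
# -i.bak edits in place and keeps the original as "$tmp.bak".
sed -i.bak 's|host = \$(rose host-select {{ HPC_HOST }})|host = login.archer.ac.uk|' "$tmp"

# Show the result.
cat "$tmp"
```

The backup file lets you restore the rose host-select line once the login nodes behave again.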

Cheers,
Ros.

comment:2 Changed 17 months ago by jfgu

Hi Ros,

I tried running rose host-select on PUMA, but it fails with

[FAIL] No (default) hosts specified.

I then replaced host = $(rose host-select {{ HPC_HOST }}) with host = login.archer.ac.uk, and the model compiled successfully. However, the suite again fails to submit when it starts the reconfiguration. The suite error file says:

2018-04-23T09:46:09Z ERROR - [job-submit cmd] cylc jobs-submit --host=login.archer.ac.uk --remote-mode -- '$HOME/cylc-run/u-ax173/log/job' 10000101T0000Z/UM_recon/01
        [job-submit ret_code] 191
        [job-submit out] 2018-04-23T10:46:08+01|10000101T0000Z/UM_recon/01|191|None
2018-04-23T09:46:09Z ERROR - [UM_recon.10000101T0000Z] -submission failed

The host is correct, so I am not sure what the problem is now.

Regards
Jian-Feng

comment:3 Changed 17 months ago by ros

Sorry, I missed the host off. It should have been:

rose host-select archer

The full error message is in log/job/10000101T0000Z/UM_recon/01/job-activity.log
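
For example, something like the following should print it on PUMA. The suite, cycle, task and submit number are taken from the job-submit error you posted, and the $HOME/cylc-run prefix is the one shown in that same message; adjust them for other tasks.

```shell
# Path pieces taken from the job-submit error in the suite error file.
suite=u-ax173
cycle=10000101T0000Z
task=UM_recon
submit=01
log="$HOME/cylc-run/$suite/log/job/$cycle/$task/$submit/job-activity.log"

# Print the log if it exists, otherwise say where we looked.
if [ -f "$log" ]; then
    cat "$log"
else
    echo "no job-activity.log at $log"
fi
```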

You are trying to submit it to the short queue but are requesting more than 20 minutes, which is not allowed. You will need to either change the queue to standard or reduce the wall clock to 20 minutes or less.
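
As a rough illustration of the rule, here is a small sketch that checks a requested wall clock against the 20-minute limit. It assumes the wall clock is written as HH:MM:SS; the function names are made up for this example and are not part of Rose, Cylc or the batch system.

```shell
# Hypothetical helper: convert an HH:MM:SS wall-clock string to seconds.
walltime_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Assumed limit: the 20 minutes quoted for the short queue.
SHORT_QUEUE_LIMIT=$((20 * 60))

# Succeeds (returns 0) if the request fits within the short queue limit.
fits_short_queue() {
    [ "$(walltime_seconds "$1")" -le "$SHORT_QUEUE_LIMIT" ]
}

# Example request of 30 minutes.
if fits_short_queue "00:30:00"; then
    echo "ok for the short queue"
else
    echo "too long for the short queue: use standard or reduce the wall clock"
fi
```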

Cheers,
Ros.

comment:4 Changed 16 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed