Opened 2 months ago

Closed 2 months ago

#2719 closed help (fixed)

Unable to submit run

Reported by: anmcr
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 10.4

Description

Hello,

I had a vn10.4 version of the nested suite, called u-ag300, running a couple of years ago on Monsoon (pre Monsoon 2). I would like to re-run it, but am unable to submit the run. I get the errors 'RosePopenError' and 'No hosts selected'. See the attachment for more information. I would be grateful for some advice.

Many thanks,

Andrew

Attachments (2)

ncas_ticket.PNG (128.1 KB) - added by anmcr 2 months ago.
for_willie.PNG (118.1 KB) - added by anmcr 2 months ago.

Change History (12)

Changed 2 months ago by anmcr

comment:1 Changed 2 months ago by willie

Hi Andrew,

The clue is in the first part of the error message: SSH has failed because it does not recognise the computer 'xcf'.

In your site/monsoon-cray-xc40/suite-adds.rc file, change 'xcf' in the first line to 'xcs' and try again.

Willie

comment:2 Changed 2 months ago by anmcr

Dear Willie,

Thanks for the reply and looking at the problem.

I made the change you suggested, but I'm still getting the 'No hosts selected' error when I submit. See attachment.

Best wishes,

Andrew

Changed 2 months ago by anmcr

comment:3 Changed 2 months ago by willie

Hi Andrew,

The first problem has been solved. This is a new problem of the same type: it does not understand what 'linux' is. If you type

rose host-select linux

on exvmsrose you will see the error.

Looking at your site/monsoon-cray-xc40/suite-adds.rc you will see

{% set IDL_SERVER = "linux" %}

near the top. I think you can just change linux to postproc and then it will find the computer.
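Rather than meeting these one error at a time, it may help to list every simple Jinja2 assignment in the file and try each host-like value by hand with rose host-select. The following is only a rough Python sketch, not part of Rose: the function name `jinja_set_values` is made up, `HPC_HOST` in the sample is a hypothetical variable name (the real name on the first line of your file may differ), and only quoted-string assignments of the form shown above are handled.

```python
import re

# Matches simple Jinja2 assignments such as: {% set IDL_SERVER = "linux" %}
SET_RE = re.compile(r'\{%-?\s*set\s+(\w+)\s*=\s*["\']([^"\']*)["\']\s*-?%\}')


def jinja_set_values(text):
    """Return {name: value} for quoted-string 'set' statements in text."""
    return dict(SET_RE.findall(text))


# Illustrative sample only; HPC_HOST is a hypothetical variable name.
sample = '''{% set HPC_HOST = "xcs" %}
{% set IDL_SERVER = "linux" %}
'''

for name, value in jinja_set_values(sample).items():
    # Each value can then be checked by hand with: rose host-select <value>
    print(name, value)
```

To use it on the real file, read site/monsoon-cray-xc40/suite-adds.rc into a string and pass that string in place of `sample`.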

Willie

comment:4 Changed 2 months ago by anmcr

Hi again Willie,

Thank you again for looking at this issue, and resolving it.

I'm afraid that I've met another hitch, in the reconfiguration stage of the global model, that has me stumped. The STDOUT file is here: /home/d01/amworr/cylc-run/u-ag300/log/job/20110118T0000Z/glm_um_recon/02/job.out. The STDERR file is here: /home/d01/amworr/cylc-run/u-ag300/log/job/20110118T0000Z/glm_um_recon/02/job.err.

You will see that there is almost no information as to why this step has failed. The only clue is 'line 80: /home/d01/amworr/cylc-run/u-ag300/share/data/etc/ancil_versions_dm: No such file or directory'. I have never come across this problem, and couldn't find it in any previous tickets.

I would be grateful if you could please advise.

Many thanks,

Andrew

comment:5 Changed 2 months ago by willie

Hi Andrew,

Your ancil_versions_dm file is missing. This is because it is a symbolic link whose target no longer exists:

 /home/d01/amworr/cylc-run/u-ag300/share/data/etc/ancil_versions_dm -> /home/swebst/CAP/ancil_versions/n512e/GA6.0/latest/ancils

Stuart's user name was changed from swebst to hadsw, so you will find this file under /home/d03/hadsw.
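Other links in an old suite may point at the old user name as well. As a rough sketch (plain Python standard library, nothing specific to Rose or the UM), something like the following would list every symbolic link under a directory whose target is missing. The function name `broken_links` is made up, and the `~/cylc-run/u-ag300/share/data/etc` path assumes your home directory is the one in the error message.

```python
import os


def broken_links(root):
    """Yield (link, target) for symlinks under root whose target is missing."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows the link,
            # so a dangling link is a link for which exists() is False.
            if os.path.islink(path) and not os.path.exists(path):
                yield path, os.readlink(path)


# Assumed location; adjust to the suite's own share/data/etc directory.
root = os.path.expanduser("~/cylc-run/u-ag300/share/data/etc")
for link, target in broken_links(root):
    print(link, "->", target)
```

If the directory does not exist, the loop simply prints nothing.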

Willie

comment:6 Changed 2 months ago by anmcr

Hi Willie,

Thanks for fixing this.

I have one small remaining problem. The run fails when it tries to archive the files, with the error 'attempted to archive a zero-length file', which refers to a file ending in 'pc000.pp'. See the error output below. I don't know how this file is being produced, as I am not archiving any files with 'pc000'. I was in a previous run, so I ran 'rose suite-run --new' in order to have a clean start, but the problem persists.

Are you able to please advise?

Thanks,

Andrew

This computer is provided for the processing of official information.
Unauthorised access described in Met Office SyOps may constitute a criminal offence.
All activity on the system is liable to monitoring.
[FAIL] moo put -f /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pvera000.pp /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pb000.pp /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pverb000.pp /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pverd000.pp /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pverc000.pp /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pa000.pp /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pc000.pp moose:/devfc/u-be146/field.pp/ # return-code=11, stderr=
[FAIL] /home/d01/amworr/cylc-run/u-be146/work/20140701T0000Z/AntarcticCORDEX_0p44deg_ga6_archive/tmpQETzXU/20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pc000.pp: (ERROR_CLIENT_ZERO_LENGTH_FILE) attempted to archive a zero-length file.
[FAIL] put: failed (11)
[FAIL] ! moose:/devfc/u-be146/field.pp/ [compress=None, t(init)=2019-01-16T08:55:40Z, dt(tran)=1s, dt(arch)=3s, ret-code=11]
[FAIL] ! 20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pa000.pp (umnsaa_pa000)
[FAIL] ! 20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pb000.pp (umnsaa_pb000)
[FAIL] ! 20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pc000.pp (umnsaa_pc000)
[FAIL] ! 20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pvera000.pp (umnsaa_pvera000)
[FAIL] ! 20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pverb000.pp (umnsaa_pverb000)
[FAIL] ! 20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pverc000.pp (umnsaa_pverc000)
[FAIL] ! 20140701T0000Z_AntarcticCORDEX_0p44deg_ga6_pverd000.pp (umnsaa_pverd000)
2019-01-16T08:55:46Z CRITICAL - failed/EXIT

comment:7 Changed 2 months ago by willie

Hi Andrew,

If you look at the file

20140701T0000Z/AntarcticCORDEX/0p44deg/ga6/um/umnsaa_pc000

in xconv, you will see it has no data. This will cause the pp conversion to fail. Did you run for long enough to get output?
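Since moo refuses zero-length files, a quick check of the directory before the archive step can show which file is the culprit. This is only a sketch using the Python standard library: the function name `zero_length_files` is made up, and it only catches files of exactly zero bytes. A file that has a header but no fields would still need to be inspected in xconv.

```python
import os


def zero_length_files(directory):
    """Return sorted names of regular files in directory that are 0 bytes."""
    empty = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.stat().st_size == 0:
            empty.append(entry.name)
    return sorted(empty)


# Example: check the current directory; pass the output directory instead.
print(zero_length_files("."))
```

Any name it reports that does not belong to the current output streams is a candidate for a leftover from an earlier run.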

Willie

comment:8 Changed 2 months ago by anmcr

Hi Willie,

Thanks for the reply.

What I don't understand is how the 'umnsaa_pc000' file is being produced. In my usage profile I am only using '60_diags' and '61_diags', which are associated with 'pp0' and 'pp1', which in the model output streams refer to 'umnsaa_pa000' and 'umnsaa_pb000'.

Best wishes,

Andrew

comment:9 Changed 2 months ago by anmcr

Dear Willie,

I just deleted the file, and it archived properly. The 'umnsaa_pc000' file was left over from an earlier run.

Please close this ticket, and thanks again for your help.

Andrew

comment:10 Changed 2 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed