Opened 3 months ago

Closed 4 weeks ago

#3018 closed help (fixed)

UKESM1 postproc errors/issues

Reported by: luke Owned by: um_support
Component: UKESM Keywords: postproc
Cc: Platform: Monsoon2
UM Version: 11.2

Description

Hello,

I have 3 UKESM1 simulations running on Monsoon2 (u-bl561, u-bm009, & u-bm020), each of which is having issues with the postproc_ tasks at runtime, usually the postproc_nemo_* ones, but occasionally postproc_atmos.

Sometimes the tasks fall into the orange "retrying" state. When that happens I need to right-click -> reset state -> waiting (sometimes several times) before the task will run successfully. Occasionally the task will fail with errors like, e.g.

[WARN] file:atmospp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:pptransfer
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:script_arch
[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  [SUBPROCESS]: Command: moo mkset -v -p project-ukca --single-copy moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  mkset: System error (Error=13)
Permission denied
	 Unable to create set:moose:crum/u-bm020
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270601-20270701_diad-T.nc moose:crum/u-bm020/onm.nc.file
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  moo.py: Unknown Error - Return Code =13
[WARN]  Failed to archive file: /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270601-20270701_diad-T.nc. Will try again later.
[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  [SUBPROCESS]: Command: moo mkset -v -p project-ukca --single-copy moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  mkset: System error (Error=13)
Permission denied
	 Unable to create set:moose:crum/u-bm020
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270701-20270801_diad-T.nc moose:crum/u-bm020/onm.nc.file
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  moo.py: Unknown Error - Return Code =13
[WARN]  Failed to archive file: /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270701-20270801_diad-T.nc. Will try again later.
[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  [SUBPROCESS]: Command: moo mkset -v -p project-ukca --single-copy moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  mkset: System error (Error=13)
Permission denied
	 Unable to create set:moose:crum/u-bm020
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270801-20270901_diad-T.nc moose:crum/u-bm020/onm.nc.file
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  moo.py: Unknown Error - Return Code =13
[WARN]  Failed to archive file: /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270801-20270901_diad-T.nc. Will try again later.
[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  [SUBPROCESS]: Command: moo mkset -v -p project-ukca --single-copy moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  mkset: System error (Error=13)
Permission denied
	 Unable to create set:moose:crum/u-bm020
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1d_20270401-20270701_ptrc-T.nc moose:crum/u-bm020/ond.nc.file
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  moo.py: Unknown Error - Return Code =13
[WARN]  Failed to archive file: /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1d_20270401-20270701_ptrc-T.nc. Will try again later.
[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  [SUBPROCESS]: Command: moo mkset -v -p project-ukca --single-copy moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  mkset: System error (Error=13)
Permission denied
	 Unable to create set:moose:crum/u-bm020
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270601-20270701_ptrc-T.nc moose:crum/u-bm020/onm.nc.file
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  moo.py: Unknown Error - Return Code =13
[WARN]  Failed to archive file: /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270601-20270701_ptrc-T.nc. Will try again later.
[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  [SUBPROCESS]: Command: moo mkset -v -p project-ukca --single-copy moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  mkset: System error (Error=13)
Permission denied
	 Unable to create set:moose:crum/u-bm020
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270701-20270801_ptrc-T.nc moose:crum/u-bm020/onm.nc.file
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  moo.py: Unknown Error - Return Code =13
[WARN]  Failed to archive file: /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270701-20270801_ptrc-T.nc. Will try again later.
[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  [SUBPROCESS]: Command: moo mkset -v -p project-ukca --single-copy moose:crum/u-bm020
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  mkset: System error (Error=13)
Permission denied
	 Unable to create set:moose:crum/u-bm020
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270801-20270901_ptrc-T.nc moose:crum/u-bm020/onm.nc.file
[SUBPROCESS]: Error = 13:
	Permission denied
[WARN]  moo.py: Unknown Error - Return Code =13
[WARN]  Failed to archive file: /home/d00/hadlk/cylc-run/u-bm020/share/data/History_Data/NEMOhist/archive_ready/medusa_bm020o_1m_20270801-20270901_ptrc-T.nc. Will try again later.
[FAIL]  main_pp.py - PostProc complete. Exiting with errors in nemo_archive
[FAIL] Terminating PostProc...
[FAIL] main_pp.py nemo # return-code=1
2019-09-19T12:29:41Z CRITICAL - failed/EXIT

for a postproc_nemo_ptrc task. The MASS errors seem to be the same though, whatever the task.
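
To see at a glance which moo subcommands are failing and with which code, a log like the one above can be summarised with a short script. This is only a sketch: the function name summarise_moo_errors is made up for illustration, and it assumes the log keeps the two-line pattern shown above (a "Command: moo ..." line followed by an "Error = N:" line).

```python
import re
from collections import Counter

# Matches lines such as "[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-bm020"
COMMAND_RE = re.compile(r"\[SUBPROCESS\]: Command: moo (\S+)")
# Matches lines such as "[SUBPROCESS]: Error = 13:"
ERROR_RE = re.compile(r"^\[SUBPROCESS\]: Error = (\d+):")


def summarise_moo_errors(log_text):
    """Count (moo subcommand, error code) pairs found in a postproc log."""
    counts = Counter()
    last_command = None
    for line in log_text.splitlines():
        line = line.strip()
        command = COMMAND_RE.search(line)
        if command:
            # Remember the subcommand until we see its error line.
            last_command = command.group(1)
            continue
        error = ERROR_RE.match(line)
        if error and last_command is not None:
            counts[(last_command, int(error.group(1)))] += 1
            last_command = None
    return counts
```

Called on the text of a saved log, this returns a Counter keyed by pairs such as ("put", 13), which makes it easy to compare failed and successful runs of the same task.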

Eventually I am able to get these to succeed, but only by repeatedly resetting the state to waiting.

This dataset has already been created on MASS, though, and other postproc tasks have succeeded at around the same times as these have failed.

Any and all advice as to how to prevent these failures would be gratefully received, as they are slowing these jobs down significantly and making them very labour-intensive to run, since I must monitor them constantly.

Many thanks and best wishes,
Luke

Change History (5)

comment:1 Changed 3 months ago by willie

Hi Luke,

You're missing a slash, I think:

moo test -sw moose:/crum/u-bm020

Willie

comment:2 Changed 3 months ago by luke

Hi Willie,

Thanks for this. The first / doesn't actually matter, e.g.

[13:14:54 hadlk@xcslc0 ~]$ moo test -sw moose:/crum/u-bm020
true
[13:14:57 hadlk@xcslc0 ~]$ moo test -sw moose:crum/u-bm020
true

Also, this command is inside one of the postproc scripts (which I didn't write), so if the missing slash were the problem it would fail consistently. For me it fails intermittently. Last week nothing seemed to get the jobs to work, at the start of this week there were no problems, and then they started playing up again yesterday or so.

Thanks,
Luke

comment:3 Changed 3 months ago by luke

I've just emailed Monsoon to ask if they have any advice.

Thanks,
Luke

comment:4 Changed 3 months ago by luke

All my suites just stopped running. I am restarting them.

comment:5 Changed 4 weeks ago by luke

  • Resolution set to fixed
  • Status changed from new to closed

After extensive discussion with the Met Office MASS, CRUM, & HPC teams, it was determined that the problem was with shared nodes 91 & 92 - see communication from Roger Milton:

Various members of the HPC Team have done a bit of digging, and as a result have taken shared nodes 91 and 92 out of circulation, as the suspicion is that these shared nodes have an issue.

If the cause is due to these particular nodes, that would explain the intermittent nature of what you have been seeing (they’re only allocated work after the others have been filled), and you shouldn’t see any more.

Once these nodes were taken out of action, the postproc tasks continued as expected without any further incident.
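
For anyone who sees a similar intermittent pattern, one way to check whether failures follow particular nodes is to record the hostname alongside the return code each time a command is run. The following is a minimal sketch rather than part of postproc: the probe function and its parameters are made up for illustration, and it assumes the command signals failure through a non-zero return code, as the Error = 13 lines in the log above suggest for moo.

```python
import socket
import subprocess
import time


def probe(command, attempts=3, delay=1.0):
    """Run `command` several times, returning a list of (hostname, return code)."""
    host = socket.gethostname()
    results = []
    for attempt in range(attempts):
        # Capture output so that only the return code is reported.
        completed = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        results.append((host, completed.returncode))
        if attempt < attempts - 1:
            time.sleep(delay)
    return results

# Example call, using the command from the log above:
#   probe(["moo", "test", "-sw", "moose:crum/u-bm020"])
```

This is only informative if it runs on the same shared nodes that the postproc tasks are allocated to. Collecting the results from several nodes would show whether the non-zero return codes cluster on a few hostnames.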
