Opened 4 years ago

Closed 4 years ago

#2128 closed help (fixed)

postproc execution timeout

Reported by: mattjbr123
Owned by: ros
Component: UM Model
Keywords: postproc retry
Cc:
Platform: Monsoon2
UM Version: 10.3


Hi (again),

The postproc tasks of my suite u-ak617 are consistently hitting the 3hr timeout barrier. On the xcm this used to happen occasionally, and once the tasks retried they usually completed in time, so I'm not sure what's changed. As far as I can tell the tasks are running ok, just not completing in time: there is nothing else helpful in the job.out/err files, and there are lots of pumf_out_xxxx folders in work/cycletime/postproc_xxxxxxxxx. However, looking on moose in moose:ens/u-ak617/postproc_r01xi1p00000 shows no pp files have been archived.

My guess is that it is running out of time doing the conversion of the file format that happens before the archive step in the postproc script, and that maybe it is starting from the beginning each time it retries rather than picking up where the previous task left off, but I'm not certain - would you be able to shed any light?

If there's some workaround for this that would be useful, otherwise I may just have to cut out some of the probably unnecessary STASH outputs. The suite uses postproc version 1.0 with the following branches:


Change History (7)

comment:1 Changed 4 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Matt,

In the job.err files there is the error message:

[ERROR]  mkset: System error (Error=3)
         Unable to create set:moose:ens/u-ak617
[ERROR] Moose Error: error in Moose or its supporting systems (storage, database etc.). (ReturnCode=3) File: /home/d04/mabro/cylc-run/u-ak617/share/data/History_Data_r013i1p00000/ak617-r013i1p00000a.pa20081018

Annoyingly it's not writing out the mkset command it's trying to run and helpfully not terminating at that point either.

When you run on XCS:

moo ls moose:ens/u-ak617

What do you get? I wasn't quite sure from your description above whether it lists as empty or you get an error.
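In case the error is not in the most recent attempt's log, here is a rough sketch for scanning the job.err of every attempt at once. The function name scan_job_errs is made up for illustration, and the default path assumes the usual cylc job-log layout (log/job/<cycle>/<task>/<attempt>/job.err) under ~/cylc-run/u-ak617, so adjust it if your logs live elsewhere.

```shell
# Hypothetical helper: list every job.err under a cylc job-log tree that
# mentions a Moose or mkset error.
# Assumption: logs follow log/job/<cycle>/<task>/<attempt>/job.err.
scan_job_errs() {
    log_root="${1:-$HOME/cylc-run/u-ak617/log/job}"
    # -r: recurse, -l: print matching file names only, -E: extended regex
    grep -rlE --include='job.err' 'Moose Error|mkset:' "$log_root" | sort
}
```

Calling scan_job_errs with no argument searches the default location; passing a directory restricts the search to that tree, e.g. a single cycle or task.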


comment:2 Changed 4 years ago by mattjbr123

Oh ok - didn't see that in the cylc gui but that's possibly because I only looked at the most recent task attempt.

That is odd, as the set moo:ens/u-ak617 already exists; I created it earlier so that the 'rose_arch_astart' task would run successfully. Therefore the output of

moo ls moose:ens/u-ak617



each of these entries only has an ada.file collection(?) underneath it, e.g. running

moo ls -l moose:ens/u-ak617/r010i1p00000


C sarah.ineson                0.04 GBP       1000042496 2017-03-22 22:37:02 GMT moose:/ens/u-ak617/r010i1p00000/ada.file

Maybe the problem lies in it being unable to create the collections, e.g. moose:ens/u-ak617/r010i1p00000/apa.pp etc.?

The rose_arch_astart task was unable to create the moose:ens/u-ak617 set, so I had to create it manually. It didn't, however, have a problem creating the ada.file collections under each ensemble 'ripcode' (r01xi1p00000) to store the astart files.


comment:3 Changed 4 years ago by mattjbr123

Do you have any ideas regarding this? I haven't had much time to look at it lately but will hopefully in the next couple weeks…


comment:4 Changed 4 years ago by ros

Hi Matt,

Sorry for the delay, I need to do some more investigation. Could you confirm whether the moo:ens/u-ak617 set was created from the xcm or xcs-c?


comment:5 Changed 4 years ago by mattjbr123

Not at all - thanks for looking into it! I ran

moo mkset moose:ens/u-ak617 -p=project-solar

from the xcs-c manually (not as part of a suite/task). This fixed the problem with the rose_arch_astart task.


comment:6 Changed 4 years ago by ros

Hi Matt,

Still no luck yet I'm afraid. Just getting the Met Office to check if they can see anything odd at their end.


comment:7 Changed 4 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Offline conversations with the Met Office confirmed that all permissions are set up correctly at their end.

Eventually noticed that the postproc task was being submitted to the compute nodes rather than the shared nodes.

The fix is to edit site/MONSooN.rc so that the [[POSTPROC_RESOURCE]] family inherits from HPC_SERIAL rather than HPC.
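A minimal sketch of what that change might look like, assuming the family sits in the usual cylc [runtime] section of the file. Any other settings already defined under the family are not shown here and would stay as they are:

```ini
[runtime]
    [[POSTPROC_RESOURCE]]
        # Was "inherit = HPC", which sent postproc to the compute nodes.
        # HPC_SERIAL is the family named in this ticket for the shared nodes.
        inherit = HPC_SERIAL
```

The suite will need reloading or restarting for the new inheritance to take effect on subsequent postproc submissions.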
