#2128 closed help (fixed)

postproc execution timeout

Reported by: mattjbr123 Owned by: ros
Priority: normal Component: UM Model
Keywords: postproc retry Cc:
Platform: Monsoon2 UM Version: 10.3

Description

Hi (again),

The postproc tasks of my suite u-ak617 are consistently hitting the 3hr timeout barrier. On the xcm this used to happen occasionally, and once the tasks retried they usually completed in time, so I'm not sure what's changed. As far as I can tell the tasks are running OK, just not completing in time, with nothing else helpful in the job.out/err files, and lots of pumf_out_xxxx folders in work/cycletime/postproc_xxxxxxxxx. However, looking on moose in moose:ens/u-ak617/postproc_r01xi1p00000 shows no pp files have been archived.

My guess is that it is running out of time doing the conversion of the file format that happens before the archive step in the postproc script, and that maybe it is starting from the beginning each time it retries rather than picking up where the previous task left off, but I'm not certain - would you be able to shed any light?

If there's some workaround for this that would be useful, otherwise I may just have to cut out some of the probably unnecessary STASH outputs. The suite uses postproc version 1.0 with the following branches:
fcm:moci.xm_br/dev/malcolmroberts/postproc_1.0_extra_output_ppstreams@521
fcm:moci.xm_br/dev/ericaneininger/postproc_1.0_archiving_ens_dataclass@507

Cheers,
Matt

Change History (7)

comment:1 Changed 11 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Matt,

In the job.err files there is the error message:

[ERROR]  mkset: System error (Error=3)
         Unable to create set:moose:ens/u-ak617
[ERROR]  moo.py: Moose Error: error in Moose or its supporting systems (storage, database etc.). (ReturnCode=3) File: /home/d04/mabro/cylc-run/u-ak617/share/data/History_Data_r013i1p00000/ak617-r013i1p00000a.pa20081018

Annoyingly, it's not writing out the mkset command it's trying to run, and unhelpfully it's not terminating at that point either.

When you run on XCS:

moo ls moose:ens/u-ak617

What do you get? I wasn't quite sure from your description above whether it lists as empty or you get an error.

Cheers,
Ros.

comment:2 Changed 11 months ago by mattjbr123

Oh ok - didn't see that in the cylc gui but that's possibly because I only looked at the most recent task attempt.
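
For anyone checking the same thing, here is a rough sketch of searching the job.err of every attempt rather than only the latest one. It assumes the cylc log layout log/job/<cycle>/<task>/<attempt>/job.err; the function name find_job_errors and the example arguments are made up for illustration.

```shell
# Hypothetical helper: print [ERROR] lines from every attempt's job.err.
# Assumes the layout <suite_dir>/log/job/<cycle>/<task>/<attempt>/job.err.
find_job_errors() {
    suite_dir="$1"      # e.g. ~/cylc-run/u-ak617
    task_pattern="$2"   # glob for task names, e.g. 'postproc*'
    # [0-9]* matches the numbered attempt directories only, so any
    # "latest attempt" alias directory is not reported twice.
    # -H prefixes each hit with its file name; -F treats [ERROR] literally.
    grep -H -F '[ERROR]' \
        "$suite_dir"/log/job/*/$task_pattern/[0-9]*/job.err 2>/dev/null
}
```

Something like find_job_errors ~/cylc-run/u-ak617 'postproc*' would then show which attempt each error came from.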

That is odd, as the set moose:ens/u-ak617 already exists. I created it earlier so that the 'rose_arch_astart' task would run successfully. Therefore the output of

moo ls moose:ens/u-ak617

is

moose:/ens/u-ak617/r010i1p00000
moose:/ens/u-ak617/r011i1p00000
moose:/ens/u-ak617/r012i1p00000
moose:/ens/u-ak617/r013i1p00000
moose:/ens/u-ak617/r014i1p00000

Each of these entries only has an ada.file collection(?) underneath it, e.g. running

moo ls -l moose:ens/u-ak617/r010i1p00000

gets

C sarah.ineson                0.04 GBP       1000042496 2017-03-22 22:37:02 GMT moose:/ens/u-ak617/r010i1p00000/ada.file

Maybe the problem lies in it being unable to create the collections, e.g. moose:ens/u-ak617/r010i1p00000/apa.pp etc.?

The rose_arch_astart task was unable to create the moose:ens/u-ak617 set, so I had to create it manually, but the task then had no problem creating the ada.file collections under each ensemble 'ripcode' (r01xi1p00000) to store the astart files.

comment:3 Changed 11 months ago by mattjbr123

Do you have any ideas regarding this? I haven't had much time to look at it lately but will hopefully in the next couple weeks…

comment:4 Changed 11 months ago by ros

Hi Matt,

Sorry for the delay, I need to do some more investigation. Could you confirm whether the set moose:ens/u-ak617 was created from the xcm or xcs-c?

Cheers,
Ros.

comment:5 Changed 11 months ago by mattjbr123

Not at all - thanks for looking into it! I ran

moo mkset moose:ens/u-ak617 -p=project-solar

from the xcs-c manually (not as part of a suite/task). This fixed the problem with the rose_arch_astart task.

comment:6 Changed 11 months ago by ros

Hi Matt,

Still no luck yet I'm afraid. Just getting the Met Office to check if they can see anything odd at their end.

Cheers,
Ros.

comment:7 Changed 11 months ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Offline conversations with the Met Office confirmed that all permissions are set up correctly at their end.

Eventually noticed that the postproc task was being submitted to the compute nodes rather than the shared nodes.

The fix is to change site/MONSooN.rc so that the [[POSTPROC_RESOURCE]] family inherits HPC_SERIAL rather than HPC.
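
As a rough sketch, the edited family might look something like the fragment below. Only the family names come from this ticket; the indentation and the assumption that inherit previously listed just HPC are guesses, and any other settings already in the family should be left as they are.

```
    [[POSTPROC_RESOURCE]]
        # Assumed previous setting: inherit = HPC (compute nodes)
        inherit = HPC_SERIAL
```

With this change the postproc tasks should pick up the serial (shared node) job settings rather than the compute node ones.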
