postproc jobs lag substantially behind model run cycles

Hi, my current suites u-ar155, u-ar286 and u-ar292 seem to be suffering from delayed postproc runs. I have checked the MONSooN Yammer messages etc. but found no downtime or other reason for delayed copying of model output to MASS, which I assumed would be the most likely cause of a delay.
I am frequently confused because gcylc consistently shows the jobs as running and qstat on xcs shows R status, yet in reality the jobs no longer seem to be running. Is this possibly related to the earlier disk issues with the cylc server, mentioned e.g. in ticket #2310?
Many thanks,

comment:1 Changed 17 months ago by marcus

Hi, postproc for u-ar286 has now failed with the following error, relating to um-pumf, that I do not quite understand:

[WARN]  [SUBPROCESS]: Command: /projects/um1/vn10.6/xc40/utilities/um-pumf -h /home/d03/makoe/cylc-run/u-ar286/log/job/19790101T0000Z/postproc/09/job-pumfhead.out /home/d03/makoe/cylc-run/u-ar286/share/data/History_Data/ar286a.pa1978dec
[SUBPROCESS]: Error = 1:
	[INFO] File(1): /home/d03/makoe/cylc-run/u-ar286/share/data/History_Data/ar286a.pa1978dec
[WARN] Using default STASHmaster as none provided "/projects/um1/vn10.6/ctldata/STASHmaster".
[INFO] Using script: /projects/um1/vn10.6/xc40/utilities/um-pumf
[INFO] Using executable: /projects/um1/vn10.6/xc40/utilities/um-pumf.exe
/projects/um1/vn10.6/xc40/utilities/um-pumf: line 198: 72097 Aborted                 (core dumped) $pumf_exec > $PUMF_OUT 2>&1
[INFO] Header output in:   /home/d03/makoe/cylc-run/u-ar286/log/job/19790101T0000Z/postproc/09/job-pumfhead.out
[INFO] Field output in:    /home/d03/makoe/cylc-run/u-ar286/work/19790101T0000Z/postproc/pumf_out_MEyr/pumf_field
[FAIL] Problem with PUMF program

[ERROR]  pumf: Failed to extract header information from file /home/d03/makoe/cylc-run/u-ar286/share/data/History_Data/ar286a.pa1978dec
[FAIL]  Command Terminated
[FAIL] Terminating PostProc...
[FAIL] main_pp.py atmos # return-code=1
2017-11-17T09:37:41Z CRITICAL - Task job script received signal EXIT

What could I do to fix this, please?

Many thanks,

comment:2 Changed 17 months ago by willie

Hi Marcus,

The file


is corrupted: if you look at it in xconv you get the error "WGDOS data header record mismatch". This is probably why pumf fails. I think this is related to the "Updraught mass flux" field, though my analysis tools also fail on the file.

So you should check the logs of the run tasks that produced this file and see whether any errors were reported.
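
One way to do that check is to scan the suite's job logs for error markers. The following is only a minimal sketch: it assumes the logs are the usual cylc `job.out` and `job.err` files under the suite's `log/job` directory, and the pattern is simply the set of markers visible in the postproc log quoted above. The function names are made up for illustration. A corrupted field may well have been written without any logged error, so an empty result does not prove the run was clean.

```python
import re
from pathlib import Path

# Markers taken from the postproc log quoted in this ticket; this is an
# illustrative list, not an exhaustive set of UM or cylc error strings.
ERROR_PATTERN = re.compile(r"\[(FAIL|ERROR)\]|CRITICAL|Aborted|core dumped")


def find_error_lines(text):
    """Return (line_number, line) pairs for lines matching ERROR_PATTERN."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if ERROR_PATTERN.search(line)
    ]


def scan_job_logs(log_root, names=("job.out", "job.err")):
    """Scan job logs below log_root and map each file path to its error lines.

    The file names are an assumption about the log layout; files without
    any matching line are left out of the result.
    """
    hits = {}
    for path in sorted(Path(log_root).rglob("job.*")):
        if path.name not in names or not path.is_file():
            continue
        found = find_error_lines(path.read_text(errors="replace"))
        if found:
            hits[str(path)] = found
    return hits
```

For example, `scan_job_logs(Path.home() / "cylc-run" / "u-ar286" / "log" / "job")` would return a dictionary keyed by log file, which can then be printed or narrowed down to the cycle that wrote the corrupted file.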


comment:3 Changed 17 months ago by marcus

Thank you, Willie. I am not sure how this could have happened. I have stopped this run now and will continue from an earlier dump when I resume this experiment.
Many thanks, Marcus

comment:4 Changed 17 months ago by willie

  Resolution set to fixed
  Status changed from new to closed
