#1744 answered v7.3 UKCA job seg-faulting at start-up with "Back-end never delivered its pid" atpAppSigHandler error um_support gmann

I am having a problem with my v7.3 UM-UKCA job, which is failing with an error I have not seen before. It seems to compile and reconfigure OK, but then crashes (I think immediately) with a segmentation fault and dumps a core. The error message is shown below.

The two jobs affected are xlypk and xlypn; both seem to fail in the same way.

They are very similar: both use an updated codebase and additional diagnostics.

The job xlypt runs OK — see the leave file xlypt000.xlypt.d15320.t162348.leave in directory:


The error is below.

Has this error message been encountered before?

If you could give me any pointers as to what the likely cause of the problem is that would be much appreciated.

Many thanks for your help,

Cheers Graham

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 03257] [c0-2c2s14n1] [Thu Nov 26 10:15:10 2015] PE RANK 50 exit signal Segmentation fault
[NID 03257] 2015-11-26 10:15:11 Apid 18838319: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 03262] [c0-2c2s15n2] [Thu Nov 26 10:15:10 2015] PE RANK 153 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03263] [c0-2c2s15n3] [Thu Nov 26 10:15:10 2015] PE RANK 185 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03255] [c0-2c2s13n3] [Thu Nov 26 10:15:10 2015] PE RANK 6 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03259] [c0-2c2s14n3] [Thu Nov 26 10:15:10 2015] PE RANK 98 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03261] [c0-2c2s15n1] [Thu Nov 26 10:15:10 2015] PE RANK 122 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03258] [c0-2c2s14n2] [Thu Nov 26 10:15:10 2015] PE RANK 74 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03256] [c0-2c2s14n0] [Thu Nov 26 10:15:10 2015] PE RANK 28 exit signal Segmentation fault
xlypk: Run failed *

Ending script : qsexecute
Completion code : 139
Completion time : Thu Nov 26 10:15:18 GMT 2015


/work/n02/n02/gmann/um/xlypk/bin/qsmaster: Failed in qsexecute in model xlypk *

Starting script : qsfinal
Starting time : Thu Nov 26 10:15:30 GMT 2015


/work/n02/n02/gmann/um/xlypk/bin/qsfinal: Error in exit processing after model run
Failed in model executable

/work/n02/n02/gmann/um/xlypk/bin/qsfinal: Model xlypk - Error: No history files *

Ending script : qsfinal
Completion code : 135
Completion time : Thu Nov 26 10:15:31 GMT 2015
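
For anyone triaging a similar log, here is a minimal Python sketch for summarising it. It lists the ranks and nodes that reported a segmentation fault and decodes a completion code such as the 139 from qsexecute. It assumes the completion code follows the common shell convention of 128 plus the signal number. The function names and the shortened `LOG` sample are illustrative only and are not part of the UM scripts.

```python
import re
import signal

# Shortened sample taken from the log above (illustrative subset only).
LOG = """\
_pmiu_daemon(SIGCHLD): [NID 03257] [c0-2c2s14n1] [Thu Nov 26 10:15:10 2015] PE RANK 50 exit signal Segmentation fault
[NID 03257] 2015-11-26 10:15:11 Apid 18838319: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 03262] [c0-2c2s15n2] [Thu Nov 26 10:15:10 2015] PE RANK 153 exit signal Segmentation fault
"""

# Matches only the per-rank segfault reports, not the "Apid ..." line.
SEGV = re.compile(
    r"\[NID (\d+)\] \[\S+\] \[[^\]]+\] PE RANK (\d+) "
    r"exit signal Segmentation fault"
)


def segfaulting_ranks(text):
    """Return sorted (rank, node id) pairs for ranks that segfaulted."""
    return sorted((int(rank), nid) for nid, rank in SEGV.findall(text))


def signal_from_exit_code(code):
    """Decode an exit code, ASSUMING the shell convention 128 + signal.

    Returns the signal name, or None if the code does not map to one.
    """
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None


if __name__ == "__main__":
    for rank, nid in segfaulting_ranks(LOG):
        print(f"rank {rank} on node {nid} segfaulted")
    print("completion code 139 ->", signal_from_exit_code(139))
```

Under that assumption, 139 is consistent with the "Segmentation fault" messages (128 + 11, SIGSEGV). Because the segfaults are spread over many ranks and nodes at the same timestamp, the log alone does not single out one task.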


#716 fixed v6.6.3 CRUN isn't queued ros iamack


I'm trying my first run of UM v6.6.3 on HECToR, job xgogk. As far as I can tell the NRUN has completed successfully. I then edit the SUBMIT script on PUMA to set TYPE=CRUN, STEP=4 and RCF_NEW_EXEC=false, and resubmit via the UMUI without processing. The submission appears to be successful and I am told that the output will appear in a named .leave file. However, on HECToR the submission seems to fail immediately, with no job being queued and no .leave file appearing. Can you suggest where the problem might lie?
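
As a quick sanity check on a hand edit like the one described, here is a minimal Python sketch that reports which of the three settings in some script text differ from the CRUN values named above. It assumes the settings appear as plain `NAME=value` lines, optionally prefixed by `export`; the real SUBMIT script may be laid out differently. The function names and sample text are hypothetical.

```python
import re

# Values named in the ticket for switching from an NRUN to a CRUN.
EXPECTED = {"TYPE": "CRUN", "STEP": "4", "RCF_NEW_EXEC": "false"}

# ASSUMPTION: settings are simple NAME=value lines, maybe with "export".
ASSIGNMENT = re.compile(r"\s*(?:export\s+)?([A-Z_][A-Z0-9_]*)=(\S*)")


def read_settings(text):
    """Collect NAME=value assignments; a later assignment wins."""
    settings = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            settings[match.group(1)] = match.group(2).strip("'\"")
    return settings


def mismatches(text, expected=EXPECTED):
    """Return {name: found value} for settings that differ or are missing.

    A missing setting is reported with the value None.
    """
    found = read_settings(text)
    return {
        name: found.get(name)
        for name, value in expected.items()
        if found.get(name) != value
    }


if __name__ == "__main__":
    # Hypothetical script text in which TYPE was not changed.
    sample = "TYPE=NRUN\nSTEP=4\nRCF_NEW_EXEC=false\n"
    print(mismatches(sample))
```

An empty result only confirms that the three values are set as intended. It says nothing about why the job was not queued.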


#2665 fixed v10.9 AMIP run um_support admg26


MonSOON 2 vn 10.9 ukca suite: u-bb701

I had a run that was working last week. I changed the run length and now fcm_make_um is failing with:

[FAIL] extract.location{primary}[um] = fcm:um.xm: cannot modify, value is inherited
[FAIL] config-file=/projects/nexcs-n02/almin/cylc-run/u-bb701/work/18500901T0000Z/fcm_make_um/fcm-make.cfg:2
[FAIL] config-file= - file:///home/d04/fcm/srv/svn/um.xm/main/trunk/fcm-make/meto-xc40-cce/um-atmos-safe.cfg@45850:10
[FAIL] config-file= -  - file:///home/d04/fcm/srv/svn/um.xm/main/trunk/fcm-make/inc/um-atmos-common.cfg@45850:29

[FAIL] fcm make -f /projects/nexcs-n02/almin/cylc-run/u-bb701/work/18500901T0000Z/fcm_make_um/fcm-make.cfg -C /var/spool/jtmp/9265061.xcs00.FRxKkG/fcm_make_um.18500901T0000Z.u-bb701Gyr3Nw -j 6 --archive # return-code=9
2018-11-05T13:09:28Z CRITICAL - failed/EXI

I found this ticket, which suggests removing the prebuild. Doing so causes a different error.

Cheers, Alison

