Opened 5 years ago
Closed 4 years ago
#1744 closed help (answered)
v7.3 UKCA job seg-faulting at start-up with "Back-end never delivered its pid" atpAppSigHandler error
Reported by: | gmann | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | ukca |
Cc: | | Platform: | ARCHER |
UM Version: | 7.3 | | |
Description
I am having a problem with my v7.3 UM-UKCA job failing with an error I have not seen
before. It seems to compile and reconfigure OK but crashes (I think immediately)
with a seg-fault, dumping a core and giving the error message
shown below.
The two jobs are xlypk and xlypn; both seem to be suffering the same failure.
These two jobs are very similar; they use an updated codebase and additional diagnostics.
The job xlypt runs OK; see the leave file xlypt000.xlypt.d15320.t162348.leave in directory:
/work/n02/n02/gmann/UM_output_Files_11Nov2015_to_20Nov2015/
The error is below.
Has this error message been encountered before?
If you could give me any pointers as to the likely cause of the problem, that would be much appreciated.
Many thanks for your help,
Cheers
Graham
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 03257] [c0-2c2s14n1] [Thu Nov 26 10:15:10 2015] PE RANK 50 exit signal Segmentation fault
[NID 03257] 2015-11-26 10:15:11 Apid 18838319: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 03262] [c0-2c2s15n2] [Thu Nov 26 10:15:10 2015] PE RANK 153 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03263] [c0-2c2s15n3] [Thu Nov 26 10:15:10 2015] PE RANK 185 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03255] [c0-2c2s13n3] [Thu Nov 26 10:15:10 2015] PE RANK 6 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03259] [c0-2c2s14n3] [Thu Nov 26 10:15:10 2015] PE RANK 98 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03261] [c0-2c2s15n1] [Thu Nov 26 10:15:10 2015] PE RANK 122 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03258] [c0-2c2s14n2] [Thu Nov 26 10:15:10 2015] PE RANK 74 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03256] [c0-2c2s14n0] [Thu Nov 26 10:15:10 2015] PE RANK 28 exit signal Segmentation fault
xlypk: Run failed
*
Ending script : qsexecute
Completion code : 139
Completion time : Thu Nov 26 10:15:18 GMT 2015
*
/work/n02/n02/gmann/um/xlypk/bin/qsmaster: Failed in qsexecute in model xlypk
*
Starting script : qsfinal
Starting time : Thu Nov 26 10:15:30 GMT 2015
*
/work/n02/n02/gmann/um/xlypk/bin/qsfinal: Error in exit processing after model run
Failed in model executable
/work/n02/n02/gmann/um/xlypk/bin/qsfinal: Model xlypk - Error: No history files
*
Ending script : qsfinal
Completion code : 135
Completion time : Thu Nov 26 10:15:31 GMT 2015
*
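A minimal shell sketch for summarising a log like the one above. It assumes the `_pmiu_daemon(SIGCHLD)` lines keep exactly the layout shown in this ticket and that the qsexecute completion code follows the usual shell convention of 128 plus the signal number. The function names `list_segv_ranks` and `exit_signal` are made up for illustration and are not part of the UM scripts.

```shell
#!/bin/sh
# list_segv_ranks (hypothetical helper): read a leave file on stdin and
# print "RANK NID" for every PE that exited with a segmentation fault,
# sorted numerically by rank.
list_segv_ranks() {
    sed -n 's/.*\[NID \([0-9]*\)\].*PE RANK \([0-9]*\) exit signal Segmentation fault.*/\2 \1/p' | sort -n
}

# exit_signal (hypothetical helper): for a completion code above 128,
# print the signal number it encodes (code minus 128); otherwise print 0.
exit_signal() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}

# Example use (the leave file name is a placeholder):
#   list_segv_ranks < "$LEAVE_FILE"
#   exit_signal 139
```

Applied to the log above, this would list ranks 6, 28, 50, 74, 98, 122, 153 and 185, each on a different node, which suggests the crash is not confined to one PE or one node. Under the stated assumption, completion code 139 decodes to signal 11 (SIGSEGV on Linux), consistent with the reported segmentation faults.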
Change History (5)
comment:1 Changed 5 years ago by simon
comment:2 Changed 5 years ago by gmann
Hi Simon,
Thanks for your help — and sorry for my slow reply.
Yes, the cores produced by the two seg-faulting jobs are at:
/work/n02/n02/gmann/um/xlypk/core
and
/work/n02/n02/gmann/um/xlypn/core
I wondered if it might be a memory problem as I'd added a few extra diags (size-resolved removal fluxes for nitrate and ammonium) which weren't in the previous job that ran OK.
So I reduced the number of extra diagnostics, but it still failed.
Any help you can give here would be much appreciated.
Many thanks for your help,
Cheers
Graham
comment:3 Changed 5 years ago by ih280
Hello,
I am getting the same error message for my job xlplq and was wondering whether there is any advice on how to proceed.
/home/n02/n02/ih280/output/xlplq000.xlplq.d16014.t112336.comp.leave
/home/n02/n02/ih280/output/xlplq000.xlplq.d16014.t112336.leave
with the data being in /work/n02/n02/ih280/um/xlplq
Best wishes,
Ines
comment:4 Changed 5 years ago by grenville
Ines
The atp problem will be fixed if you use atp/1.8.3. Add
module unload atp
module load atp/1.8.3
to the SUBMIT file.
I have a hand edit for UM 8.2 (/home/grenville/umui_jobs/hand_edits/atp-version) which you could modify for UM 7.3. That won't fix the seg fault, but will give a stack trace which may help track it down.
Alternatively, and maybe easier, add these two lines to your .profile on ARCHER.
Grenville
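For the .profile route, a minimal sketch of adding the two module lines without duplicating them on repeated runs. The helper name `add_atp_lines` is hypothetical, and the sketch assumes the module is spelled atp, as in the version atp/1.8.3 named in the comment above.

```shell
#!/bin/sh
# add_atp_lines FILE (hypothetical helper): append the two module commands
# to FILE unless the "module load atp/1.8.3" line is already present.
add_atp_lines() {
    profile=$1
    # -x matches the whole line, -F treats the pattern as a fixed string;
    # a missing file counts as "not present" and is created by the append.
    if ! grep -qxF 'module load atp/1.8.3' "$profile" 2>/dev/null; then
        printf '%s\n' 'module unload atp' 'module load atp/1.8.3' >> "$profile"
    fi
}

# Example use on ARCHER:
#   add_atp_lines "$HOME/.profile"
```

The new module version only takes effect in shells started after the edit, so the job would need to be resubmitted from a fresh login.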
comment:5 Changed 4 years ago by ros
- Resolution set to answered
- Status changed from new to closed
Ticket discussion moved to email.
Hi,
Do you have a copy of the coredump?
Simon.