Opened 19 months ago

Closed 9 months ago

#1744 closed help (answered)

v7.3 UKCA job seg-faulting at start-up with "Back-end never delivered its pid" atpAppSigHandler error

Reported by: gmann
Owned by: um_support
Priority: normal
Component: UM Model
Keywords: ukca
Cc:
Platform: ARCHER
UM Version: 7.3

Description

I am having a problem with my v7.3 UM-UKCA job failing with an error I have not seen
before. It seems to compile and reconfigure OK but then crashes (I think immediately)
with a seg-fault. It dumps a core and gives the error message shown below.

The two jobs are xlypk and xlypn; both seem to be suffering the same failure.

The two jobs are very similar; both use some updated code and additional diagnostics.

The job xlypt runs OK — see the leave file xlypt000.xlypt.d15320.t162348.leave in directory:

/work/n02/n02/gmann/UM_output_Files_11Nov2015_to_20Nov2015/

The error is below.

Has this error message been encountered before?

If you could give me any pointers as to what the likely cause of the problem is that would be much appreciated.

Many thanks for your help,

Cheers
Graham

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 03257] [c0-2c2s14n1] [Thu Nov 26 10:15:10 2015] PE RANK 50 exit signal Segmentation fault
[NID 03257] 2015-11-26 10:15:11 Apid 18838319: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 03262] [c0-2c2s15n2] [Thu Nov 26 10:15:10 2015] PE RANK 153 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03263] [c0-2c2s15n3] [Thu Nov 26 10:15:10 2015] PE RANK 185 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03255] [c0-2c2s13n3] [Thu Nov 26 10:15:10 2015] PE RANK 6 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03259] [c0-2c2s14n3] [Thu Nov 26 10:15:10 2015] PE RANK 98 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03261] [c0-2c2s15n1] [Thu Nov 26 10:15:10 2015] PE RANK 122 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03258] [c0-2c2s14n2] [Thu Nov 26 10:15:10 2015] PE RANK 74 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 03256] [c0-2c2s14n0] [Thu Nov 26 10:15:10 2015] PE RANK 28 exit signal Segmentation fault
xlypk: Run failed
*

Ending script : qsexecute
Completion code : 139
Completion time : Thu Nov 26 10:15:18 GMT 2015

*

/work/n02/n02/gmann/um/xlypk/bin/qsmaster: Failed in qsexecute in model xlypk
*

Starting script : qsfinal
Starting time : Thu Nov 26 10:15:30 GMT 2015

*

/work/n02/n02/gmann/um/xlypk/bin/qsfinal: Error in exit processing after model run
Failed in model executable

/work/n02/n02/gmann/um/xlypk/bin/qsfinal: Model xlypk - Error: No history files
*

Ending script : qsfinal
Completion code : 135
Completion time : Thu Nov 26 10:15:31 GMT 2015

*
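
A side note on the log above: the qsexecute completion code of 139 is consistent with the usual shell convention of reporting 128 plus the signal number when a process is killed by a signal, and signal 11 is SIGSEGV on Linux. The sketch below only illustrates that arithmetic. It assumes qsexecute passes through the shell's exit status unchanged, and the function name describe_exit is hypothetical, not part of the UM scripts.

```shell
#!/bin/sh
# Hypothetical helper: interpret a shell exit status as "128 + signal number".
# Assumption: the completion code is the plain shell exit status of the model.
describe_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l N prints the name of signal number N
        echo "exit code $code = 128 + signal $sig ($(kill -l "$sig"))"
    else
        echo "exit code $code: not a signal-style status"
    fi
}

describe_exit 139
```

This reading applies to the qsexecute code only; the qsfinal completion code comes from the exit-processing script and need not follow the same convention.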

Change History (5)

comment:1 Changed 19 months ago by simon

Hi,

Do you have a copy of the coredump?

Simon.

comment:2 Changed 19 months ago by gmann

Hi Simon,
Thanks for your help — and sorry for my slow reply.

Yes, the cores produced by the two seg-faulting jobs are at:

/work/n02/n02/gmann/um/xlypk/core

and

/work/n02/n02/gmann/um/xlypn/core

I wondered if it might be a memory problem as I'd added a few extra diags (size-resolved removal fluxes for nitrate and ammonium) which weren't in the previous job that ran OK.

So I reduced the number of extra diagnostics, but it still failed.

Any help you can give here would be much appreciated.

Many thanks for your help,

Cheers
Graham

comment:3 Changed 18 months ago by ih280

Hello,

I am getting the same error message for my job xlplq and was wondering whether there is any advice on how to proceed. The leave files are:
/home/n02/n02/ih280/output/xlplq000.xlplq.d16014.t112336.comp.leave
/home/n02/n02/ih280/output/xlplq000.xlplq.d16014.t112336.leave
and the data is in /work/n02/n02/ih280/um/xlplq

Best wishes,
Ines

comment:4 Changed 18 months ago by grenville

Ines

The atp problem will be fixed if you use atp/1.8.3. Add

module unload atp
module load atp/1.8.3

to the SUBMIT file.

I have a hand edit for UM 8.2 (/home/grenville/umui_jobs/hand_edits/atp-version) which you could modify for UM 7.3. Using atp/1.8.3 won't fix the seg fault, but it will give a stack trace which may help track it down.

Alternatively, and maybe easier, add the same two module lines to your .profile on ARCHER.

Grenville
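
The .profile variant suggested above could be written along the following lines. This is only a sketch: the wrapper function swap_atp is a hypothetical name, and the guard around the module command is an assumption added so that the file still sources cleanly in shells where module is not defined.

```shell
# Sketch for ~/.profile on ARCHER: replace the default atp module with a chosen version.
# swap_atp is a hypothetical helper name; the version string comes from the advice above.
swap_atp() {
    # Do nothing if the module command is not available in this shell.
    command -v module >/dev/null 2>&1 || return 0
    module unload atp
    module load "atp/$1"
}

swap_atp 1.8.3
```

Running `module list` in a new login shell afterwards should show which atp version is actually loaded.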

comment:5 Changed 9 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed

Ticket discussion moved to email.
