Opened 4 years ago

Closed 4 years ago

#1637 closed help (fixed)

Segmentation fault on Cray

Reported by: JuwonKim
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: MONSooN
UM Version: 8.5

Description

I encountered the following message when a run was killed.

======================================================
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 607, 540, 606, 0
_pmiu_daemon(SIGCHLD): [NID 00098] [c0-0c1s8n2] [Wed Aug 26 18:42:43 2015] PE RANK 604 exit signal Killed
[NID 00098] 2015-08-26 18:42:43 Apid 155115: initiated application termination
xlpug: Run failed
======================================================
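When comparing several failed runs, it can help to pull the signal and the affected ranks out of each leave file. The following is a minimal sketch, assuming the messages have exactly the layout quoted above; the function name `summarize_crash` and the regular expressions are illustrative and not part of the UM or the Cray tools.

```python
import re

def summarize_crash(log_text):
    """Extract the fatal signal and the ranks named in a leave-file excerpt."""
    signal = re.search(r"Process died with signal (\d+): '([^']+)'", log_text)
    dumped = re.search(r"Forcing core dumps of ranks ([\d, ]+)", log_text)
    killed = re.findall(r"PE RANK (\d+) exit signal (\w+)", log_text)
    return {
        "signal": (int(signal.group(1)), signal.group(2)) if signal else None,
        "core_dump_ranks": [int(r) for r in dumped.group(1).split(",") if r.strip()]
        if dumped else [],
        "killed_ranks": [(int(rank), sig) for rank, sig in killed],
    }

# Excerpt copied from the message quoted in this ticket
sample = """Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 607, 540, 606, 0
_pmiu_daemon(SIGCHLD): [NID 00098] [c0-0c1s8n2] [Wed Aug 26 18:42:43 2015] PE RANK 604 exit signal Killed
"""

summary = summarize_crash(sample)
print(summary)
```

Running this over the leave files of repeated attempts shows quickly whether the same ranks fail each time.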

compile job: xlpuf
compile leave file: jukim@xcml00:/home/jukim/output/xlpuf000.xlpuf.d15229.t114120.comp.leave

run job: xlpug
run leave file: jukim@xcml00:/home/jukim/output/xlpug000.xlpug.d15238.t155225.leave

I'm not sure whether the compile configuration is correct, in particular the settings for memory, module loads, etc.

Thanks,

Juwon

Change History (11)

comment:1 Changed 4 years ago by grenville

Juwon

I have taken a copy of this job for testing. I ran the job for 1249 time steps (I had switched off hyperthreading) and it failed with the same error as did your job. We are looking for the solution to the problem.

Grenville

comment:2 Changed 4 years ago by grenville

Juwon

Please increase the GCOM collectives limit to 2000 in model selection→Independent section options → miscellan..

My copy of your job ran OK for 1800 timesteps with this setting (then stopped because it "Reached end of atmosphere LBC file").

I changed Summation type to Double-Double Precision Reproducible also, but that may not be necessary.

My experience was that running with hyperthreads on slowed the performance - you may want to test that out more fully.

Grenville

comment:3 Changed 4 years ago by JuwonKim

Dear Grenville

I increased the GCOM collectives limit to 2000 and switched off hyperthreads (but the number of OpenMP threads is 2). I ran the job for 399 time steps and it failed with the same error.
Do I need to recompile?

Run leave file: jukim@xcml00:/home/jukim/output/xlpug000.xlpug.d15250.t122120.leave

Juwon

comment:4 Changed 4 years ago by grenville

Juwon

I ran with Summation type "Fast but Non Reproducible" and my run failed at ts 1549 - please try changing to "Double-Double Precision Reproducible"
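As background on why the summation type can matter: floating-point addition is not associative, so a global sum can change with the order in which the partial sums are combined. The sketch below is only a Python illustration of that effect, not GCOM's implementation; `math.fsum` stands in for a reproducible, higher-precision summation, and the values are made up.

```python
import math

def naive_sum(xs):
    """Plain left-to-right accumulation, as a simple stand-in for a fast sum."""
    total = 0.0
    for x in xs:
        total += x
    return total

# The same four numbers in two different orders (illustrative values only)
values = [1e16, 1.0, -1e16, 1.0]
reordered = [1.0, 1.0, 1e16, -1e16]

# The naive sum depends on the order of the operands
print(naive_sum(values), naive_sum(reordered))    # 1.0 2.0

# An exactly rounded sum gives the same answer for both orders
print(math.fsum(values), math.fsum(reordered))    # 2.0 2.0
```

A sum that depends on operand order can differ between otherwise identical runs, which makes intermittent failures harder to reproduce.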

I see no reason why you need to recompile, but it won't do any harm to try.

Grenville

comment:5 Changed 4 years ago by JuwonKim

Dear Grenville

I still get the same error message with 1 OpenMP thread, hyperthreads off, a GCOM collectives limit of 2000, and the Double-Double Precision Reproducible summation type.

The first 3-hour run stopped with "Reached end of atmosphere LBC file" at 1800 timesteps, as yours did, but the CRUN failed with the error again at 3199 timesteps.

Run leave file (stopped at 1800 timesteps): jukim@xcml00:/home/jukim/output/xlpug000.xlpug.d15259.t134604.leave

Run leave file (stopped at 3199 timesteps): jukim@xcml00:/home/jukim/output/xlpug000.xlpug.d15259.t213236.leave

I tried the same run many times; sometimes it stopped earlier than 1800 timesteps with the error message. I think my execution is somewhat unstable, and I'm not sure whether the compile configuration is correct.

Juwon

comment:6 Changed 4 years ago by grenville

Juwon

I'm not ignoring this ticket; I don't yet know what's going wrong. I see you are running on MONSooN. Have you found the solution?

Grenville

comment:7 Changed 4 years ago by JuwonKim

Dear Grenville

No. I only found that there is no error when using the IO server, and only during the NRUN.
My run length is 7 hours and the job time limit is 3 hours. The run does not finish within 3 hours because the output is so large, so I need to resubmit with the CRUN option. The problem is that I encounter the same error in the CRUN, and I don't know why either.

Juwon

comment:8 Changed 4 years ago by JuwonKim

Dear Grenville

The error messages in comment 5 and comment 7 are different.
The error in comment 5 seems to go away when I also use the IO server.
The error in comment 7 looks related to the number of times radiation increments are applied.
I have not yet tested whether the CRUN ends normally once the radiation increments timestep is matched.

Thanks

Juwon

comment:9 Changed 4 years ago by JuwonKim

Dear Grenville

I've hit another error message during the CRUN.
I adjusted the radiation increments timestep for the CRUN.
The messages are as follows:
======================================================================
lib-4171 : UNRECOVERABLE library error
  An output list item is incompatible with its data edit-descriptor.

Encountered during a sequential formatted WRITE to unit 6
Fortran unit 6 is connected to a sequential formatted text file:
  "/projects/dymecs/jukim/xlpuj/pe_output/xlpuj.fort6.pe23"
Current format: (A,L1)

lib-4171 : UNRECOVERABLE library error
  An output list item is incompatible with its data edit-descriptor.

Encountered during a sequential formatted WRITE to unit 6
Fortran unit 6 is connected to a sequential formatted text file:
  "/projects/dymecs/jukim/xlpuj/pe_output/xlpuj.fort6.pe26"
Current format: (A,L1)

lib-4171 : UNRECOVERABLE library error
  An output list item is incompatible with its data edit-descriptor.

Encountered during a sequential formatted WRITE to unit 6
Fortran unit 6 is connected to a sequential formatted text file
========================================================================
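The format reported in this message, `(A,L1)`, expects a character item followed by a logical item; the runtime raises this error when an item in the output list has a different type from its edit descriptor. The sketch below mimics that check in Python purely for illustration. The descriptor-to-type mapping is a simplification, repeat counts are not handled, and the `l_flag` label is hypothetical: the ticket does not identify which WRITE statement failed.

```python
import re

# Simplified, assumed mapping from Fortran edit-descriptor letters to Python types
DESCRIPTOR_TYPES = {"A": str, "L": bool, "I": int, "F": float, "E": float}

def find_mismatches(fmt, items):
    """Return (position, descriptor, actual type name) for each incompatible item."""
    letters = [m.upper() for m in re.findall(r"([A-Za-z])\d*(?:\.\d+)?", fmt)]
    mismatches = []
    for position, (letter, item) in enumerate(zip(letters, items), start=1):
        expected = DESCRIPTOR_TYPES.get(letter)
        # type() is compared directly because bool is a subclass of int in Python
        if expected is not None and type(item) is not expected:
            mismatches.append((position, letter, type(item).__name__))
    return mismatches

# An integer passed where the L descriptor expects a logical value
bad = find_mismatches("(A,L1)", ["l_flag = ", 1])
# A logical value in the second position matches the format
good = find_mismatches("(A,L1)", ["l_flag = ", True])
print(bad, good)
```

Here `bad` reports the second item as incompatible with `L`, while `good` is empty.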

Run leave file: jukim@xcml00:/home/jukim/output/xlpuj000.xlpuj.d15288.t100111.leave
The job ID is xlpuj.

Have you seen this before?

Juwon

comment:10 Changed 4 years ago by JuwonKim

Dear Grenville

I found what the problem is.
The times of the CRUN and the re-initialized pp file did not match.
No more problems for this run, thanks.
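A timing mismatch of this kind can be checked before resubmitting by testing whether the restart timestep falls on a boundary of each periodic setting. This is a minimal sketch, not UM logic: the function name and the interval values are hypothetical, and only the restart step of 1800 is taken from this ticket.

```python
def misaligned_intervals(restart_step, intervals):
    """Return the names of periodic intervals (in timesteps) that the
    restart step does not fall exactly on."""
    return [name for name, period in intervals.items()
            if restart_step % period != 0]

# Hypothetical periods in timesteps; substitute the values from the job setup
intervals = {"radiation_call": 150, "pp_reinitialisation": 700}

problems = misaligned_intervals(1800, intervals)
print(problems)
```

With these made-up periods, the radiation interval divides 1800 evenly but the pp re-initialisation period does not, so only the latter is reported.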

Juwon

comment:11 Changed 4 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Juwon

Glad you solved this.

Grenville
