#2411 closed help (fixed)

Error during the main atmos step on ARCHER

Reported by: nfreychet
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 10.7

Description

Hello,

I am running the suite u-av080 on ARCHER. During the atmos_main step, before the model even starts, I get an error I can't understand:

_pmiu_daemon(SIGCHLD): [NID 00149] [c0-0c2s5n1] [Thu Feb 22 11:50:06 2018] PE RANK 12 exit signal Segmentation fault
[15] exceptions: [backtrace]: (  7) : line information unavailable error code: 10 (Unable to get function name)
[15] exceptions: [backtrace]: (  8) : Address: [0x02bc4743] 
[141] exceptions: [backtrace]: (  1) : line information unavailable error code: 10 (Unable to get function name)
[141] exceptions: [backtrace]: (  2) : Address: [0x00419750] 
[FAIL] um-atmos # return-code=137
Received signal ERR

I cannot find any clear ERROR message in the outputs. The input files seem to be read correctly and I can see no indication of a missing variable.

The full output on ARCHER: /work/n02/n02/nfreyche/cylc-run/u-av080/log/job/19880901T0000Z/atmos_main/

I also seem to have enough space on ARCHER's /work filesystem, so I don't think that is the cause of the problem.

Cheers,
Nico

Change History (6)

comment:1 Changed 20 months ago by ros

Hi Nico,

Does the suite run successfully without your code modifications?

Cheers,
Ros.

comment:2 Changed 20 months ago by nfreychet

Hi Ros,

I tried running the same suite without my modifications (u-av396). I still had to set the ARCHER parameters and update the postproc section, but that shouldn't affect the main atmos step. It still crashes, though not with the same error:

[187] exceptions: [backtrace]: (  4) : Address: [0x00417ceb] 
[189] exceptions: [backtrace]: (  3) : signal_do_backtrace in file /fs2/n02/n02/nfreyche/cylc-run/u-av396/share/fcm_make_um/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions.c line 267
[189] exceptions: [backtrace]: (  4) : Address: [0x00417ceb] 
[NID 04455] 2018-02-23 11:12:22 Apid 30099524: initiated application termination
[119] exceptions: [backtrace]: (  4) : line information unavailable error code: 10 (Unable to get function name)
[119] exceptions: [backtrace]: (  5) : Address: [0x0272c770] 
[35] exceptions: [backtrace]: (  4) : line information unavailable error code: 10 (Unable to get function name)
[35] exceptions: [backtrace]: (  5) : Address: [0x0272c770] 
[FAIL] um-atmos # return-code=137
Received signal ERR

In the log.err there are a lot of "exceptions" messages; I'm not sure how serious they are.
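When a log is flooded with backtrace lines like this, it can help to first list which PE ranks actually reported a segmentation fault. The following is only a sketch: `summarize_segv` is a hypothetical helper name, not part of the UM, Rose or Cylc, and it assumes the `_pmiu_daemon(SIGCHLD)` lines have the same layout as the ones quoted in this ticket.

```shell
#!/bin/sh
# summarize_segv: hypothetical helper, not part of the UM, Rose or Cylc.
# Reads an error log on stdin and prints the PE ranks that were reported
# as "exit signal Segmentation fault", sorted numerically, one per line,
# with duplicates removed. Assumes the line layout quoted in this ticket.
summarize_segv() {
    sed -n 's/.*PE RANK \([0-9][0-9]*\) exit signal Segmentation fault.*/\1/p' \
        | sort -n -u
}
```

Assuming the error output is in a file called log.err, `summarize_segv < log.err` prints one rank per line. Lines that do not match, such as the `exceptions: [backtrace]` messages, are ignored.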

EDIT: I just saw that I hadn't added "module load cdt/15.11" to my suite as I should have. I will try with that to see if it solves some of the problems.

Nico


comment:3 Changed 20 months ago by nfreychet

So after adding "module load cdt/15.11" I'm back to the same error as in my modified suite:

[15] exceptions: [backtrace]: (  3) : line information unavailable error code: 10 (Unable to get function name)
[15] exceptions: [backtrace]: (  4) : Address: [0x00417ceb] 
_pmiu_daemon(SIGCHLD): [NID 00850] [c4-0c1s4n2] [Fri Feb 23 12:13:46 2018] PE RANK 7 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 00858] [c4-0c1s6n2] [Fri Feb 23 12:13:46 2018] PE RANK 21 exit signal Segmentation fault
[15] exceptions: [backtrace]: (  4) : line information unavailable error code: 10 (Unable to get function name)
[15] exceptions: [backtrace]: (  5) : Address: [0x0272f170] 
[NID 00850] 2018-02-23 12:13:46 Apid 30099723: initiated application termination
[35] exceptions: [backtrace]: (  1) : line information unavailable error code: 10 (Unable to get function name)
[35] exceptions: [backtrace]: (  2) : Address: [0x00419750] 
[179] exceptions: [backtrace]: (  1) : line information unavailable error code: 10 (Unable to get function name)
[179] exceptions: [backtrace]: (  2) : Address: [0x00419750] 
[177] exceptions: [backtrace]: (  1) : line information unavailable error code: 10 (Unable to get function name)
[177] exceptions: [backtrace]: (  2) : Address: [0x00419750] 
[189] exceptions: [backtrace]: (  1) : line information unavailable error code: 10 (Unable to get function name)
[191] exceptions: [backtrace]: (  1) : line information unavailable error code: 10 (Unable to get function name)
[189] exceptions: [backtrace]: (  2) : Address: [0x00419750] 
[191] exceptions: [backtrace]: (  2) : Address: [0x00419750] 
[165] exceptions: [backtrace]: (  1) : line information unavailable error code: 10 (Unable to get function name)
[165] exceptions: [backtrace]: (  2) : Address: [0x00419750] 
[FAIL] um-atmos # return-code=137
Received signal ERR

Could it be due to wrong ancil files? (But as I haven't changed anything, why would there be a problem if the suite was originally working?)

Otherwise, is there a reference suite (AMIP style) to run on ARCHER with vn10.7? That might help.

Cheers,

Nico

comment:4 Changed 20 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Nico,

Please use the following set of modules:

module load cray-netcdf/4.4.1.1
module load cray-hdf5/1.10.0.1
module swap cray-mpich/7.5.5 cray-mpich/7.2.6

Please do not use the cdt module.

The original suite then runs successfully.
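To confirm that a job environment really ends up with these versions, one option is to check the output of `module list` against the names above. This is a minimal sketch, not an ARCHER or UM tool: `check_um_modules` is a hypothetical helper name, and it does a simple substring match on whatever text it is given.

```shell
#!/bin/sh
# check_um_modules: hypothetical helper, not part of the UM, Rose or ARCHER tooling.
# Reads the text of `module list` on stdin and reports whether the three
# module versions named in this ticket appear in it, and whether a cdt
# module is still loaded. Returns non-zero if anything is unexpected.
check_um_modules() {
    loaded=$(cat)
    status=0
    for wanted in cray-netcdf/4.4.1.1 cray-hdf5/1.10.0.1 cray-mpich/7.2.6; do
        # Plain substring match on the listed module names.
        if printf '%s\n' "$loaded" | grep -F -q -- "$wanted"; then
            echo "ok: $wanted"
        else
            echo "missing: $wanted"
            status=1
        fi
    done
    if printf '%s\n' "$loaded" | grep -q 'cdt/'; then
        echo "unexpected: cdt module is loaded"
        status=1
    fi
    return "$status"
}
```

A possible use is `module list 2>&1 | check_um_modules`, placed after the module commands. The `2>&1` is there because many Environment Modules installations print the list to standard error; if yours prints to standard output, the redirection is harmless.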

Cheers,
Ros.

comment:5 Changed 20 months ago by nfreychet

Hi Ros,

I made the required changes and it works well now.

Thanks a lot for your help!
You can close the ticket.

Cheers,

Nico

comment:6 Changed 20 months ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Great. Thanks for letting us know.

Cheers,
Ros.
