Opened 4 years ago

Closed 3 years ago

#1779 closed error (answered)

UM 10.3 reconfiguration segmentation faults

Reported by: swr05npk
Owned by: annette
Component: UM Reconfiguration
Keywords: ATP
Cc:
Platform: ARCHER
UM Version: 10.3

Description

Note: I selected "10.2" for the UM version as there is no entry for 10.3.

I have tried to run two Rose suites on ARCHER:
u-ab304, which is a copy of the Met Office standard GA6 N96 AMIP job
u-ab340, which is a copy of Annette's "simple N48 job"

Both suites reach the reconfiguration stage, whereupon both promptly segfault. Here is an example error trace (from u-ab304 in this case):

[0] exceptions: Program is "/work/n02/n02/pappas/cylc-run/u-ab304/share/fcm_make_um/build-recon/bin/um-recon.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[0] exceptions: No backtrace function defined for this platform
[NID 00170] 2015-12-18 13:20:54 Apid 19447117: initiated application termination
[FAIL] um-recon # return-code=137
Received signal ERR
cylc (scheduler - 2015-12-18T13:20:55Z): CRITICAL Task job script received signal ERR at 2015-12-18T13:20:55Z
cylc (scheduler - 2015-12-18T13:20:55Z): CRITICAL recon.19880901T0000Z failed at 2015-12-18T13:20:55Z

There are log files in /home/n02/n02/pappas/cylc-run/u-ab304 and /home/n02/n02/pappas/cylc-run/u-ab340. I can't see any useful information in them, though.

How do I get more information from the suite about these errors? I can't find a core dump anywhere, nor do I know how to turn on debugging in a Rose suite. Given that I took a straight copy of Annette's suite and just changed the username, I would have expected it to work.
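As a small aid to reading the trace above, here is a minimal sketch of how a return code such as 137 can be interpreted. It assumes the common shell convention that a value above 128 means "128 plus the signal number"; whether Rose and aprun follow that convention exactly is an assumption, and the helper name `describe_return_code` is hypothetical.

```python
import signal


def describe_return_code(code: int) -> str:
    """Interpret a shell-style return code.

    Assumption: codes above 128 follow the usual "128 + signal number"
    convention. The launcher on a given platform may differ.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


# The value reported in the log above.
print(describe_return_code(137))
```

Under that convention, 137 corresponds to signal 9 (SIGKILL), which is consistent with the "initiated application termination" line. The return code alone therefore describes how the job was killed, not the original fault.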

Change History (6)

comment:1 Changed 4 years ago by annette

  • Component changed from UM Model to UM Reconfiguration
  • Owner changed from um_support to annette
  • Status changed from new to assigned
  • UM Version changed from 10.2 to 10.3

Hi Nick,

I'm not sure why your suites have failed either. I also would have expected the copy of my suite to work. I have what should be an identical version in the puma repository, and we had several people run this on the UM course last week.

I will investigate and get back to you.

Annette

comment:2 Changed 4 years ago by annette

Nick,

I have re-run my copy of the suite and I don't get any errors.

I think I might know why you don't get an error message, though: it looks like we are missing a setting in the ARCHER config file.

In your roses suite directory, edit the file app/fcm_make/rose-app.conf and set the following variable:

config_root_path=/home/annette/um/work_dir/vn10.3_archer

Now re-submit the run (with compilation on), and let me know if you get any extra information.

If this doesn't work, Grenville was experimenting with ATP trace-backs in Rose suites, but I'm not sure if he is working this week or not.

Annette

comment:3 Changed 4 years ago by annette

  • Status changed from assigned to pending

comment:4 Changed 4 years ago by swr05npk

Hi Annette,

I have changed that path in both suites. Both still segfault, but with a slightly different error message:

[0] exceptions: An exception was raised:11 (Segmentation fault)
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=45292 on host nid02425
[0] exceptions: Program is "/work/n02/n02/pappas/cylc-run/u-ab304/share/fcm_make_um/build-recon/bin/um-recon.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[0] exceptions: Data address (si_addr): (nil) eip/rip: (nil)
[0] exceptions: Not calling backtrace_symbols() due to SIGSEGV
[0] exceptions: [backtrace]: (  0) : Address: (nil) 
[0] exceptions: [backtrace]: (  0) : line information unavailable error code: 2 (* Cannot Locate *)
[0] exceptions: [backtrace]: (  1) : Address: 0x40e3d1 
[0] exceptions: [backtrace]: (  1) : line information unavailable error code: 2 (* Cannot Locate *)
[0] exceptions: [backtrace]: (  2) : Address: 0x40e8a0 
[0] exceptions: [backtrace]: (  2) : line information unavailable error code: 2 (* Cannot Locate *)
[0] exceptions: [backtrace]: (  3) : Address: 0x74dbb0 
[0] exceptions: [backtrace]: (  3) : __restore_rt
 in file sigaction.c line 0
[0] exceptions: 
[0] exceptions: To find the source line for an entry in the backtrace;
[0] exceptions: run addr2line --exe=</path/too/executable> <address>
[0] exceptions: where address is given as [0x<address>] above
[0] exceptions: 
[NID 02425] 2015-12-21 16:12:33 Apid 19481825: initiated application termination
[FAIL] um-recon # return-code=137
Received signal ERR

Running the addr2line command suggested in the output does not produce anything useful, probably because I do not have debugging symbols switched on.
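For reference, the lookup suggested in the trace can be sketched as follows. This is a minimal illustration, not part of the UM or Rose tooling: it pulls the non-nil frame addresses out of a log excerpt and builds a single addr2line command for them. The helper names and the `SAMPLE_LOG` constant are hypothetical; the executable path and addresses are copied from the trace above. The `-e`, `-f` and `-C` options are standard addr2line flags.

```python
import re

# Matches lines such as:
#   [0] exceptions: [backtrace]: (  1) : Address: 0x40e3d1
# Frames whose address is "(nil)" do not match and are skipped.
ADDRESS_LINE = re.compile(
    r"\[backtrace\]:\s*\(\s*(\d+)\)\s*:\s*Address:\s*(0x[0-9a-fA-F]+)"
)


def backtrace_addresses(log_text):
    """Return (frame number, address) pairs for frames with a real address."""
    return [(int(m.group(1)), m.group(2)) for m in ADDRESS_LINE.finditer(log_text)]


def addr2line_command(executable, addresses):
    """Build an addr2line argument list.

    -f prints function names, -C demangles them, -e names the executable.
    """
    return ["addr2line", "-f", "-C", "-e", executable] + list(addresses)


# Excerpt of the trace above, used here only as sample input.
SAMPLE_LOG = """\
[0] exceptions: [backtrace]: (  0) : Address: (nil)
[0] exceptions: [backtrace]: (  1) : Address: 0x40e3d1
[0] exceptions: [backtrace]: (  2) : Address: 0x40e8a0
[0] exceptions: [backtrace]: (  3) : Address: 0x74dbb0
"""

EXECUTABLE = (
    "/work/n02/n02/pappas/cylc-run/u-ab304/share/fcm_make_um/"
    "build-recon/bin/um-recon.exe"
)

frames = backtrace_addresses(SAMPLE_LOG)
cmd = addr2line_command(EXECUTABLE, [addr for _, addr in frames])
print(" ".join(cmd))
```

addr2line can only report a source file and line number if the executable was built with debug information (for example with `-g`); without it, the output is typically `??:0`, although `-f` may still recover some function names from the symbol table.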

comment:5 Changed 4 years ago by grenville

Nick

The UM signal handling in the later versions is interfering with ATP in some way. I can get ATP to function correctly if I remove the call to umSetApplicationExceptions in appInit (control/top_level/um_config.F90); this was Paul Selwood's suggestion. You need atp/1.8.3 on ARCHER: the default (1.7.5) gives errors. I'm sure I told ARCHER about this.

This may help a bit. I still don't know why the model isn't generating a core file.

Grenville

comment:6 Changed 3 years ago by annette

  • Keywords ATP added
  • Resolution set to answered
  • Status changed from pending to closed

Nick,

I am closing this ticket due to lack of activity. You can, of course, reopen it if you require further assistance.

Annette
