Opened 9 years ago

Closed 8 years ago

#488 closed help (fixed)

Core files on MONSooN

Reported by: kipling
Owned by: lois
Component: MONSooN
Keywords:
Cc:
Platform:
UM Version: 7.3

Description

Using UM version 7.1 on HECToR, I was able to get core files when the model crashed by adding the branch fcm:um_br/dev/ros/VN7.1_generate_core (which just adds "ulimit -c unlimited" to qsexecute); these allowed post-mortem debugging of the crash.
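For reference, the effect of that change can be sketched as the shell fragment below. This is a minimal illustration, not the contents of the branch: the function name enable_core_dumps is hypothetical, and unlike a bare "ulimit -c unlimited" it only raises the soft limit as far as the hard limit, which an unprivileged process cannot exceed.

```shell
#!/bin/sh
# Hypothetical helper (not part of the UM scripts): raise the soft core-file
# size limit to the hard limit, so that an executable launched from this
# shell is allowed to write a core file when it crashes.
enable_core_dumps() {
    hard_limit=$(ulimit -H -c)              # "unlimited" or a block count
    ulimit -S -c "$hard_limit" || return 1  # soft limit may not exceed hard
    echo "core limit: soft=$(ulimit -S -c) hard=${hard_limit}"
}

enable_core_dumps
# The model executable would be launched after this point, in the same shell.
```

If the reported hard limit is 0, no core file can be written regardless of what the job script requests, and the limit would have to be raised by the system administrators.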

However, using version 7.3 on MONSooN, this doesn't appear to work with the equivalent branch (fcm:um_br/dev/ros/VN7.3_generate_core); e.g. my job xfgla was crashing with SIGFPE but not producing a core file.

A little digging suggests this relates to the SIGNAL_TRAP(0) call in UM_SHELL, which is only enabled for the IBM arch. Changing this to SIGNAL_TRAP(1) does lead to a core file being produced, but apparently from the wrong thread:

$ dbx bin/xfgla.exe core 
Type 'help' for help.
warning: The core file is not a fullcore. Some info may
not be available.
[using memory image in core]
reading symbolic information ...

Floating point exception in _event_sleep at 0x90000000036baa4
0x90000000036baa4 (_event_sleep+0x108) e8410028          ld   r2,0x28(r1)
(dbx) where
_event_sleep(??, ??, ??, ??, ??, ??) at 0x90000000036baa4
_p_sigtimedwait(??, ??, ??) at 0x900000000370bc4
pth_signal.sigwait(??, ??) at 0x900000000371cd4
pm_async_thread(??) at 0x900000000d7e5c8

By contrast, the .leave file appears to contain a correct backtrace, in this case from STASH.

My understanding is that on AIX a "fullcore" file is required for cross-thread debugging; however the AIX documentation suggests these can only be enabled at a system (rather than per-user or per-process) level…
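Whether full core dumps are currently enabled system-wide can at least be checked without root access. The sketch below assumes the standard AIX lsattr command and the fullcore attribute of the sys0 device; the function name check_fullcore is hypothetical, and on a non-AIX host it only reports that the check does not apply.

```shell
#!/bin/sh
# Hypothetical helper: report the system-wide AIX "fullcore" setting.
check_fullcore() {
    if [ "$(uname -s)" = "AIX" ]; then
        # lsattr -E prints the effective attribute values of device sys0
        lsattr -E -l sys0 -a fullcore
    else
        echo "fullcore: not applicable on $(uname -s)"
    fi
}

check_fullcore
```

Changing the attribute (with chdev on sys0) needs root access, so that step would be for the MONSooN administrators.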

(Removing the SIGNAL_TRAP call altogether leads to the SIGFPE being silently ignored.)

Is there a known way to get usable core files from UM7.3 on MONSooN, or should I take this up with their tech people?

Change History (2)

comment:1 Changed 9 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to assigned

We could have a go at looking into this, Zak, but with some CMS people on leave or on courses next week it may be quicker to ask whether the Met Office people can find you a solution for MONSooN. If you don't get the core files you need, we will see what we can do.

Lois

comment:2 Changed 8 years ago by lois

  • Resolution set to fixed
  • Status changed from assigned to closed