Opened 3 years ago

Closed 2 years ago

Last modified 2 years ago

#2035 closed help (fixed)

Unhelpful segmentation fault error

Reported by: mattjbr123
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: MONSooN
UM Version: 10.3

Description

Hi,
I'm having some trouble tracking down the source of the segmentation faults I'm getting when trying to run suite u-af404 on MONSooN. I've turned the logging output up to maximum, but this has seemingly just increased the volume of unhelpful messages.

I'm getting tons of messages like the following in the .err file:

 [215] exceptions: An exception was raised:11 (Segmentation fault)

with the only helpful things in the .out file that I could find being:

THREAD LEVEL REQUESTED is MPL_THREAD_MULTIPLE
THREAD LEVEL SET is MPL_THREAD_MULTIPLE
Application 1830424 exit codes: 11
Application 1830424 exit signals: Killed
Application 1830424 resources: utime ~273s, stime ~99s, Rss ~401408,
inblocks ~949341, outblocks ~4329128

and

IO: Switching file mode to local because there is no IO server

As far as I can tell it is failing in the first timestep, and possibly whilst it is doing STASH-y IO, but other than that there isn't much to go on. Would you be able to advise on the best steps for tracking down the source of the error?
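Is enabling core dumps and a Cray traceback before the run the sort of thing you'd suggest? A minimal sketch of what I had in mind, assuming MONSooN's standard Cray environment provides ATP:

 # In the job environment before the model launches:
 ulimit -c unlimited    # allow core files to be written on a segfault
 export ATP_ENABLED=1   # Cray ATP: should merge backtraces from all
                        # ranks into atpMergedBT.dot when one aborts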

I've attached the full .out and .err files in case you can find anything more helpful in them. I couldn't find much of use in the pe[0-25] log files either.

Many thanks,
Matt
Username: mabro
Project: solar
Suite: u-af404

Attachments (3)

job.out (281.4 KB) - added by mattjbr123 3 years ago.
job.2.out (281.4 KB) - added by mattjbr123 3 years ago.
job.err (1.0 MB) - added by mattjbr123 3 years ago.


Change History (14)

Changed 3 years ago by mattjbr123

Changed 3 years ago by mattjbr123

comment:1 Changed 3 years ago by grenville

Matt

Failing so quickly is usually an input data problem - what did you change since the last successful run?

Grenville

Changed 3 years ago by mattjbr123

comment:2 Changed 3 years ago by ros

Not sure if it's of any help, but looking in the job.err file there is an error from the IOS_INIT routine. It looks like the model is trying to use IO servers, but they're not active??

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 99
?  Error from routine: IOS_INIT:IOS_RUN
?  Error message: IO Server is not active
?  Error from processor: 0
?  Error number: 0
????????????????????????????????????????????????????????????????????????????????
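If it is the IO server setup, the first thing I'd check is what the suite actually sets the IO server task count to. A sketch; IOS_NPROC is the usual variable name in UM Rose suites, but treat that as an assumption:

 # Search the suite definition for the IO server task count:
 grep -rn "IOS_NPROC" ~/roses/u-af404/
 # A value of 0 should disable the IO servers entirely, which would at
 # least be consistent with the "Switching file mode to local" message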

comment:3 Changed 3 years ago by mattjbr123

This specific suite has never run successfully, but it is using the same dump file that it came with. It was adapted from suite u-ae805, to which I've added build/run switches and some other bells and whistles.

I originally thought the same thing about the dump file, but have tried several ones and haven't had any luck there.

I have also seen the IO error and have changed the number of IO servers and OpenMP threads, but that didn't help either; in fact, when I had the IO servers turned on, it gave me different errors saying that they couldn't be turned on, or something to that effect.

At the moment I am running suite u-ad273, which is the same as this suite but without the ensemble functionality and with a few other minor differences, so that I can compare its .out and .err files with this suite's and see if that sheds any light.

comment:4 Changed 3 years ago by mattjbr123

Could it be that it is trying to write lots of stuff to STASH but is running out of memory because there are no IO servers?

comment:5 Changed 3 years ago by mattjbr123

To rule out the input files as the problem: both u-af404 and u-ad273 have the same dump file and ancillaries, and the latter ran successfully - are there any other data files that could be causing an issue?

comment:6 Changed 3 years ago by grenville

Matt

Thanks - we're less experienced than you with the ensemble functionality in Rose; we're looking into this.

Grenville

comment:7 Changed 3 years ago by grenville

Matt

u-ad273 didn't run successfully(?)

/home/mabro/cylc-run/u-ad273/log/job/19810901T0000Z/atmos_main/01/job.out says:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 4
?  Error from routine: INITIAL_4A
?  Error message: INANCILA:integer header error
?  Error from processor: 0
?  Error number: 12
????????????????????????????????????????????????????????????????????????????????
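INANCILA failing on an integer header usually means one of the ancillary files doesn't match what the model expects (wrong grid or resolution, for instance). One way to check is to print the ancil's headers; a sketch using the mule-pumf utility, assuming it is available on MONSooN (the file path is just a placeholder):

 # Dump the headers of a suspect ancillary and eyeball the grid
 # dimensions against the start dump:
 mule-pumf /path/to/suspect_ancil | head -40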

Grenville

comment:8 Changed 3 years ago by mattjbr123

Yes, sorry, it did a few runs back (the logs should be tarred up in my home dir?), except for hitting a walltime limit. Then I changed a few things to make it more similar to u-af404, including removing the n216 optional config and changing the processor decomposition. This introduced the error above for some reason, so I tried adding the n216 optional config back in to see if that was what introduced the problem, but obviously not - that's the run that is currently active.

I'm not sure exactly what I did to cause that specific error (it looks like it's something to do with the ancillary files?), but I will have time to look at it more this afternoon.

Matt

comment:9 Changed 3 years ago by mattjbr123

Sorry for the confusion here - u-ad273 is now running successfully at n216 resolution.

When I changed it back to n216 from n96, it was still using the n96 ancils (perhaps because I did not rebuild/reconfigure the model originally, which I did this time around).

This indicates that it's something to do with the ancillary files (maybe), although I thought I had u-ad273 running successfully at n96 before; I can't remember for certain. The u-ad273 run at n96 also did not throw up the same error as u-af404 (which doesn't seem to give many error messages at all).

I will try running:
  • u-ad273 at n96
  • u-af404 at n216
to confirm whether or not the resolution makes a difference.

As I ran u-ad273 with the '--new' option, I think it deleted all the tarred-up log files. In future I'll copy them out to a logs directory in my home dir on MONSooN.
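Something like the following before each rerun should keep them safe (the paths are from memory, so treat them as approximate):

 # Archive the cylc logs before --new wipes the run directory:
 mkdir -p ~/logs/u-ad273
 cp -r ~/cylc-run/u-ad273/log* ~/logs/u-ad273/
 # ...and only then start completely clean:
 rose suite-run --new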

comment:10 Changed 3 years ago by mattjbr123

Progress Update

Looks like u-ad273 still ran at n216; it seems you have to use the --new flag when running the suite to completely clear everything out and force it to update which ancils it uses.

However, u-af404 still crashed at n216 resolution at exactly the same point, so clearly the resolution is not the problem.

Comparing the two log files, pretty much the next thing the model does after the point where u-af404 crashes is the nudging, so it might be this that's causing the problem.

There are also several extra UKCA STASH variables being requested in u-af404, even though UKCA is also enabled in u-ad273. This leads me to believe that the UKCA part of the model being used is different in some way.

There is an extra fcm code source included in the build for both JULES and the atmos code in u-af404, both having the same name. Comparing the extracted source code shows that most of the differences are in the UKCA files (no surprise). More specifically, the differences look to be to do with parameter scaling of some of the UKCA variables; it may be that this includes stuff you simply don't need that breaks the nudging. So a good first step would be to simply remove this source and see if it changes anything.
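(For reference, the comparison of the extracted source was along these lines; the share paths are from memory and the fcm-make directory name may differ:)

 # Compare what was actually extracted for the two suites, and pick
 # out the UKCA files:
 diff -qr ~/cylc-run/u-af404/share/fcm_make/extract \
          ~/cylc-run/u-ad273/share/fcm_make/extract | grep -i ukca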

Having done this, u-af404 still crashes in exactly the same place.

The UKCA scheme is enabled (with identical settings as far as I can tell) for both suites. However, it can't be the scheme itself that is the problem, because u-ad273 is running fine with it. So again this suggests the problem may be introduced by these code differences, but the previous point suggests not…

Next things to try:
  • removing the extra STASH variables
  • disabling the nudging
  • disabling the UKCA scheme
  • even removing the small code modification required for the ensemble setup

Let me know if you think of anything too!

comment:11 Changed 2 years ago by mattjbr123

  • Resolution set to fixed
  • Status changed from new to closed

Hi guys - I did fix this problem in the end.

By gradually removing all the differences between u-af404 and u-ad273, it turned out the problem was down to some 'fcflags' and 'ldflags' that I had put in for debugging purposes during a previous problem. Removing these seemed to resolve whatever the problem was - although I am not sure why!
One of the flags was -g, but I can't remember what the others were.
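For the record, this is roughly how I tracked the overrides down in the end (the fcm-make app name is from memory):

 # Find where the debug build flags were being set in the suite:
 grep -rnE "fcflags|ldflags" ~/roses/u-af404/app/fcm_make_um/
 # Deleting the debug additions (-g plus whatever else was there)
 # from those overrides is what got the suite running again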

If you can think of why this would cause the above segmentation fault error it would be useful to know, but otherwise you can close this ticket.

And happy Christmas!
