Opened 9 years ago

Closed 9 years ago

#700 closed help (fixed)

Seg faults in converting N48 to N96

Reported by: watson
Owned by: willie
Component: UM Model
Keywords: HadGAM, 6.1, segmentation fault
Cc:
Platform:
UM Version: 6.1

Description

Hello,

I am still trying to get an N96 L60 atmosphere-only version of the UM running on Hector, and have been attempting to convert an N48 L60 version 6.1 job, xfnya, to N96. My job is xgigm. In it I have changed the ancillaries to their standard N96 versions where necessary and changed the number of land points (I changed the ancils for ozone, soil moisture, deep soil temp, Soil:VSMC…, SST, sea ice, land-sea mask, orography and land fraction). The start dump seems to have been reconfigured to N96 resolution successfully: no error messages were given and the fields in the output start dump look sensible. However, attempting to run the model gives the following error message (xgigm000.xgigm.d11275.t114911.leave):

xgigm: Starting run
_pmii_daemon(SIGCHLD): [NID 01115] [c9-1c2s2n3] [Sun Oct 2 11:55:28 2011] PE 54 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): [NID 01113] [c9-1c2s3n3] [Sun Oct 2 11:55:28 2011] PE 6 exit signal Segmentation fault
[NID 01115] 2011-10-02 11:55:29 Apid 1277253: initiated application termination
diff: /work/n02/n02/watson/tmp/tmp.hector-xe6-13.29874/xgigm.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/watson/xgigm/xgigm.thist to backup thist file /work/n02/n02/watson/xgigm/xgigm.thist_keep
xgigm: Run failed
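
The PEs that died can be pulled out of a .leave file mechanically, which helps when many ranks fail at once. A minimal Python sketch, assuming the launcher lines always have the layout seen in the log above; the regular expression and the function name `failed_pes` are illustrative and not part of the UM or the Cray tools:

```python
import re

# Layout inferred from the two _pmii_daemon lines in this ticket only;
# other systems or launcher versions may print something different.
SIGNAL_LINE = re.compile(
    r"_pmii_daemon\(SIGCHLD\): \[NID (?P<nid>\d+)\] \[(?P<node>[^\]]+)\] "
    r"\[(?P<when>[^\]]+)\] PE (?P<pe>\d+) exit signal (?P<signal>.+)"
)


def failed_pes(log_text):
    """Return sorted (pe, nid, node, signal) tuples for PEs killed by a signal."""
    hits = []
    for line in log_text.splitlines():
        match = SIGNAL_LINE.search(line)
        if match:
            hits.append(
                (int(match["pe"]), match["nid"], match["node"], match["signal"].strip())
            )
    return sorted(hits)


# The two lines below are copied from the .leave file quoted above.
sample = """\
xgigm: Starting run
_pmii_daemon(SIGCHLD): [NID 01115] [c9-1c2s2n3] [Sun Oct 2 11:55:28 2011] PE 54 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): [NID 01113] [c9-1c2s3n3] [Sun Oct 2 11:55:28 2011] PE 6 exit signal Segmentation fault
xgigm: Run failed
"""

for pe, nid, node, signal in failed_pes(sample):
    print(f"PE {pe} on NID {nid} ({node}): {signal}")
```

This only reports which ranks failed; it says nothing about where in the code they failed, which is what the debug modset is meant to help with.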

I searched old tickets for similar errors. Ticket 406 suggested including the script /home/n02/n02/jwc/um/vn6.1/mods/core.mu and the source file /home/n02/n02/jwc/um/vn6.1/mods/debug.mf77 to give more informative error output. I put these in Sub-Model Independent → Compilation and Modifications → Script Inserts and Modifications, and in Modifications for the Model, respectively. However, the job now fails at the compilation stage (xgigm000.xgigm.d11275.t113711.comp.leave):

*

Starting script : qsmain
Starting time : Sun Oct 2 11:40:14 BST 2011

*

Files copied to modset /home/n02/n02/watson/umui_runs/xgigm-275113707 are:

qsmain(6.233): * Starting Fortran77 nupdate

Completed with 1 error(s) and 5 warning(s).
/work/n02/n02/watson/tmp/tmp.hector-xe6-18.2325/modscr_xgigm/qsmain: Model xgigm - Error(s) in Fortran77 code update step
*

Ending script : qsmain
Completion code : 25
Completion time : Sun Oct 2 11:40:34 BST 2011

*

And towards the end of the .comp.leave file:

qsmain: %UPDATEF77% output follows:-

PUMSCM: version 1.21 (2003/06/19). © Met Office
Warning : Multiple insertions after line UPS0F601.100 in deck ATMSTEP2
! P. Selwood. UPS0F601.100
Warning : Multiple insertions after line AZG0F503.9 in deck ATMSTEP2

& first_atmstep_call AZG0F503.9

Warning : Multiple insertions before line ATMSTEP2.82 in deck ATMSTEP2
! Code Description: ATMSTEP2.82
Error : Cannot insert text after deleted line ACA1F501.31 in deck ATMSTEP2 (Modset DEBUG)
Processed ATMSTEP2 : 1 error(s) , 3 warning(s).

Warning : Multiple insertions before line CHKIDEAL.30 in deck CHKIDEAL
! CHKIDEAL.30
Processed CHKIDEAL : 0 error(s) , 1 warning(s).

Warning : Multiple insertions before line SETCONA2.36 in deck SETCONA2
!LL SETCONA2.36
Processed SETCONA2 : 0 error(s) , 1 warning(s).
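
In a long .comp.leave file the single error is easy to lose among the warnings. A rough Python sketch for listing only the errors and the per-deck totals, assuming the PUMSCM message wording is exactly as quoted above; the function names `update_errors` and `deck_totals` are made up for illustration:

```python
import re

# Both patterns are inferred from the output quoted in this ticket and
# may not cover every message PUMSCM can produce.
ERROR_LINE = re.compile(
    r"^\s*Error\s*:\s*(?P<msg>.*?)\s+in deck (?P<deck>\w+)"
    r"(?:\s+\(Modset (?P<modset>\w+)\))?\s*$"
)
TOTALS_LINE = re.compile(
    r"^\s*Processed (?P<deck>\w+)\s*:\s*(?P<errors>\d+) error\(s\)\s*,"
    r"\s*(?P<warnings>\d+) warning\(s\)"
)


def update_errors(text):
    """Return (deck, modset, message) for every 'Error :' line."""
    found = []
    for line in text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            found.append((match["deck"], match["modset"], match["msg"]))
    return found


def deck_totals(text):
    """Return {deck: (errors, warnings)} from the 'Processed ...' lines."""
    totals = {}
    for line in text.splitlines():
        match = TOTALS_LINE.match(line)
        if match:
            totals[match["deck"]] = (int(match["errors"]), int(match["warnings"]))
    return totals


# Lines copied from the nupdate output quoted above.
sample = """\
Warning : Multiple insertions before line ATMSTEP2.82 in deck ATMSTEP2
Error : Cannot insert text after deleted line ACA1F501.31 in deck ATMSTEP2 (Modset DEBUG)
Processed ATMSTEP2 : 1 error(s) , 3 warning(s).
Warning : Multiple insertions before line CHKIDEAL.30 in deck CHKIDEAL
Processed CHKIDEAL : 0 error(s) , 1 warning(s).
"""

for deck, modset, msg in update_errors(sample):
    print(f"{deck} (modset {modset}): {msg}")
print(deck_totals(sample))
```

Here the only error comes from modset DEBUG trying to insert after a line in ATMSTEP2 that another modset has already deleted, so the conflict is between the debug mod and the job's existing mods rather than in the N96 changes.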

Does anyone have any suggestions about how I should go about identifying the source of the seg faults?

Cheers,

Peter

Change History (9)

comment:1 Changed 9 years ago by willie

Hi Peter,

There is a standard N96 Global job for HECToR : user 'umui' job 'xczia'. This might be a quicker approach.

I can't see your leave files. Could you do

chmod -R g+rx .

(note the dot) in your home directory and also in your /work.
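
To confirm that the change has taken effect everywhere, the group permission bits can be checked afterwards. A small Python sketch, assuming a POSIX file system; the function name `missing_group_access` is hypothetical, and the demo runs on a throwaway directory rather than on a real home or /work area:

```python
import os
import stat
import tempfile


def missing_group_access(root):
    """Return paths under root that the group cannot read (or, for directories, enter)."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in candidates:
            info = os.lstat(path)
            if stat.S_ISLNK(info.st_mode):
                continue  # a symlink's own mode bits are not what matters
            needed = stat.S_IRGRP
            if stat.S_ISDIR(info.st_mode):
                needed |= stat.S_IXGRP  # directories also need search permission
            if (info.st_mode & needed) != needed:
                missing.append(path)
    return missing


# Demo: a private temporary directory holding one private file.
demo_root = tempfile.mkdtemp()
demo_file = os.path.join(demo_root, "example.leave")
with open(demo_file, "w") as handle:
    handle.write("xgigm: Run failed\n")
os.chmod(demo_file, 0o600)

for path in missing_group_access(demo_root):
    print("group cannot read:", path)
```

An empty result for both the home directory and the /work directory would suggest that the leave files are now visible to other members of the group.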

Regards,

Willie

comment:2 Changed 9 years ago by watson

Dear Willie,

I ran the chmod command as you said.

I have tried converting xczia to L60 (my job xgigq), but without success so far: I eventually ran into another seg fault which I don't know how to fix. I made changes based on a difference against my job xgigm, which is set up to be L60 (changing level numbers and ancillary files appropriately). I also switched off the sulphur cycle and aerosol modelling (since I don't have the requisite L60 ancils) and the global river routing model (which causes an error in the dump reconfiguration, with an error message relating to the number of model levels for some reason).

Then running the model gives the following error (xgigq000.xgigq.d11276.t165133.leave):

xgigq: Starting run
_pmii_daemon(SIGCHLD): [NID 01586] [c3-0c0s6n2] [Mon Oct 3 17:31:15 2011] PE 8 exit signal Segmentation fault
[NID 01586] 2011-10-03 17:31:15 Apid 1283891: initiated application termination
diff: /work/n02/n02/watson/tmp/tmp.hector-xe6-13.30779/xgigq.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/watson/xgigq/xgigq.thist to backup thist file /work/n02/n02/watson/xgigq/xgigq.thist_keep
xgigq: Run failed
*

Ending script : qsexecute
Completion code : 137
Completion time : Mon Oct 3 17:31:18 BST 2011

*

The diff between xgigm and xczia showed up quite a lot of differences in the model set-ups. I would like to keep my job as close as possible to the job that xgigm was copied from, so it would be preferable to try to fix xgigm.

Cheers,

Peter

comment:3 follow-up: Changed 9 years ago by willie

Hi Peter,

The main problem is that you have no output. You are using SHEKAR_MOD/flush.mf77 which calls flush_all_pp, but I can't see it defined in the modset. It would be useful if you could replace this with my /home/n02/n02/wmcginty/modsets/flush.f77 which is simpler.

A second issue is that there are STASH errors. If you go to STASH > user STASH and press the prognostics button, a number of errors will appear. These stem from the ~sosprey/umui/Ian_Edmond user STASH master file. These ought to be corrected.

Finally you should remove the /work xgigm directory and do a clean build and run.

I note that you have solved the modset errors you were getting.

Regards,

Willie

comment:4 in reply to: ↑ 3 Changed 9 years ago by watson

Dear Willie,

Thanks. I have found that if I remove the ~sosprey/umui/Ian_Edmond STASH file, the reconfiguration fails. I will need to ask Scott Osprey about the reason for the errors when he gets back to work in a week's time.

Keeping the STASH file in place and allowing the model to run gives the error below, which comes after information about the vertical velocity at the first time step is output (xgigm000.xgigm.d11281.t174121.leave):

==============================================

initial Absolute Norm : 170864483253.10263
GCR( 2 ) failed to converge in 100 iterations.
Final Absolute Norm : 16904499.985889442
==============================================

WARNING *
Conservation enforcement failed
Run continuing using best estimate
WARNING *

non-conservation for field 1.

WARNING *
Conservation enforcement failed
Run continuing using best estimate
WARNING *

non-conservation for field 2.

WARNING *
Conservation enforcement failed
Run continuing using best estimate
WARNING *

non-conservation for field 3.

WARNING q_POS : 5877 points were less than 0.E+0 and the scaling factor has been reset to 1
WARNING q_POS : VALUES RESET NON CONSERVATIVELY MANNER

I also tried running the model with the vertical velocity set to zero in the reconfiguration, but this did not change the above error.

comment:5 Changed 9 years ago by willie

Hi Peter,

I have looked at the Ian Edmond STASH again: the grid codes are the same as the STASH items that appear in later versions of the UM. So it looks like 6.1 needs a mod set to handle these.

You also need to switch off MASS archive - this is a Met Office facility. See Sub model Indep > Post proc > using the MASS archive. Do you also need post processing? This can be switched off at Post Proc > Main Switch and questions. At the very least, you should select "no archiving system" on that page.

At the moment the model is failing on time step 1 with the GCR(2) failed to converge message. I think there are other issues that are leading to this problem.

Regards,

Willie

comment:6 Changed 9 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

comment:7 Changed 9 years ago by watson

Thanks Willie. I don't seem to be able to do anything in the MASS archiving window - is this automatically disabled for jobs on Hector? I switched off post-processing - I thought I would try submitting the job again just to see if it made any difference, but it fails with the same error as before.

I checked again that all the ancillaries I am using are N96 - however, I have not changed the spectral files (under Atmos → Ancils → Other → Spectral files) from those used by the N48 run I copied, since I was told by someone that the resolution shouldn't matter for these. Also I can't open them in xconv to check them. Is what I was told correct or might I need to change these files, in which case do you know where HadGEM1 N96 spectral files would be located?

comment:8 Changed 9 years ago by willie

Peter,

If you switch off the post processing this disables the MASS page too. The spectral files are ASCII text: they're just Fortran namelists.
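
Since xconv cannot open them, a quick way to confirm that a file really is plain ASCII and to see which namelist groups it holds is a few lines of Python. This is only a sketch: it assumes groups start with `&NAME` at the beginning of a line, the function name `namelist_groups` is hypothetical, and the group and variable names in the demo file are invented, not taken from a real spectral file:

```python
import re
import tempfile

# A namelist group opens with "&NAME"; older files may close with "&END".
GROUP_START = re.compile(r"^\s*&(\w+)", re.MULTILINE)


def namelist_groups(path):
    """Return the namelist group names in an ASCII file.

    Raises UnicodeDecodeError if the file is not plain ASCII text.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    text = raw.decode("ascii")
    return [name for name in GROUP_START.findall(text) if name.upper() != "END"]


# Demo on a made-up file; DEMO_GROUP, SECOND and their variables are invented.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("&DEMO_GROUP\n  N_BAND = 6,\n/\n&SECOND\n  X = 1.0\n&END\n")
    demo_path = handle.name

print(namelist_groups(demo_path))
```

A file that decodes cleanly can then simply be read in a text editor or with `less`.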

Regards,

Willie

comment:9 Changed 9 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed