Opened 7 years ago

Closed 6 years ago

#1054 closed help (fixed)

segmentation fault in HadGEM2 v 6.6.3

Reported by: jcrook
Owned by: willie
Component: UM Model
Keywords:
Cc:
Platform: HECToR
UM Version: 6.6.3

Description

I have a run xgezd (user jcrook) which is based on the RCP4.5 job xgtee (user eelsj). The run xgtee was started from 2010 and has been going for 30 years and is still going fine. The run xgezd is using the xgtee executable and the xgtee start dump for Jan 2020. The only difference between the runs is that I have increased the albedo of C3 and C4 grasses by 0.08. The run xgezd ran for 3.5 years and then crashed. I have had a look at some of the diagnostics (e.g. 1.5 m temperature, some of the radiation fields, soil moisture) and cannot see anything that looks odd. I can't see anything obvious in the .leave file either. Any ideas where to look?

Change History (23)

comment:1 Changed 7 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Julia,

There is a segmentation fault after 11853 time steps. Other differences between the jobs are

  • The sulphur cycle chemical oxidants differ
  • The TTRIFMon STASH profile has been replaced by TTRIFdyn

If you let me have read permission on the core file in /work/n02/n02/jcrook/xgezd I may be able to get a little further.

Regards,

Willie

comment:2 Changed 7 years ago by jcrook

Both eelsj and I are manually updating the sulphur cycle chemical oxidants ancillary file every decade, so xgtee did have the same ancillary when it was running from 2020 to 2030. The TTRIFMon profile was created after I took a copy of xgtee and is actually set up the same as TTRIFdyn, so these should not make a difference.

I used gdb to look at the core file and it implied there was a problem with the dynamics advection but I don't know what to do about it:

Program terminated with signal 11, Segmentation fault.
#0 0x0000000000b96c99 in interpolation (data_in1=Cannot access memory at address 0x20007ffff5a34008
)

at /home/n02/n02/eelsj/xgtee/ummodel/ppsrc/UM/atmosphere/dynamics_advection/interpolation.f90:1053

1053 n_sendto(irecv) = n_sendto(irecv) + 1

I have given everyone read access to core.
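
For reference, the gdb output above shows only the innermost frame (#0). A full backtrace can be requested non-interactively with gdb's `--batch` and `-ex` options. The following is a minimal sketch, assuming gdb is available on the PATH; the function names and the two paths are hypothetical placeholders, not the actual locations of the executable or the core file.

```python
import subprocess


def gdb_backtrace_command(executable, core_file):
    """Build a non-interactive gdb command that prints the backtrace."""
    # --batch: exit once the given commands have run.
    # -ex bt : run the gdb "bt" (backtrace) command.
    return ["gdb", "--batch", "-ex", "bt", executable, core_file]


def run_backtrace(executable, core_file):
    """Run gdb and return whatever it printed (needs gdb on the PATH)."""
    result = subprocess.run(
        gdb_backtrace_command(executable, core_file),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Placeholder paths (assumptions): substitute the real executable and core.
command = gdb_backtrace_command("path/to/model.exe", "path/to/core")
print(" ".join(command))
```

Calling `run_backtrace(...)` with the real paths would return the text of all frames, which may show how the model reached the failing routine.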

comment:3 Changed 7 years ago by willie

Hi Julia,

I've looked through previous tickets: the one most likely to be relevant is #596. Perhaps you could consider switching off "bit comparable NRUN required" in section 19.

You might also consider returning the albedos to their original values to see if that works.

Regards,

Willie

comment:4 Changed 7 years ago by jcrook

We copied a job from someone else as a basis for all our jobs and that had this 'bit comparable NRUN required' switched on. We have not changed anything to do with vegetation until now that I want to modify albedos. This means our control run (xgtee) has this switch on also. We will be running each job for 90 model years and we have to modify the sulphur chemical oxidants file manually every decade, which means we have to do an NRUN at the start of every decade and then do a CRUN to the end of the decade. I am concerned that if I switch this off it won't give the same results as with it switched on, in which case I wouldn't be able to use xgtee as my control run.

I have set a run going which is a copy of xgezd but with the albedos set back to their original values. I will let you know how far it gets - I have no reason to believe it won't run ok.

comment:5 Changed 7 years ago by jcrook

Well I was proved wrong. My run xgeze did crash with a segmentation fault but after only 1.5 years. This time it crashed in a different place but it still looks like an invalid memory address:
Core was generated by `/work/n02/n02/eelsj/xgtee/bin/xgtee.exe'.
Program terminated with signal 11, Segmentation fault.
#0 calc_3d_cca (np_field=Cannot access memory at address 0x20007ffff4d1d608
)

at /home/n02/n02/eelsj/xgtee/ummodel/ppsrc/UM/atmosphere/convection/calc_3d_cca-cal3dcca.f90:126

126 P_CLOUD_TOP = p_layer_boundaries(I,CLOUD_TOP(I))

I have now updated the STASH to match xgtee and restarted the run from Jan 2020. So there should be no differences now between xgeze and xgtee running from 2020.

comment:6 Changed 6 years ago by jcrook

The latest on this is that xgeze ran for 6.5 years and then stopped because it thought the user id was wrong - a problem we get from time to time and we just start it again as a CRUN and it's fine. I didn't want to do that as I don't want to just run what Lawrence is running - I want to get on with my own experiments. So I have now made a modification to soil albedo in the ancillary file and set it going again from 2020. This is running ok so far (currently at about 2028). However, I also made the STASH match for xgezd and started that again from 2020. It has run to mid 2024 and crashed at the same point it did the first time. Lawrence has also had a run that crashed in a similar way after about 20-30 years, and it is an RCP4.5 run which we copied from someone else who ran it ok on Monsoon. We've only changed the STASH and completely recompiled and reconfigured since then. Apart from STASH changes this would have been the same as the runs that were done for CMIP5.

comment:7 Changed 6 years ago by willie

Hi Julia,

You could try switching on report STASH messages in Output Choices. Which job is "crashing"?

Regards

Willie

comment:8 Changed 6 years ago by jcrook

I tried just starting xgezd with an NRUN from a couple of months before xgezd crashed - that made no difference - it still crashed in Jul 2024. I have now switched on report STASH messages and started it again from Apr 2024.

comment:9 Changed 6 years ago by jcrook

xgezd has crashed in the same place, i.e. Jul 2024. Will there be something in the latest .leave file to indicate what is happening now that I have report STASH messages on?

comment:10 Changed 6 years ago by willie

Hi Julia,

The STASH messages for xgezd have not revealed anything. The job is running for 111 days and then has a segmentation fault. To get further you could repeat the run with more debug information:

  • in Output reports, switch subroutine timer diagnostics on and tick the extra diagnostics button,
  • in Scientific sections > section 13, select DIAG_PRN, tick the "flush buffer if run fails" box, set the printing frequency to every time step, and change the vertical velocity threshold from 0.4 to 10.0

I think you need to recompile the executable for these to take effect. This should give some helpful information for PE 86.

You could also try a run starting from the next dump, which I think is xgezda.dam4710. If the problem is model stability then it may run; if not, it should fail immediately.

I hope that helps.

regards

Willie

comment:11 Changed 6 years ago by jcrook

Sorry I've been on holiday so not done anything with this for a while. I have found subroutine timer diagnostics tick box in output choices and selected this but which extra diagnostics button do you mean? I have selected the other settings in section 13 of scientific sections as you suggest and will start this run again from the xgezda.dam4710 dump file.

comment:12 Changed 6 years ago by jcrook

It crashed immediately with segmentation fault as before. Is there anything in the .leave file this time?

comment:13 Changed 6 years ago by willie

Hi Julia,

The leave file shows that it runs for 1518 time steps, becoming unstable at the end. In the last time step you have,

  GCR( 2 ) failed to converge in  100  iterations.

There are two ways to solve this problem. One is to halve the time step: Scientific sections > time stepping and change 72 to 144; the other is to increase the number of iterations allowed, in the hope that it will converge eventually. This can be done in Scientific Sections > section by section > section 10 dynamical solver: change 100 to 200, say.
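
To put numbers on the first option, here is a minimal sketch of the arithmetic. It assumes that the 72 in the time stepping panel is the number of time steps per model day; the ticket does not state the period length, so `period_seconds` is an assumption, and the function name is illustrative only.

```python
SECONDS_PER_DAY = 86_400  # assumed period length: one model day


def timestep_seconds(steps_per_period, period_seconds=SECONDS_PER_DAY):
    """Length of one model time step in seconds."""
    if steps_per_period <= 0:
        raise ValueError("steps_per_period must be positive")
    return period_seconds / steps_per_period


# Compare the original setting with the doubled one.
for steps in (72, 144):
    seconds = timestep_seconds(steps)
    print(f"{steps} steps per period: {seconds:.0f} s per step "
          f"({seconds / 60:.0f} min)")
```

Under that assumption, changing 72 to 144 shortens each step from 1200 s to 600 s, so the same simulated period needs twice as many steps and roughly twice the run time.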

I hope that helps.

Willie

comment:14 Changed 6 years ago by jcrook

Thanks Willie

I noticed there is also a failure to converge a few timesteps before the end and it usually converges in a lot less than 100 iterations. If I change the timestep frequency will this impact the output? We want to compare this run with a control run so do you think we would need to change the frequency in the control run too even though it hasn't crashed yet?
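
Convergence failures like these can be located by searching the .leave file for the message text quoted in comment 13. The following is a minimal sketch, assuming the message always has the form shown there; the function name is illustrative, and the meaning of the number in brackets is not stated in this ticket, so it is returned as an uninterpreted index.

```python
import re

# Matches e.g. "  GCR( 2 ) failed to converge in  100  iterations."
GCR_FAILURE = re.compile(
    r"GCR\(\s*(\d+)\s*\)\s+failed to converge in\s+(\d+)\s+iterations"
)


def find_gcr_failures(lines):
    """Return (line_number, gcr_index, iterations) for each failure message."""
    failures = []
    for line_number, line in enumerate(lines, start=1):
        match = GCR_FAILURE.search(line)
        if match:
            failures.append(
                (line_number, int(match.group(1)), int(match.group(2)))
            )
    return failures


# Sample input: the message from comment 13 plus an unrelated line.
sample = [
    "  GCR( 2 ) failed to converge in  100  iterations.",
    "some other model output",
]
for hit in find_gcr_failures(sample):
    print(hit)
```

For a real file, the open file object can be passed directly, for example `find_gcr_failures(open(path, errors="replace"))`, where `path` is the .leave file of interest.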

comment:15 Changed 6 years ago by jcrook

Also are these things that can be changed without recompiling?

comment:16 Changed 6 years ago by jcrook

I changed the number of iterations to 200 and started the run again as before. This time there are no convergence problems but there is still a segmentation fault.
_pmiu_daemon(SIGCHLD): [NID 00041] [c0-1c2s4n3] [Mon Jun 3 12:33:29 2013] PE RANK 22 exit signal Segmentation fault
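
The exit line above names the failing PE, which helps when deciding which PE's output to read. A minimal sketch for pulling the node id, PE rank and signal out of such a line follows; it assumes the line layout is exactly as quoted here, and the function name is illustrative.

```python
import re

# Layout assumed from the single example line quoted in this ticket.
EXIT_LINE = re.compile(
    r"\[NID (?P<nid>\d+)\].*?PE RANK (?P<rank>\d+) exit signal (?P<signal>.+)$"
)


def parse_exit_line(line):
    """Return the node id, PE rank and signal name, or None if no match."""
    match = EXIT_LINE.search(line.strip())
    if match is None:
        return None
    return {
        "nid": match.group("nid"),
        "rank": int(match.group("rank")),
        "signal": match.group("signal"),
    }


line = ("_pmiu_daemon(SIGCHLD): [NID 00041] [c0-1c2s4n3] "
        "[Mon Jun 3 12:33:29 2013] PE RANK 22 exit signal Segmentation fault")
print(parse_exit_line(line))
```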

comment:17 Changed 6 years ago by willie

Hi Julia,

Try halving the time step. You don't need to recompile for this. It will carry out more computations so there is a chance for arithmetical error to grow.

Regards,

Willie

comment:18 Changed 6 years ago by jcrook

I put the number of iterations back to 100 and halved the timestep. The run then got through July and August but stopped in September. It is not a segmentation fault but it got terminated. Is this because it ran out of time, and if I just set it off as a continuation run will it continue?

comment:19 Changed 6 years ago by willie

Hi Julia,

If you look at the bottom of the last xgezd .leave file, you can see that it has run out of time. You are already using resubmission. You just need to make sure that the chunks can be completed in the allotted time.

Regards

Willie

comment:20 Changed 6 years ago by jcrook

I doubled the time requested and started the run again. This time it didn't get as far as before but has been terminated. It doesn't look like the time allocated is too small.

comment:21 Changed 6 years ago by willie

Hi Julia,

The NRUN in xgezd000.xgezd.d13162.t160947.leave has completed normally. If you want to do a CRUN and avoid running out of time you will need to increase the resubmit job time: just press NEXT when you're on the job submission page.

Regards

Willie

comment:22 Changed 6 years ago by jcrook

The run in my previous comment stopped because HECToR went down, apparently. So then I started it again and this time it has run its 3 months (this is the .leave file you refer to above). I am now going to try putting the timestep back again because it takes so long to run as is. If it crashes again, I'll have to halve the timestep again.

Thanks for your help.

comment:23 Changed 6 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed