Opened 6 years ago

Closed 6 years ago

#1158 closed help (fixed)

Segmentation fault running with customised SST and sea ice ancils

Reported by: jscreen
Owned by: willie
Component: UM Model
Keywords: sst sea ice ancillary segmentation fault
Cc:
Platform: HECToR
UM Version: 6.6.3

Description

Hi

I am trying to set up some AMIP-style runs with HadGEM2 using customised SST and sea ice ancillaries. My starting point was a job under the umui username. I have made some changes to this job (mostly switching off the updating of various ancils) and successfully run the model (xjaki). Next, I wanted to change the SST and sea ice ancillaries. The job xjakj fails with a segmentation fault and the following error:

_pmiu_daemon(SIGCHLD): [NID 01692] [c13-1c0s1n0] [Thu Oct 24 09:50:14 2013] PE RANK 84 exit signal Segmentation fault
[NID 01692] 2013-10-24 09:50:14 Apid 6058231: initiated application termination
diff: /work/n02/n02/jscreen/tmp/tmp.hector-xe6-14.15250/xjakj.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/jscreen/xjakj/xjakj.thist to backup thist file /work/n02/n02/jscreen/xjakj/xjakj.thist_keep
xjakj: Run failed

See: /home/n02/n02/jscreen/output/xjakj000.xjakj.d13297.t104411.leave .

I have given read access to this file and the ancils below:

/work/n02/n02/jscreen/ancils/sst_arcoga_1980-1999
/work/n02/n02/jscreen/ancils/sic_arcoga_1980-1999

The only differences between the working job (xjaki) and the failing job (xjakj) are the SST and sea ice ancillaries. The working job uses the AMIP ancillary files provided on HECToR (e.g. sst_amip_1870-2008). The failing job uses my custom-made ancillaries above. My ancils were created in xancil, and the data originally came from the output of coupled historical runs of HadGEM2-ES. I have regridded the coupled model output (on the ocean grid) to the N96 atmospheric grid using xconv and interpolated over land to avoid problems with differing land-sea masks. The ancils look okay to me, but maybe I am missing something?

I've spent 2 days trying to figure out the problem, but to no avail. I've tried implementing a number of changes to the ancils and the model setup, but each time the model crashes (quickly, I'm only attempting to run for 1 model day). I've tried various things (based on previous threads on the webpage) including:

1) Using AMIPII method (i.e. not specifying the sea ice depth)
2) Using monthly updating rather than daily updating (as the ancils are monthly means)
3) Setting min sea ice fraction to 0.3 in xancil
4) Setting SST over sea ice to 271.35 in xancil
5) Starting from a reconfigured initial dump
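
As an illustration of the kind of consistency that items 3 and 4 aim for, here is a minimal Python sketch that flags suspicious points in a pair of SST and sea ice fields. It is not part of xancil or the UM. The function name, the flat-list data layout and the exact rules are assumptions made for this example; only the thresholds 0.3 and 271.35 come from the list above.

```python
import math

FREEZING_SST_K = 271.35  # SST over sea ice, as quoted in item 4 above


def check_sst_ice(sst, ice, min_ice=0.3):
    """Return (index, reason) pairs for grid points that look inconsistent.

    sst: SST values in kelvin; ice: sea ice fractions in the range 0..1.
    Both are flat sequences of equal length (a hypothetical layout).
    """
    if len(sst) != len(ice):
        raise ValueError("sst and ice must have the same length")
    problems = []
    for i, (t, f) in enumerate(zip(sst, ice)):
        if not (math.isfinite(t) and math.isfinite(f)):
            problems.append((i, "non-finite value"))
        elif not 0.0 <= f <= 1.0:
            problems.append((i, "ice fraction outside 0..1"))
        elif 0.0 < f < min_ice:
            # Assumed rule: non-zero ice below the minimum is suspicious.
            problems.append((i, "ice fraction below minimum"))
        elif f > 0.0 and t > FREEZING_SST_K:
            # Assumed rule: SST under ice should not exceed the freezing value.
            problems.append((i, "SST above freezing under ice"))
    return problems
```

Calling `check_sst_ice(sst, ice)` on values read from the two ancillaries (by whatever reader is available) gives a quick list of points worth inspecting in xconv. An empty list does not prove the files are valid for the model.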

Perhaps you can do a better job than me of identifying the cause of the problem and stop me pulling my hair out!

Cheers,
James

Change History (9)

comment:1 Changed 6 years ago by willie

Hi James,
Could you give me read privilege on your .leave files please:

cd ~
chmod -R g+rX .

Regards

Willie

comment:2 Changed 6 years ago by jscreen

Hi Willie

As far as I can tell you should have read access to the .leave files

James

comment:3 Changed 6 years ago by willie

Hi James,

You can check the ancillaries by cumf'ing them with themselves. The summary files should show no differences; any NaNs in the files will show up as differences.
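
The reason a self-comparison can expose NaNs is that, under IEEE 754 floating-point rules, a NaN never compares equal to anything, including itself. The short Python sketch below shows that general behaviour only. It makes no claim about how cumf is implemented, and `self_differences` is a hypothetical helper name.

```python
import math


def self_differences(values):
    """Return the indices where a value does not compare equal to itself."""
    return [i for i, v in enumerate(values) if v != v]


# A made-up field with one NaN in it.
field = [271.35, 280.0, math.nan, 290.5]
print(self_differences(field))  # prints [2]
```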

If the ancillaries are OK, then to get further we need to switch on some debug. Rebuild the executable with

  • on the output options page, switch on the subroutine timer and select operational status messages
  • in the scientific sections, section 13, push DIAG_PRN and select "flush buffer if run fails" and "operational prints".

This will allow us to see how many time steps it does before failing. Then repeat the run.

Could you also give me read permission on the work directory, please:

cd /work/n02/n02/jscreen/xjakj
chmod -R g+rX .

Regards

Willie

comment:4 Changed 6 years ago by jscreen

Willie

I've double-checked the ancillaries and there are no NaNs, so I don't think this is the cause of the problem. You can see the output from cumf'ing in my home directory.

I've switched on the debugging options you suggested, rebuilt the executable and submitted the run. It fails as before. The .leave file is:

/home/n02/n02/jscreen/output/xjakj000.xjakj.d13301.t113621.leave

I've given you read access to the work directory too.

Cheers,
James

comment:5 Changed 6 years ago by willie

Hi James,

Thanks. So it fails in the first time step with NaNs in the theta_start L2 norm. Have you looked at the SST ancil in xconv? Do the temperature values look reasonable?
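
For context on why a NaN norm is such a blunt symptom: a single non-finite value anywhere in a field makes its L2 norm non-finite, so the norm alone does not say where the bad value is. The Python sketch below illustrates this with a generic norm. How the UM actually computes the theta_start norm is not shown in this ticket, and both function names are hypothetical.

```python
import math


def l2_norm(values):
    """Square root of the sum of squares of a flat sequence."""
    return math.sqrt(sum(v * v for v in values))


def first_non_finite(values):
    """Index of the first NaN or infinity, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


# A made-up potential temperature field with one bad point.
theta = [300.0, math.nan, 310.0]
print(math.isnan(l2_norm(theta)))  # prints True
print(first_non_finite(theta))     # prints 1
```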

I don't have read permissions on the ancils or the work directory - perhaps you could give me permission on the directory above, recursively.

Regards,

Willie

comment:6 Changed 6 years ago by jscreen

Willie

Yes, the SST (and sea ice) values look reasonable. Feel free to have a look yourself.

You should have read and executable permissions for the work directory and the ancils.

James

comment:7 Changed 6 years ago by willie

Hi James,

I've tried reducing the time step and increasing the number of iterations from 50 to 200 in the solver (Section 10) to no avail. Using the xjaki executable (xjava) did not help either.

Failure in time step 1 suggests inconsistent initial conditions.

I don't know the physics/meteorology of what you are doing, so cannot offer advice. It may be useful to compare with a standard job (user umui, job xgadd) and reconsider the differences.

Sorry if this is not very helpful.

Regards,

Willie

comment:8 Changed 6 years ago by annette

  • Owner changed from um_support to willie
  • Status changed from new to assigned

comment:9 Changed 6 years ago by willie

  • Resolution set to fixed
  • Status changed from assigned to closed