Opened 6 years ago

Closed 6 years ago

#1067 closed help (answered)

Failure of model run using IAU scheme - segmentation fault

Reported by: cflee Owned by: um_support
Component: UM Model Keywords: IAU, perturbation
Cc: Platform: HECToR
UM Version: 6.6.3

Description

I'm trying to run a series of short (32 day) jobs, using the IAU scheme to perturb initial conditions. Each run uses a different set of initial conditions (perturbations to U and V only), to represent an ensemble around an un-perturbed run. Some of the perturbed runs have worked; others have not! The latest attempt (/home/n02/n02/cflee/um/umui_out/xhjqt000.xhjqt.d13131.t193852.leave) has failed with a 'PE RANK 112 exit signal Segmentation fault'. Is this because the model is becoming unstable as a result of the perturbations to U and V that I've introduced?

The same run gave a different (and more typical) fault after an earlier attempt - this time segmentation faults 32 and 0 (xhjqt000.xhjqt.d13130.t104145.leave).

Many thanks,

Chris

Change History (9)

comment:1 Changed 6 years ago by willie

Hi Chris,

Both runs fail for the same reason: NaNs appear after 146 time steps. If you run Check Setup in the UMUI, there is a STASH error: the T6HDMmon time profile is incorrect. You can ignore the user STASH error, which is due to the epflux606 STASH master file. If you go to the STASH page and verify STASH, there are a large number of usage profile errors. These can be eliminated by defining a meaning sequence: go to Atmos > Control > Post Proc. dumping and meaning and tick the "define a meaning sequence" button.

Your start dump seems OK. If the errors persist after this, then you may need to halve the time step.

Regards,

Willie

comment:2 Changed 6 years ago by cflee

Afternoon Willie,

Thank you for the suggestions. I deleted the T6hDMmon time profile (it wasn't being used) and defined a meaning sequence, but I'm still getting the same problem. Unfortunately, halving the time step is not really feasible for the sort of experiments we want to do; I think I'm going to have to change the experimental set-up instead. Can you tell me how I can look at the output that contains the NaNs?

Thanks again,

Chris.

comment:3 Changed 6 years ago by willie

Hi Chris,

We can now run for 2376 time steps and the NaNs in the convergence have disappeared. In time steps 65 to 69 it fails to converge in 100 iterations but seems to recover. In the .leave file, the error is now

Failure in INITTIME
:
INITTIME: Model calendar doesn't match atmos dump

:
  ANCIL_REFTIME set by User Interface =  1958,  12,  1,  3*0

Your start dump is dated 1979/01/01, so I think you need to change the ancillary reference time to match the start dump (UMUI Ancillary > In file related).

Regards,

Willie

comment:4 Changed 6 years ago by cflee

Morning Willie,

I think I've confused the situation by doing another run with a different set of perturbations! I've just done another run with the original files that were generating the segmentation faults, and the faults have been repeated. Is there any way I can look at the NaNs for this job: xhjqh000.xhjqh.d13135.t101210.leave ?

Thanks again,

Chris.

comment:5 Changed 6 years ago by willie

Hi Chris,

I searched the .leave file for "NaN". There are two main causes of NaNs: either they are present in your input data or they are generated by the algorithm.

If you are using nonstandard ancillary files, these should be checked by cumf'ing them with themselves:

cumf ancfile ancfile

The summary output should show no differences. If the inputs are OK then it is down to the stability of the algorithm.
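For anyone wanting to repeat the search of the .leave file, here is a minimal Python sketch. It is not part of the UM tools; the function name `find_nan_lines` is made up for illustration, and it assumes NaNs are printed as a standalone "NaN" token (in any letter case).

```python
import re

# Assumption: NaNs show up as a standalone "NaN" token, in any letter case.
# The look-arounds skip ordinary words that merely contain "nan".
NAN_TOKEN = re.compile(r"(?<![A-Za-z])nan(?![A-Za-z])", re.IGNORECASE)


def find_nan_lines(path):
    """Return (line_number, text) pairs for each line containing a NaN token."""
    hits = []
    # errors="replace" keeps the scan going if the file holds odd bytes.
    with open(path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if NAN_TOKEN.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling `find_nan_lines` on the path of a .leave file returns the matching lines with their line numbers, so the first entry shows where the NaNs first appear in the output.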

Regards

Willie

comment:6 Changed 6 years ago by cflee

Afternoon Willie,

I've managed to narrow down the problem a little bit.
I've performed 10 runs using ancillary files with slight perturbations to U and V, and four of these ran to completion. Unfortunately the velocities of these 4 runs are not clustered around the original model run, though they do cluster around one another until they start to diverge at about 10 days. The typical bias is around 10 m/s, compared to the original model run.
Based on the above, I think the problem is related to the implementation of the ancillary file that I'm asking the model to use, probably due to user error when making the ancillary files! I've plotted the U and V values in my ancillary files against the original model run, and they correspond. I've followed the xancil instructions, though there are some gaps so I haven't been able to check everything that I've done. Would you be able to give me any pointers?
The ancillary files I've created are in: '/cflee/work/xhjqh/ssw_one/perturb/' . The files with '..anc*.nc' are the ones I've created using xancil (the xancil job is saved as 'ancil_file_create.job' in the same directory). The file ssw_one_anc9.nc is an un-perturbed version of the original model run, but using this makes the model unstable.
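For reference, the bias mentioned above can be expressed as the ensemble mean minus the control run at each output time. Below is a minimal Python sketch of that calculation. The function name `mean_bias` is hypothetical, the numbers are purely illustrative and are not taken from these runs, and it assumes each run has already been reduced to one wind value per output time.

```python
from statistics import mean


def mean_bias(control, members):
    """Ensemble mean minus control at each output time.

    control: list of values from the un-perturbed run, one per output time.
    members: list of lists, one per perturbed run, same length as control.
    """
    biases = []
    for t, reference in enumerate(control):
        ensemble_mean = mean(member[t] for member in members)
        biases.append(ensemble_mean - reference)
    return biases


# Illustrative values only (m/s); not output from the runs discussed here.
control = [20.0, 21.0, 22.0]
members = [[30.0, 31.5, 32.0], [30.0, 30.5, 32.0]]
print(mean_bias(control, members))
```

A bias that is already large at the first output time, before the members diverge, would point towards the way the increments are applied rather than towards growth of the perturbations.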

Many thanks,

Chris.

comment:7 Changed 6 years ago by willie

  • Keywords IAU, perturbation added
  • Status changed from new to assigned

Hi Chris,

I checked the '..anc*.nc' ancillary files and there are no NaNs present. I am not familiar enough with the IAU method to comment further.

Regards,

Willie

comment:8 Changed 6 years ago by cflee

Afternoon Willie,

Do you know of anyone who knows about the IAU method? Or do you know the name of the parent routine used by the IAU so that I can have a look at it?

Thanks again,

Chris.

comment:9 Changed 6 years ago by willie

  • Resolution set to answered
  • Status changed from assigned to closed