Opened 4 months ago

Last modified 7 weeks ago

#2443 accepted help

u-aw776: problem including tracers in lateral boundary conditions of nesting suite

Reported by: nx902220 Owned by: willie
Priority: normal Component: UM Model
Keywords: craype-hugepages2M Cc:
Platform: Monsoon2 UM Version:


Hi Willie,

I have copied u-at199 revision 74133, which runs all the way through without tracers. I then made changes to include tracers and to feed the UKV tracers into the lateral boundary conditions of the 500 m nest. I now get the following error in the 500 m forecast:

Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked[989] exceptions: An exception was raised:11 (Segmentation fault)

I have just tried removing the tracer from the lateral boundary conditions, and the tracers are then output fine. So the problem must be associated with the lateral boundary conditions.

Please can you help me with this?

Best wishes,


Change History (16)

comment:1 Changed 4 months ago by willie

Hi Lewis,

Try looking through the error logs to see if there are any issues that need correcting. I always

cd ~/cylc-run/<suite-id>/log/job
find . -name job.err -exec ls -l {} \;

This gives a nice list of the error files and their sizes, so you can easily look through them.
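The same sweep can be scripted if you prefer Python; a minimal sketch (the function name and directory layout here are illustrative, not part of cylc):

```python
import os

def list_job_errs(log_dir):
    """Walk a cylc log tree and return (path, size) for every job.err,
    largest first, so non-empty error logs stand out."""
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        if "job.err" in files:
            path = os.path.join(root, "job.err")
            hits.append((path, os.path.getsize(path)))
    return sorted(hits, key=lambda t: t[1], reverse=True)
```

Point it at ~/cylc-run/&lt;suite-id&gt;/log/job and inspect the files at the top of the list first.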


comment:2 Changed 4 months ago by nx902220

Hi Willie,

I ran the model again, in debug mode this time. The only error I can find is in the 500 m forecast stage (there are warnings in job.err from other processes, but these also appear in successful runs of the suite without tracers, so I don't think they are the cause of the problem). From job.out I can see it fails during the first time step.

The segmentation fault I am getting is the same as one you previously helped me with in an earlier version of the suite. The fix then was to turn off the smag STASH request (which I am still doing, since I have carried that change forward into newer versions of the suite). Maybe there is a similar fix to this problem.

Maybe I am doing something silly but I have thought through what I'm doing and have discussed with more experienced UM users. I have checked the error messages but cannot make any progress with them.

If you have any advice I would much appreciate it.

Best wishes,


comment:3 Changed 4 months ago by grenville


job.err has a back trace which shows that the problem is at line 149 in


I'd try switching off STASH requests systematically to see which one might be the problem. Alternatively, you could put in some simple write statements around line 194 to see what it's objecting to.


(Willie is out for a few days)

comment:4 Changed 3 months ago by nx902220

Hi Grenville,

I've tried turning off all STASH requests other than the tracer request, and I then get past the segmentation fault in the 500 m forecast. (I had to run the UKV and 500 m nests separately, since the UKV STASH needs to be on to create the 500 m initial and boundary condition files.) This suggests the problem is with a STASH request other than the tracer one. Going through all of the STASH requests systematically would take me months, since one job run takes at least a day due to queuing.
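If a single request is responsible, a systematic search need not mean one run per STASH request: bisecting the enabled set finds the culprit in roughly log2(n) runs. A hypothetical sketch, where `run_fails` stands in for submitting the suite with a given subset of requests enabled (names are illustrative):

```python
def bisect_failing(items, run_fails):
    """Find the single item whose presence makes run_fails(subset) True.

    Assumes exactly one offending item and a reproducible failure;
    each call to run_fails costs one suite run.
    """
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if run_fails(half) else candidates[len(half):]
    return candidates[0]
```

With sixteen suspect requests that is four runs instead of sixteen.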

I'm not sure how to do the write statement test you propose. If you still think that would be useful please can you tell me how to do it?

Best wishes,


comment:5 Changed 3 months ago by willie

Hi Lewis,

Are you still getting the STASH error? The job seems to have carried on past the point in comment 4.


comment:6 Changed 3 months ago by nx902220

Hi Willie,

If you are referring to the model run which is running at the moment then yes it has got past the STASH error.

This is because I turned all STASH requests off except for the tracer request. The STASH error must be associated with a STASH request other than tracer.

I need to be able to run the suite with most STASH requests on, e.g. the timeseries request and the requests required to make LBCs for the next nest.

I suppose one test would be to turn all of those on and turn off those which are not important, in the hope that it is one of the unimportant requests that is the problem.



comment:7 Changed 3 months ago by willie

Hi Lewis,

Can you back up to the point where the STASH error occurred and repeat the run to get the error again? Then I'll be able to have a more detailed look.


comment:8 Changed 3 months ago by nx902220

Hi Willie,

I have just run the 500 m nest with STASH for the tracer and all of the prognostics used to make boundary conditions for the 300 m nest turned on, without error.

I will try running it again with more stash turned on but only those that I know I need (some before were just on because they were when I picked up the suite).

If I get the STASH error when I do this I will let you know.



comment:9 Changed 3 months ago by nx902220

Hi Willie,

The run I mentioned in comment 8 has failed with the STASH error. The previous run, with fewer STASH requests turned on, did not fail.

A change set between these two runs is here:

If you could help me find out what the problem is it would be much appreciated.



comment:10 Changed 3 months ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Lewis,

I have made some progress with this. Firstly, it has nothing to do with STASH. I increased the process count for the 500 m forecast job to 36x36 = 1296 cores, and this changed the error to

?  Error code: 300
?  Error from routine: UM_WRITDUMP
?  Error message: Failure to gather field
?  Error from processor: 1079
?  Error number: 22

after 360 time steps (= one hour) when writing dymeca_da000.

Switching on flush buffers (IO System settings > print Manager control pmt_force_flush) reveals further information:

Return code was  10
Error message was GENERAL_GATHER_FIELD : Field type not recognized
Field number  2
Dimensions  600  x  600
Grid type  4617315517961601024
Field was not written out

I don't yet know what field 2 is. The problem is that the grid type is wrong. There are two possibilities: either another error is corrupting the grid type, or it is something you have changed in the 500 m forecast set-up.
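One clue about how it is being corrupted: 4617315517961601024 is exactly the bit pattern of the IEEE-754 double 5.0 read back as a 64-bit integer, which suggests a real value is being written over (or read as) the integer grid-type field. A quick Python check of that observation:

```python
import struct

# Pack the double 5.0, then reinterpret the same eight bytes as a
# signed 64-bit integer; the result matches the bogus grid type.
(as_int,) = struct.unpack("<q", struct.pack("<d", 5.0))
print(as_int)  # prints 4617315517961601024
```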

Do you have an example of dymeca_da000 from a previous closely related run that we could compare with?


comment:11 Changed 3 months ago by nx902220

Hi Willie,

Thank you for exploring this.

I have output from a similar suite in the following directory:
There is no dymeca_da000 there, though; the first output is dymeca_da001.

It is probably unrelated, but I thought I should mention it anyway: I have always had a warning in job.err for ukv_start that doesn't stop the run (it happens in all of my nesting suites):

[lblunn@exvmsrose:~/cylc-run/u-aw764/log/job/20160504T0300Z/ukv_start/01]$ cat job.err
/projects/um1/lib/python2.7/mule/ UserWarning:
File: /home/d04/lblunn/cylc-run/u-aw764/share/cycle/20160504T0300Z/ukv_frame
Field validation failures:

Fields (0,1,2,3,4, … 31213 total fields) Cannot validate field due to incompatible grid type:
File grid : 0
Field grid: 3

Best wishes,


comment:12 Changed 2 months ago by willie

  • Keywords craype-hugepages2M added
  • Platform set to Monsoon2

Hi Lewis,

In a wild and random move, I now have the 500 m model working with tracers: see my u-ay522. The secret is to delete the huge pages module from the 500 m environment in suite.rc. I have no idea why this works. So it was nothing to do with STASH, processors or writing dumps. It's probably a good idea to make some of the changes I suggested in other notes to you, to make the model a bit easier to use.
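A hypothetical sketch of what that edit looks like in suite.rc (the task name and surrounding keys are illustrative; only the craype-hugepages2M module name comes from this ticket, and the exact key depends on how the suite loads its modules):

```
[runtime]
    [[fcst_500m]]            # illustrative task name
        pre-script = """
            # module load craype-hugepages2M   <- line deleted for the 500m task
            ...
        """
```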


comment:13 Changed 8 weeks ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed

comment:14 Changed 7 weeks ago by nx902220

Hi Willie,

I took a copy of your suite u-ay522, called u-az169. A change set between the two suites is below:

The suite successfully runs 500m_fcst_00 and I have checked output.

The suite fails at 500m_fcst_01 with job.err:

doStderr: received 397 bytes ALPSSTDIO_MSGTEXT
?????????????????????????????? WARNING ??????????????????????????????
? Warning code: -20
? Warning from routine: CHECK_DUMP_PACKING
? Warning message:
? Packing codes in dump inconsistent with DUMP_PACKim.
? Packing codes updated.
? Warning from processor: 0
? Warning number: 20

control_loop: numPoll 4, poll returned 1
control_loop: received message on control fd 14
processControlMsg: received stderr message
stderr: <?xml version="1.0"?><methodCall><methodName>stderr</methodName><params><param><value><struct><member><name>msgtext</name><value><base64>WzExMjldIGV4Y2VwdGlvbnM6IEFuIGV4Y2VwdGlvbiB3YXMgcmFpc2VkOjE.
doStderr: received 840 bytes ALPSSTDIO_MSGTEXT
[1129] exceptions: An exception was raised:11 (Segmentation fault)
[1131] exceptions: An exception was raised:11 (Segmentation fault)
[1129] exceptions: the exception reports the extra information: Sent by the kernel.
[1131] exceptions: the exception reports the extra information: Sent by the kernel.
[1129] exceptions: whilst in a serial region
[1131] exceptions: whilst in a serial region
[1129] exceptions: Task had pid=68390 on host nid01012
[1131] exceptions: Task had pid=68392 on host nid01012
[1129] exceptions: Program is "/home/d04/lblunn/cylc-run/u-az169/share/fcm_make/build-atmos/bin/um-atmos.exe"
[1131] exceptions: Program is "/home/d04/lblunn/cylc-run/u-az169/share/fcm_make/build-atmos/bin/um-atmos.exe"
[1129] exceptions: calling registered handler @ 0x20019d80
[1131] exceptions: calling registered handler @ 0x20019d80

Best wishes,


comment:15 Changed 7 weeks ago by willie

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi Lewis,
OK. I'll have a look - I'm not very hopeful though.


comment:16 Changed 7 weeks ago by willie

  • Status changed from reopened to accepted