Run failed

UM Version: 7.6



My job xgoua is failing at run time. I'm confused by this, as I had it working yesterday. I would be grateful if you could take a look. I've copied part of the error message below.

Many thanks,


UM Executable : /work/n02/n02/anmcr/xgoua/bin/xgoua.exe

_pmii_daemon(SIGCHLD): [NID 00298] [c2-1c1s5n0] [Fri Feb 24 15:01:16 2012] PE 40 exit signal Segmentation fault
[NID 00298] 2012-02-24 15:01:16 Apid 1708092: initiated application termination
diff: /work/n02/n02/anmcr/tmp/tmp.hector-xe6-13.32572/xgoua.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/anmcr/xgoua/xgoua.thist to backup thist file /work/n02/n02/anmcr/xgoua/xgoua.thist_keep
xgoua: Run failed

Ending script : qsexecute
Completion code : 137
Completion time : Fri Feb 24 15:01:27 GMT 2012


/work/n02/n02/anmcr/xgoua/bin/qsmaster: Failed in qsexecute in model xgoua

Hi Andrew,

It seems to be crashing immediately. Could you please give me read permission on the core file, and also let me know the result of typing umask?

Hi Willie,

Thanks for your help.

I've given you read permission.

I get the response '0022' if I type umask.
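For reference, a umask of 0022 removes write permission for group and others from newly created files. A file that a program creates with the conventional requested mode 0666 therefore ends up as 0644 (rw-r--r--), and a directory requested as 0777 ends up as 0755. The short Python sketch below does this arithmetic. It assumes those conventional requested modes; a program may ask for a more restrictive mode when it creates a file, in which case the result is correspondingly tighter.

```python
import stat

def apply_umask(requested_mode: int, umask: int) -> int:
    """Return the mode a new file or directory gets: requested bits minus masked bits."""
    return requested_mode & ~umask

umask = 0o022  # value reported above
file_mode = apply_umask(0o666, umask)  # assumed conventional request for files
dir_mode = apply_umask(0o777, umask)   # assumed conventional request for directories

# Show the octal mode and the familiar "ls -l" style string.
print(oct(file_mode), stat.filemode(stat.S_IFREG | file_mode))
print(oct(dir_mode), stat.filemode(stat.S_IFDIR | dir_mode))
```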

As I said, this job did run last week. Then I changed the location of the um_extracts directory, and after that it no longer seemed to run. So I reverted to the old location of the um_extracts directory, deleted the entire xgoua directory on Hector, and recompiled everything from scratch.


Hi again Willie,

I've just noticed that I set 'Target machine root extract directory (UM_ROUTDIR)' to '/work/n02/n02/anmcr/'. I don't think the trailing forward slash should be there. Could this be the cause of the failure?
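A trailing slash on a directory setting is usually harmless. If the scripts build paths by simple concatenation such as $UM_ROUTDIR/xgoua (an assumption about how they use the variable), the result contains a doubled slash, and POSIX systems treat consecutive slashes as a single separator. The Python sketch below illustrates this; naive_join is a hypothetical helper that mimics shell-style concatenation, and the check uses a temporary directory rather than the real one.

```python
import os
import posixpath
import tempfile

def naive_join(root: str, name: str) -> str:
    # Mimics shell-style "$ROOT/name" concatenation (assumed, not taken from the
    # UM scripts); a trailing slash in root produces a doubled "//".
    return root + "/" + name

# Value as set in the UMUI, with the trailing slash.
routdir = "/work/n02/n02/anmcr/"
joined = naive_join(routdir, "xgoua")
print(joined)                      # /work/n02/n02/anmcr//xgoua
print(posixpath.normpath(joined))  # /work/n02/n02/anmcr/xgoua

# Confirm on a real (temporary) directory that both spellings refer to the same place.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "xgoua"))
    same = os.path.samefile(naive_join(tmp + "/", "xgoua"), naive_join(tmp, "xgoua"))
    print(same)  # True on POSIX systems
```

This only illustrates path resolution; it does not rule out some other use of the variable in the scripts.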



Hi Andrew,

I don't think that is likely. The core file was produced just as the model was about to write the error message "over-writing due to dim_e _out size". There are many possible causes for this, so we need more details. The next step is to repeat the run with the subroutine timer diagnostics and extra diagnostics switched on (see the output options). In section 13, push the DIAG button and tick the "flush buffer if run fails" box.



Hi Willie,

I've just been doing this. I made a copy of the job for this purpose, xgzqa, and I've copied part of its output file below. Judging by ticket #640, the problem seems to start with the reconfiguration. In fact, I now remember that the run initially reconfigured the start dump with the .orog and .mask ancillaries switched off. I then changed this so that the N512 .orog and .mask ancillaries were configured. I had thought that the reconfiguration had worked, but perhaps it hadn't. I've made another copy of the job (xgzqb) in which I reconfigure the start dump but keep the orography and land/sea mask options switched off.


Maximum horizontal wind at timestep 2 Max wind this run
max_wind level proc position run max_wind level timestep
0.214E+03 66 57 80.9deg E 72.7deg N 0.214E+03 66 1

Atm_Step: Timestep 3

initial Absolute Norm : 9595.2338780094433
GCR( 2 ) converged in 48 iterations.
Final Absolute Norm : NaN

WARNING q_POS : 45 points were less than 1.00000000000000002E-8 and have been reset to 1.00000000000000002E-8
WARNING q_POS : All other points unchanged
WARNING q_POS : 1154 points were less than 0. and have been reset to 0.
WARNING q_POS : All other points unchanged

Minimum theta level 1 for timestep 3
This timestep This run
Min theta1 proc position Min theta1 timestep
NaN 2016-12577.5deg W 706.9deg N NaN 3

Largest negative delta theta1 at minimum theta1
This timestep = NaNK. At min for run = NaNK

Maximum vertical velocity at timestep 3 Max w this run
w_max level proc position run w_max level timestep
NaN 69 2016*deg W 706.9deg N NaN 69 3
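In the excerpt above, the first sign of trouble is the solver's "Final Absolute Norm : NaN" at timestep 3, before the theta and w diagnostics also turn to NaN. In a long output file, a small script can locate the first NaN and the timestep it belongs to. The Python sketch below is a hypothetical helper, not part of the UM; it assumes timestep headers have the form "Atm_Step: Timestep N", as in this excerpt, and other versions or output settings may format them differently.

```python
import re
import sys
from typing import Iterable, Optional, Tuple

# Assumed header format, taken from the excerpt above.
STEP_RE = re.compile(r"Atm_Step:\s*Timestep\s+(\d+)")
# Whole-word "NaN" only, so diagnostic strings like "NaNK" are not counted.
NAN_RE = re.compile(r"\bNaN\b")

def first_nan(lines: Iterable[str]) -> Optional[Tuple[Optional[int], int, str]]:
    """Return (timestep, line_number, line) for the first line containing NaN, or None."""
    timestep = None
    for lineno, line in enumerate(lines, start=1):
        m = STEP_RE.search(line)
        if m:
            timestep = int(m.group(1))
        if NAN_RE.search(line):
            return timestep, lineno, line.rstrip("\n")
    return None

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as f:
        hit = first_nan(f)
    if hit is None:
        print("No NaN found")
    else:
        step, lineno, text = hit
        print(f"First NaN at timestep {step}, line {lineno}: {text}")
```

Saved under any name, it takes the path of the output file as its only argument.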

Hi Willie,

I've got the job to run by switching off configuration of the orography and land-sea mask ancillaries. The job ID is now xgzqa.

I'm a little concerned about having to do this, as I thought these were the two ancillaries that had to be configured. However, perhaps they are only really necessary for a limited-area model run.

If you think what I have done is OK, please close this ticket.



