resubmit/qsatmos error?

I have been trying to run the xljtk job (an aquaplanet job) in the UM on Archer. It compiles fine, but when it tries to resubmit it fails with this error message:

lib-4212 : UNRECOVERABLE library error
  An internal WRITE tried to write beyond the end of an internal file.

Encountered during a list-directed WRITE to an internal file (character variable)
_pmiu_daemon(SIGCHLD): [NID 01292] [c6-0c2s3n0] [Sat Oct  3 23:00:33 2015] PE RANK 95 exit signal Aborted
[NID 01292] 2015-10-03 23:00:33 Apid 18071650: initiated application termination
xljtk: Run failed
   Ending script   :   qsatmos
   Completion code :   137
   Completion time :   Sat Oct  3 23:00:36 BST 2015

/work/n02/n02/lboljka/um/xljtk/bin/qsmaster: Failed in qsatmos in job xljtk
   Starting script :   qsfinal
   Starting time   :   Sat Oct  3 23:00:36 BST 2015

Checking requirement for atmosphere resubmit...
/work/n02/n02/lboljka/um/xljtk/bin/qsresubmit: Error: no resubmit details found
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Sat Oct  3 23:00:36 BST 2015

/work/n02/n02/lboljka/um/xljtk/bin/qsmaster: Failed in qsfinal in job xljtk
 <<<< Information about How Many Lines of Output follow >>>>
 24  lines in main OUTPUT file.
PE0 file for atmos is /work/n02/n02/lboljka/um/xljtk/pe_output/xljtk.fort6.pe000
 199449 lines of O/P from pe0.
 <<<<         Lines of Output Information ends          >>>>

This is from the file below (many others in the same folder show the same error, as I have tried to rerun the job many times):

 archer$ /home/n02/n02/lboljka/output/xljtk002.xljtk.d15276.t225505.leave

I am a little bit confused, as my resubmits did not crash before in other jobs such as xljtf. I did change the ozone and SSTs quite drastically, though.

Do you know where this error could be coming from?

Thank you!

Best wishes

Your leave file appears to indicate some problem with "Slow physics source terms from atmos_physics1" at time step 4889, where it's reporting NaNs. I also notice a possibly related issue: at time step 2160

r_thetav : -0.4570642396242415E+02 0.6167015041783542E+02

but at time step 2161

r_thetav : -0.1683387627549504+213 0.1995851045707727+293
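
The second pair of values has no "E" before the exponent: Fortran drops it when the exponent needs three digits, so "+213" and "+293" are powers of ten. A minimal Python sketch for reading such lines is given here; the helper names are hypothetical, and the line layout is assumed to match the two r_thetav lines quoted above.

```python
import re

# Fortran omits the 'E' when the exponent needs three digits, so
# "0.1995851045707727+293" stands for 0.1995851045707727E+293.
_MISSING_E = re.compile(r"(?<=\d)([+-]\d{3})$")


def parse_fortran_real(token):
    """Convert a Fortran-formatted real to a float, restoring a dropped 'E'."""
    return float(_MISSING_E.sub(r"E\1", token.strip()))


def parse_r_thetav(line):
    """Return (min, max) from a line 'r_thetav : <min> <max>', else None."""
    label, sep, rest = line.partition(":")
    if not sep or label.strip() != "r_thetav":
        return None
    fields = rest.split()
    if len(fields) != 2:
        return None
    return parse_fortran_real(fields[0]), parse_fortran_real(fields[1])


# The two lines quoted from the leave file.
step_2160 = parse_r_thetav("r_thetav : -0.4570642396242415E+02 0.6167015041783542E+02")
step_2161 = parse_r_thetav("r_thetav : -0.1683387627549504+213 0.1995851045707727+293")
```

Read this way, the values at time step 2161 are of order 10^212 and 10^292, against roughly 46 and 62 one step earlier.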


Hi Grenville

So it is probably due to the changes in SST and ozone being too large.


Best wishes

What are the values of r_thetav for the run without your ozone and SST?


Where can I check that?
My working run was xljtf; output is in archer folder: /work/n02/n02/lboljka/um/xljtf
(I cannot open any of the dump files in xconv for some reason…)


I found it in the leave file


The problem is that I no longer have the .leave files for that job, as otherwise my home directory on Archer becomes too full. I am pretty sure the values were not infinite, as that is not physical and would make the model crash.


Hi Grenville

I have checked the .leave files from the failing run (xljtk), and the r_thetav term has values of around 60 throughout the first month of the run, right up to the last time step. Then at the first time step of the following month the values suddenly jump to around 10^213. There must be some resubmit error, because these values should not jump that much from one time step to the next. Also, all the fields, such as winds, temperature and radiation, look OK; none are going to infinity (not even in month 2).
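
One way to pin down where such a jump happens is to compare the magnitude of consecutive values and report the largest growth. The sketch below is only illustrative: the function name is hypothetical and the sample series is made up to mimic values near 60 followed by a blow-up, not copied from the leave file.

```python
import math


def largest_jump(values):
    """Return (index, decades) for the biggest step-to-step growth in |value|.

    'index' is the position of the later value of the pair and 'decades' is
    log10 of the ratio of magnitudes. Pairs containing a zero, NaN or
    infinity are skipped, so those need a separate check.
    """
    best = None
    for i in range(1, len(values)):
        prev, curr = abs(values[i - 1]), abs(values[i])
        if not (math.isfinite(prev) and math.isfinite(curr)):
            continue
        if prev == 0 or curr == 0:
            continue
        # Subtract logs rather than dividing, to avoid overflow in the ratio.
        decades = math.log10(curr) - math.log10(prev)
        if best is None or decades > best[1]:
            best = (i, decades)
    return best


# Made-up series: values near 60, then a jump to the order of 1e213.
series = [61.2, 60.8, 61.7, 1.0e213, 2.0e293]
index, decades = largest_jump(series)
```

For this series the largest growth is at index 3, the first value after the run of values near 60, which is the kind of position to compare against the resubmit boundary.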

Do you know what might be going wrong?

Best wishes

I don't know what your changes do to the model - if it ran OK before your modifications, you'll need to have a closer look at them.


  • Resolution set to answered
  • Status changed from new to closed
