Opened 4 years ago

Closed 4 years ago

#1685 closed help (answered)

resubmit/qsatmos error?

Reported by: lboljka
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.6

Description

Hi

I have been trying to run the xljtk job (an aquaplanet job) in the UM on ARCHER. It compiles fine, but when it tries to resubmit, it fails with this error message:

lib-4212 : UNRECOVERABLE library error
  An internal WRITE tried to write beyond the end of an internal file.

Encountered during a list-directed WRITE to an internal file (character variable)
_pmiu_daemon(SIGCHLD): [NID 01292] [c6-0c2s3n0] [Sat Oct  3 23:00:33 2015] PE RANK 95 exit signal Aborted
[NID 01292] 2015-10-03 23:00:33 Apid 18071650: initiated application termination
xljtk: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   137
   Completion time :   Sat Oct  3 23:00:36 BST 2015
*****************************************************************


/work/n02/n02/lboljka/um/xljtk/bin/qsmaster: Failed in qsatmos in job xljtk
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Sat Oct  3 23:00:36 BST 2015
***************************************************************

Checking requirement for atmosphere resubmit...
/work/n02/n02/lboljka/um/xljtk/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Sat Oct  3 23:00:36 BST 2015
*****************************************************************

/work/n02/n02/lboljka/um/xljtk/bin/qsmaster: Failed in qsfinal in job xljtk
 <<<< Information about How Many Lines of Output follow >>>>
 24  lines in main OUTPUT file.
PE0 file for atmos is /work/n02/n02/lboljka/um/xljtk/pe_output/xljtk.fort6.pe000
 199449 lines of O/P from pe0.
 <<<<         Lines of Output Information ends          >>>>

This is from the following file (many others in the same folder show the same error, as I tried to rerun the job many times):

 archer$ /home/n02/n02/lboljka/output/xljtk002.xljtk.d15276.t225505.leave

I am a little confused, because my resubmits did not crash in earlier jobs such as xljtf. I did change the ozone and SSTs quite drastically, though.

Do you know where this error could be coming from?

Thank you!

Best wishes
Lina

Change History (9)

comment:1 Changed 4 years ago by grenville

Lina

Your leave file appears to indicate a problem with "Slow physics source terms from atmos_physics1" at time step 4889, where it is reporting NaNs. I notice a possibly related issue: at time step 2160

r_thetav : -0.4570642396242415E+02 0.6167015041783542E+02

but at time step 2161

r_thetav : -0.1683387627549504+213 0.1995851045707727+293

Grenville
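
For anyone reading the two r_thetav lines above: the values at time step 2161 have no "E" because Fortran output drops it when the exponent needs three digits, so 0.1995851045707727+293 means about 2.0 x 10^292. A minimal Python sketch for converting such tokens follows; the function name parse_fortran_real is made up for illustration and is not part of the UM.

```python
import re

# Matches a mantissa followed directly by a signed exponent with no 'E',
# e.g. "0.1995851045707727+293" or "-0.1683387627549504+213".
_NO_E = re.compile(r"^([+-]?\d*\.?\d+)([+-]\d+)$")


def parse_fortran_real(token: str) -> float:
    """Convert a Fortran-formatted real to a float, restoring a dropped 'E'.

    Hypothetical helper for illustration only.
    """
    token = token.strip().replace("D", "E").replace("d", "e")
    match = _NO_E.match(token)
    if match:
        token = f"{match.group(1)}E{match.group(2)}"
    return float(token)


if __name__ == "__main__":
    # The four values quoted in this comment.
    for text in ("-0.4570642396242415E+02", "0.6167015041783542E+02",
                 "-0.1683387627549504+213", "0.1995851045707727+293"):
        print(text, "->", parse_fortran_real(text))
```

Read this way, r_thetav goes from order 10^1 to order 10^292 in a single time step.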

comment:2 Changed 4 years ago by lboljka

Hi Grenville

So it is probably because the changes in SST and ozone are too large.

Thanks.

Best wishes
Lina

comment:3 Changed 4 years ago by grenville

Lina

What are the values of r_thetav for the run without your ozone and SST?

Grenville

comment:4 Changed 4 years ago by lboljka

Grenville

Where can I check that?

Thanks
Lina

comment:5 Changed 4 years ago by grenville

I found it in the leave file

Grenville
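
Rather than reading a leave file of this size by eye, a short script can report the first time step at which r_thetav becomes unreasonably large. This is only a sketch under stated assumptions: the "r_thetav : min max" layout is taken from the lines quoted in comment 1, while the "Atm_Step: Timestep N" marker pattern, the function names and the 1.0e6 threshold are assumptions that should be checked against the actual file.

```python
import math
import re

# ASSUMPTION: time steps are announced by lines like "Atm_Step: Timestep 2161".
TIMESTEP = re.compile(r"Atm_Step:\s*Timestep\s+(\d+)")
# Layout taken from the quoted diagnostics: "r_thetav : <min> <max>".
R_THETAV = re.compile(r"r_thetav\s*:\s*(\S+)\s+(\S+)")


def to_float(token):
    """Parse a Fortran real, restoring the 'E' dropped before 3-digit exponents."""
    fixed = re.sub(r"(?<=\d)([+-]\d+)$", r"E\1", token)
    try:
        return float(fixed)
    except ValueError:
        return math.nan  # e.g. a field printed as asterisks


def first_blowup(lines, limit=1.0e6):
    """Return (timestep, min, max) for the first suspicious r_thetav line.

    'limit' is an arbitrary illustrative threshold, not a UM setting.
    Returns None if every r_thetav line looks reasonable.
    """
    step = None
    for line in lines:
        match = TIMESTEP.search(line)
        if match:
            step = int(match.group(1))
            continue
        match = R_THETAV.search(line)
        if match:
            low, high = to_float(match.group(1)), to_float(match.group(2))
            if any(math.isnan(v) or abs(v) > limit for v in (low, high)):
                return step, low, high
    return None


if __name__ == "__main__":
    import sys
    # Usage: python scan_leave.py <leave file>
    with open(sys.argv[1], errors="replace") as handle:
        print(first_blowup(handle))
```

Running the same script on a leave file from a job that resubmitted cleanly would give the comparison asked for in comment 3.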

comment:6 Changed 4 years ago by lboljka

The problem is that I no longer have the .leave files for that job; I delete them because otherwise my home directory on ARCHER becomes too full.
Lina

comment:7 Changed 4 years ago by lboljka

Hi Grenville

I have checked the .leave files from the run that fails (xljtk), and the r_thetav term has values of around 60 throughout the first month of the run, right up to the last time step. At the first time step of the following month, the values suddenly jump to 10^213. There must be some resubmit error, because these values should not jump so much from one time step to the next. Also, all the fields, such as winds, temperature and radiation, look OK; none are going to infinity (not even in month 2).

Do you know what might be going wrong?

Thanks.
Best wishes
Lina

comment:8 Changed 4 years ago by grenville

Lina

I don't know what your changes do to the model - if it ran OK before your modifications, you'll need to have a closer look at them.

Grenville

comment:9 Changed 4 years ago by lboljka

  • Resolution set to answered
  • Status changed from new to closed