XCM qsreconf times out - does not exit PBS script

Reported by: markr
Owned by: um_support
Component: UM Model
Keywords: recon
Cc:
Platform: MONSooN
UM Version: 8.6


Hello (see job xlsqe),
I have an error with qsreconf and am not yet sure what it is. At first I thought the PBS job had not been given enough time, but the reconfiguration seems to hang and use up all the job time before exiting.

Here is the cryptic msg:
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!
? Error Code: 24
? Error Message: Error in buffin errorCode=3.00 len= 256/ 256
? Error from processor: 0
? Error number: 2

Rank 0 [Wed Sep 2 14:43:09 2015] [c0-0c1s12n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
Application 157015 is crashing. ATP analysis proceeding…

ATP Stack walkback for Rank 0 starting:


ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
atpFrontend.exe: collectAndWriteMBT: graphlib error exporting graph to dot format
atpFrontend.exe: main: main: Failed to collectAndWriteMBT for whole app
=>> PBS: job killed: walltime 994 exceeded limit 900
aprun: Apid 157015: Caught signal Terminated, sending to application

Change History (6)

comment:1 Changed 4 years ago by markr

I suspect the path to the file is not correct (the data layout on the XCM is different from that on IBM02).

comment:2 Changed 4 years ago by grenville


Errorcode = 3 means it can't find the file

gmslis@xcml00:~> ls /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00
ls: cannot access /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00: No such file or directory

It's bad that the model didn't exit - something else to look at.
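
A minimal sketch of a pre-flight check that would catch a missing dump before the reconfiguration is launched, so the job fails at once instead of running to the walltime limit. The function name `check_input_dump` is hypothetical and not part of the UM scripts; where such a check would sit in the job script is an assumption.

```shell
# Hypothetical helper, not part of the UM scripts.
# Returns 0 if the given file exists and is readable, 1 otherwise.
check_input_dump() {
    local dump="$1"
    if [ ! -r "$dump" ]; then
        echo "ERROR: input dump not found or not readable: $dump" >&2
        return 1
    fi
    return 0
}

# Example use in a job script (path is the one from this ticket):
# check_input_dump /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 || exit 1
```

Exiting on a non-zero return before the launch command means PBS sees the job end straight away with a clear message in the output.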


comment:3 Changed 4 years ago by ros

Hi Mark,

Just to add to this: I have now copied the missing dump over to the XCM, so your job should now run.


comment:4 Changed 4 years ago by markr

Thanks Ros! Will try it now.
There is probably still an issue with reconf not exiting gracefully from the PBS job on failure.
I originally thought I had not allowed enough time (5 minutes) and changed it to 15 minutes, but the job still timed out (I now understand that I should review the full log to work out why it timed out).
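
One possible workaround for a hang like this, sketched under assumptions: run the launch command under a time limit shorter than the PBS walltime, so the job script regains control and can report what happened. The wrapper name `run_with_limit` is hypothetical, and the sketch assumes GNU coreutils `timeout` is available where the job script runs. Whether a hung reconfiguration actually stops on the signal that `timeout` sends has not been checked here.

```shell
# Hypothetical wrapper, not part of the UM scripts.
# Assumes GNU coreutils `timeout` is on the PATH.
# Usage: run_with_limit SECONDS command [args...]
run_with_limit() {
    local limit="$1"
    local rc=0
    shift
    # timeout sends SIGTERM after $limit seconds and then exits with status 124
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "ERROR: command exceeded ${limit}s and was terminated: $*" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "ERROR: command failed with exit status $rc: $*" >&2
    fi
    return "$rc"
}
```

With a limit set below the requested walltime, a status of 124 distinguishes "hung until the limit" from an ordinary failure, which is the distinction that was hard to see in this ticket.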

comment:5 Changed 4 years ago by markr

Just a note that XLSQB ran on 21st Aug and that dump file was used then, and again on 24th Aug and 25th Aug:

xlsqb000.xlsqb.d15237.t112247.rcf.leave:OPEN: File /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 to be Opened on Unit 10 Exists
xlsqb000.xlsqb.d15237.t112247.rcf.leave:CLOSE: File /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 Closed on Unit 10
xlsqb000.xlsqb.d15237.t112247.rcf.leave:IO: Open: /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 on unit 10
xlsqb000.xlsqb.d15237.t112247.rcf.leave:Input dump : /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00
xlsqb000.xlsqb.d15237.t112247.rcf.leave:IO: Close: /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 on unit 10
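
The lines above have the `file:line` shape of a search across the `.leave` output files. A sketch of a similar search that lists only the files mentioning a given dump is shown here; the helper name `leave_files_using_dump` and the directory argument are hypothetical.

```shell
# Hypothetical helper: lists the .leave files in a directory that mention
# a given input dump path. The directory layout is an assumption.
leave_files_using_dump() {
    local dir="$1"
    local dump="$2"
    # -l prints only the names of matching files;
    # -F treats the path as a fixed string rather than a regular expression
    grep -lF -- "$dump" "$dir"/*.leave 2>/dev/null
}
```

Called with the output directory and the dump path, it prints one matching file name per line, which is enough to see which earlier runs read the file.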

comment:6 Changed 4 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed

Hi Mark,

I will close this ticket now but have made a note to investigate the issue of not exiting properly in due course.


