Opened 4 years ago

Closed 4 years ago

#1642 closed help (answered)

XCM qsreconf times out - does not exit PBS script

Reported by: markr
Owned by: um_support
Component: UM Model
Keywords: recon
Cc:
Platform: MONSooN
UM Version: 8.6

Description

Hello (see job xlsqe),
I have an error with qsreconf, and I am not yet sure what it is. At first I thought the PBS job had not been given enough time, but the reconfiguration seems to hang and use all of the job time before exiting.

Here is the cryptic message:
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!
? Error Code: 24
? Error Message: Error in buffin errorCode=3.00 len= 256/ 256
? Error from processor: 0
? Error number: 2
????????????????????????????????????????????????????????????????????????????????

Rank 0 [Wed Sep 2 14:43:09 2015] [c0-0c1s12n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
Application 157015 is crashing. ATP analysis proceeding…

ATP Stack walkback for Rank 0 starting:

_start@…:113
libc_start_main@…:242
main@…:78
rcf_initialise$rcf_initialise_mod_@…:246
rcf_files_init$rcf_files_init_mod_@…:183
rcf_readumhdr$rcf_readumhdr_mod_@…:93
read_flh_@…:59
buffin64_i$io_@…:1890
io_ereport$io_@…:494
ereport64$ereport_mod_@…:119
gc_abort_@…:136
mpl_abort_@…:46
pmpi_abort@0x873d2c
MPI_Abort@0x8921a4
MPID_Abort@0x8ba661
abort@…:92
raise@…:42

ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
atpFrontend.exe: collectAndWriteMBT: graphlib error exporting graph to dot format
atpFrontend.exe: main: main: Failed to collectAndWriteMBT for whole app
=>> PBS: job killed: walltime 994 exceeded limit 900
aprun: Apid 157015: Caught signal Terminated, sending to application

Change History (6)

comment:1 Changed 4 years ago by markr

I suspect the path to the file is not correct (the data layout on XCM is different to IBM02).

comment:2 Changed 4 years ago by grenville

Mark,

errorCode=3 means it can't find the file:

gmslis@xcml00:~> ls /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00
ls: cannot access /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00: No such file or directory

It's bad that the model didn't exit - something else to look at.

Grenville
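
Since errorCode=3 here came down to a missing input dump, one way to catch this earlier is a pre-flight check before the reconfiguration is launched. The following is only a minimal sketch: `check_input_file` is a hypothetical helper name, not part of the UM scripts, and where such a check would sit in the job script is an assumption.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of the UM scripts):
# fail fast if an input file is missing or unreadable.
check_input_file() {
    file=$1
    if [ ! -r "$file" ]; then
        # Report on stderr and signal failure to the caller
        echo "ERROR: input file not found or unreadable: $file" >&2
        return 1
    fi
    echo "OK: $file"
}
```

In a job script this could be called as `check_input_file /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 || exit 1`, so the job stops immediately with a readable message instead of reaching the buffin error.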

comment:3 Changed 4 years ago by ros

Hi Mark,

Just to add to this, I have now copied the missing dump over to the XCM, so your job should now run.

Regards,
Ros.

comment:4 Changed 4 years ago by markr

Thanks Ros! Will try it now.
There is probably still an issue with reconf not exiting gracefully from the PBS job on failure.
I originally thought I had not allowed enough time (5 minutes) and changed it to 15 minutes, but the job still timed out (I now understand that I should review the full log to work out why it timed out).
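
Until the underlying hang is understood, one possible workaround is to run the launch command under its own time limit, shorter than the PBS walltime, so that a stuck run is terminated and reported explicitly. This is a sketch under assumptions: it relies on GNU coreutils `timeout` being available, `run_with_limit` is a hypothetical helper name, and whether the hung application here would actually respond to the termination signal has not been tested.

```shell
#!/bin/sh
# Hypothetical wrapper (assumes GNU coreutils `timeout` is installed):
# run a command with a hard limit in seconds and report if it was hit.
run_with_limit() {
    limit=$1
    shift
    status=0
    # timeout sends SIGTERM after $limit seconds and then exits with 124
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "ERROR: command exceeded ${limit}s and was terminated" >&2
    fi
    return "$status"
}
```

For example, `run_with_limit 600 <launch command>` inside a 900-second PBS job would leave time for the script to log the failure before the walltime is reached. This does not fix the failure to exit; it only bounds how long the job waits.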

comment:5 Changed 4 years ago by markr

Just a note that XLSQB ran on 21st Aug and that dump file was used then, and also on 24th and 25th Aug:

xlsqb000.xlsqb.d15237.t112247.rcf.leave:OPEN: File /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 to be Opened on Unit 10 Exists
xlsqb000.xlsqb.d15237.t112247.rcf.leave:CLOSE: File /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 Closed on Unit 10
xlsqb000.xlsqb.d15237.t112247.rcf.leave:IO: Open: /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 on unit 10
xlsqb000.xlsqb.d15237.t112247.rcf.leave:Input dump : /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00
xlsqb000.xlsqb.d15237.t112247.rcf.leave:IO: Close: /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00 on unit 10
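
Output like the lines above can be gathered by searching the `.leave` files for the dump path. A minimal sketch, assuming the logs sit in a single directory and end in `.leave`; `logs_mentioning` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list the .leave log files in a directory
# that mention a given dump path (matched as a fixed string).
logs_mentioning() {
    dir=$1
    path=$2
    # -l prints only the names of matching files; -F disables regex matching
    grep -lF -- "$path" "$dir"/*.leave 2>/dev/null
}
```

Calling `logs_mentioning <log directory> /projects/ocean/hadgem3/initial/atmos/N96L85/antiaa.da19880901_00` would list the runs whose logs reference that dump, which helps establish when the file was last present.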

comment:6 Changed 4 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed

Hi Mark,

I will close this ticket now but have made a note to investigate the issue of not exiting properly in due course. (http://puma.nerc.ac.uk/trac/UM/ticket/829)

Cheers,
Ros.
