Opened 5 years ago

Closed 5 years ago

#1287 closed error (fixed)

problems with nudged runs

Reported by: cwright
Owned by: um_support
Component: Other
Keywords: 6.63, nudging, error
Cc:
Platform: ARCHER
UM Version: 6.6.3

Description

Hi,

I'm having a problem with nudged runs (using the UKCA nudged version of the model). The runs compile fine, but crash on an early time step, with the following error:

=========================================
lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

Encountered during a list-directed WRITE to unit 6
Fortran unit 6 is connected to a sequential formatted text file:

"/work/n02/n02/cwright/xiwxd/xiwxd.fort6.pe61"

lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

Encountered during a list-directed WRITE to unit 6
Fortran unit 6 is connected to a sequential formatted text file:

"/work/n02/n02/cwright/xiwxd/xiwxd.fort6.pe109"

lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

Encountered during a list-directed WRITE to unit 6
Fortran unit 6 is connected to a sequential formatted text file:

"/work/n02/n02/cwright/xiwxd/xiwxd.fort6.pe32"

_pmiu_daemon(SIGCHLD): [NID 01507] [c7-0c2s8n3] [Fri May 2 10:41:28 2014] PE RANK 61 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 01524] [c7-0c2s13n0] [Fri May 2 10:41:28 2014] PE RANK 109 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 01469] [c7-0c1s15n1] [Fri May 2 10:41:28 2014] PE RANK 32 exit signal Aborted
[NID 01507] 2014-05-02 10:41:28 Apid 8110014: initiated application termination
=============================================

Examples of the full output are in my umui_out on Archer - anything with xiwxb or xiwxd is an example. It last worked when I was on Hector - is it possible there's a necessary library which didn't make the move?

Corwin

Change History (3)

comment:1 Changed 5 years ago by grenville

Corwin

It might be that you have run out of disk space on /work. I have increased your quota - it may take a few hours to become available.

Grenville

comment:2 Changed 5 years ago by cwright

Hi Grenville,

that does seem to have fixed it - I've had the model running for four hours now and still going, whereas before the increase it was dying much earlier.

I had one issue while getting the run going which might be useful for you to have a note of: the first time I compiled the model run after your quota increase, the compiled directory (~/work/xiwxd) ended up being 138GB in total, with each individual .pe[n] file being many GB in size. After compiling a second time, this fell to only 19GB total, which is why it's still running fine. In both cases I'd completely deleted the directory beforehand. Combined with my ongoing problem of seemingly random compile times for identical jobs (http://cms.ncas.ac.uk/ticket/1281#comment:2), it looks like something very odd is going on with the compiler on Archer. It's not a desperately urgent issue for me - I can always recompile, and I don't need to do many distinct jobs - but this might be useful information if anyone else is having a similar problem!
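
For anyone wanting to check whether the per-PE output files are what is filling a run directory, a minimal sketch along these lines may help. It is not part of the UM or the UMUI; the function name `pe_output_sizes` and the script itself are hypothetical, and the file name pattern is only assumed from the `xiwxd.fort6.pe61` style names in the error output above.

```python
import os
import re
import sys

# Assumed naming pattern for per-PE output files, e.g. "xiwxd.fort6.pe61".
PE_PATTERN = re.compile(r"\.fort6\.pe\d+$")


def pe_output_sizes(run_dir):
    """Return (total_bytes, pe_files) for everything under run_dir.

    pe_files is a list of (size_in_bytes, path) tuples for files whose
    names match PE_PATTERN, sorted with the largest first.
    """
    total = 0
    pe_files = []
    for root, _dirs, files in os.walk(run_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip files that vanish or cannot be read mid-scan.
                continue
            total += size
            if PE_PATTERN.search(name):
                pe_files.append((size, path))
    pe_files.sort(reverse=True)
    return total, pe_files


if __name__ == "__main__":
    # Usage: python pe_sizes.py <run directory>
    total, pe_files = pe_output_sizes(sys.argv[1])
    print(f"total: {total / 1e9:.2f} GB in {sys.argv[1]}")
    # Show the ten largest per-PE output files.
    for size, path in pe_files[:10]:
        print(f"{size / 1e9:8.2f} GB  {path}")
```

Run against the job directory, this prints the total size and the ten largest per-PE files, which should make a case like the 138GB directory easy to spot before the quota is hit.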

comment:3 Changed 5 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Corwin

Thanks. ARCHER are trying to understand the poor compile performance as a matter of some urgency.

Grenville
