Opened 9 months ago

Closed 8 months ago

#2720 closed help (fixed)

Error suite u-bd704 on second time step

Reported by: xd904476
Owned by: um_support
Component: Coupled model
Keywords: coupled
Cc:
Platform: ARCHER
UM Version: 10.7

Description

Hi, I am running suite u-bd704 and it fails at the second time step.
The suite has created a History_data directory on ARCHER, but then it fails with the following messages.
What could be the problem?

Thanks,
Dani

std_err

Rank 198 [Tue Jan 8 17:31:16 2019] [c4-0c1s4n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 198
_pmiu_daemon(SIGCHLD): [NID 00850] [c4-0c1s4n2] [Tue Jan 8 17:31:16 2019] PE RANK 198 exit signal Aborted
[NID 00850] 2019-01-08 17:31:17 Apid 33158802: initiated application termination
[FAIL] run_model # return-code=137
Received signal ERR
cylc (scheduler - 2019-01-08T17:31:42Z): CRITICAL Task job script received signal ERR at 2019-01-08T17:31:42Z
cylc (scheduler - 2019-01-08T17:31:42Z): CRITICAL failed at 2019-01-08T17:31:42Z

std_out
icebergs, read_restart_bergs: # bergs = 0 on PE 42

ice: Error writing time variable

Application 33158802 exit codes: 134
Application 33158802 exit signals: Killed
Application 33158802 resources: utime ~575s, stime ~8

Change History (4)

comment:1 Changed 9 months ago by willie

  • Component changed from UM Model to Coupled model
  • Keywords coupled added
  • Platform set to ARCHER
  • UM Version set to 10.7

Hi Dani,

It has just completed the second cycle for me, so try doing a fresh start:

rose suite-run --new

Willie

comment:2 Changed 8 months ago by xd904476

Hi Willie, I did try this, but now I get an "UNRECOVERABLE library error". It seems to occur when the iceberg statistics files (icebergs.stat_*) are written, but I don't know how to fix it.
This is the stderr output I get:



Current format: 200 FORMAT(a19,10(a18,"=",es14.7,x,a2,:,","))


lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

Encountered during a sequential formatted WRITE to
Encountered during a sequential formatted WRITE to
Encountered during a sequential formatted WRITE to
Encountered during a sequential formatted WRITE to unit 18

unit 18

Encountered during a sequential formatted WRITE to unit 18
Fortran unit 18 is
Encountered during a sequential formatted WRITE toconnected to unit 18
a sequential formatted text fileFortran unit 18 is :

"icebergs.stat_0020"

connected to Current format: a sequential formatted text file 200 FORMAT:

"icebergs.stat_0012"

(a19,10(a18,"=",es14.7,x,a2,:,","))

Current format: 200 FORMAT (a19,10(a18,"=",es14.7,x,a2,:,","))



Fortran unit 18 is
Encountered during a sequential formatted WRITE toFortran unit 18 is unit 18

unit 18

connected to unit 18
connected to Fortran unit 18 is a sequential formatted text file:

"icebergs.stat_0032"

a sequential formatted text file Current format: :

"icebergs.stat_0045"
200 FORMAT(a19,10(a18,"=",es14.7,x,a2,:,","))

Current format: 200 FORMAT Fortran unit 18 is (a19,10(a18,"=",es14.7,x,a2,:,","))
connected to a sequential formatted text file :

"icebergs.stat_0031"

Current format: 200 FORMAT (a19,10(a18,"=",es14.7,x,a2,:,","))



Fortran unit 18 is connected to a sequential formatted text file:

"icebergs.stat_0030"

Current format: 200 FORMAT(a19,10(a18,"=",es14.7,x,a2,:,","))

connected to a sequential formatted text file :

"icebergs.stat_0046"
Current format: 200 FORMAT (a19,10(a18,"=",es14.7,x,a2,:,","))




lib-4029 : UNRECOVERABLE library error

An underlying C library read or write request failed.

Encountered during a sequential formatted WRITE to unit 18
Fortran unit 18 is connected to a sequential formatted text file:

"icebergs.stat_0023"

Current format: 200 FORMAT(a19,10(a18,"=",es14.7,x,a2,:,","))


_pmiu_daemon(SIGCHLD): [NID 00100] [c0-0c1s9n0] [Thu Jan 10 00:59:00 2019] PE RANK 247 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00099] [c0-0c1s8n3] [Thu Jan 10 00:59:00 2019] PE RANK 228 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00097] [c0-0c1s8n1] [Thu Jan 10 00:59:00 2019] PE RANK 210 exit signal Aborted
[NID 00097] 2019-01-10 00:59:00 Apid 33165286: initiated application termination
[FAIL] run_model # return-code=137
Received signal ERR
cylc (scheduler - 2019-01-10T00:59:07Z): CRITICAL Task job script received signal ERR at 2019-01-10T00:59:07Z
cylc (scheduler - 2019-01-10T00:59:07Z): CRITICAL failed at 2019-01-10T00:59:07Z

Thanks,
Dani

comment:3 Changed 8 months ago by willie

Hi Dani,

In

./18500201T0000Z/coupled/01/job.err

you are getting

BUFFOUT: Write Failed: Disk quota exceeded

So you have run out of disk quota on ARCHER, which is why the writes are failing. Try removing previous runs that you no longer require; you can copy data to the RDF first if necessary. Then try again.
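
As a rough sketch of how to check for this yourself, the two shell functions below search a suite's job logs for the quota message and list the largest entries under a directory. The function names `find_quota_errors` and `largest_dirs` are hypothetical helpers written for this example, and the paths in the usage lines (`~/cylc-run/u-bd704/log/job` and `~/cylc-run`) are assumptions about where the logs and old runs live on your account; adjust them to match your setup.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every job.err file under a log directory
# that contains the "Disk quota exceeded" message.
find_quota_errors() {
    local log_root="$1"
    grep -rl --include='job.err' 'Disk quota exceeded' "$log_root"
}

# Hypothetical helper: list the largest entries directly under a
# directory (size in KiB, biggest first). The second argument is the
# number of entries to show and defaults to 10.
largest_dirs() {
    local root="$1"
    local count="${2:-10}"
    du -sk "$root"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example usage (the paths are assumptions, not taken from the ticket):
#   find_quota_errors ~/cylc-run/u-bd704/log/job
#   largest_dirs ~/cylc-run 5
```

Searching the logs this way can be quicker than reading the interleaved stderr from many ranks, because the underlying cause may be reported only once in a single job.err. The size listing then shows which old runs are worth deleting or archiving first.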

Willie

comment:4 Changed 8 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed