Opened 8 years ago

Closed 7 years ago

#662 closed help (fixed)

HadCM3 vn4.5 error forrtl: severe (104): incorrect STATUS= specifier value for connected file, unit 8, file /nobackup/eejop/work/um/xfzmn/tmp/xfzmn.pipe

Reported by: pliojop Owned by: simon
Component: UM Model Keywords: HadCM3, .pipe, error
Cc: Platform: <select platform>
UM Version: 4.5

Description

Hi,

I'm trying to run a HadCM3 simulation (xfzm-n on PUMA), and after 3 years of running it crashes with the following error in the .leave file:

forrtl: severe (104): incorrect STATUS= specifier value for connected file, unit 8, file /nobackup/eejop/work/um/xfzmn/tmp/xfzmn.pipe

Image            PC                Routine   Line     Source
Hadley4.5.exec   00000000009977DD  Unknown   Unknown  Unknown
Hadley4.5.exec   00000000009962E5  Unknown   Unknown  Unknown
Hadley4.5.exec   000000000092D809  Unknown   Unknown  Unknown
Hadley4.5.exec   00000000008D921F  Unknown   Unknown  Unknown
Hadley4.5.exec   00000000008D8935  Unknown   Unknown  Unknown
Hadley4.5.exec   00000000008E3E27  Unknown   Unknown  Unknown
Hadley4.5.exec   0000000000478F2C  meanctl_  6409     meanctl1.f
Hadley4.5.exec   000000000041C0BC  u_model_  4687     u_model1.f
Hadley4.5.exec   000000000040B0D4  MAIN      3091     umshell1.f
Hadley4.5.exec   000000000040818C  Unknown   Unknown  Unknown
libc.so.6        0000003037C1D994  Unknown   Unknown  Unknown
Hadley4.5.exec   0000000000408099  Unknown   Unknown  Unknown

MPI process terminated unexpectedly

This has stumped everyone I have asked so far; none of them has seen this error before.

The xfzmn.pipe file is listed at the end of the .leave file as a created file, so I am not sure whether this is a failure to read or a failure to open. The .pipe file contains no important information; it simply reads

WAKE UP

Any help would be much appreciated,

Cheers

James

Change History (11)

comment:1 Changed 8 years ago by simon

  • Owner changed from um_support to simon
  • Status changed from new to assigned

Hi,

I need to look at the full output. It appears to be a problem with mknod.
What machine are you running on?

Simon.

comment:2 Changed 8 years ago by isssjp

Hi,

James is overseas (British Geographic Survey) at the moment and can't access the HelpDesk, so I will try to answer on his behalf. We are running on ARC1, which is the Leeds computing facility. We have ported the UM 4.5 scripts to ARC1 and have successfully run many different experiments before this.

I know that we have had many issues trying to get the ocean configuration working with this job, but I don't know whether this error is related.

Which output file do you want to see? I will try to find it for you.

What do you think the problem is with mknod? I will try to adjust the script/code accordingly. status=REPLACE implies that the file must already exist.

Does the pipe file have to exist before the model can continue? The pipe file is definitely mentioned in the directory listing, but we don't know where or when it was created.
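
One quick way to narrow this down is to check whether the .pipe path is a named pipe (as mknod would create) or an ordinary file. The following is only a minimal sketch in Python, not part of the UM scripts; `describe_path` is a hypothetical helper name, and the path to test is whatever the .leave file reports.

```python
import os
import stat


def describe_path(path):
    """Classify a path as 'missing', 'fifo', 'regular' or 'other'.

    Hypothetical helper for illustration only. os.stat() inspects the
    file's metadata without opening it, so it does not block on a FIFO.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISFIFO(mode):
        return "fifo"       # a named pipe, e.g. created by mknod or mkfifo
    if stat.S_ISREG(mode):
        return "regular"    # an ordinary file that merely has a .pipe name
    return "other"
```

Calling `describe_path` on the .pipe path just before and just after a run would show whether the file exists at that point and what kind of file it is.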

Thanks,
Steven

isssjp@…

comment:3 Changed 8 years ago by simon

I need the .leave file. I'm now not so sure that it's mknod. He doesn't seem to have post processing turned on, so it shouldn't matter. What's a little strange is that it fell over in the climate meaning section of the code, but climate meaning doesn't appear to be switched on in the umui.

Simon.

comment:4 Changed 8 years ago by isssjp

Hi,

The .leave file is too large to attach. Do you want me to copy it to somewhere on puma or hector?

Steven (isssjp@…)

comment:5 Changed 8 years ago by simon

Hi,

Just put it somewhere on puma.

Simon.

comment:6 Changed 8 years ago by isssjp

Hi,

It is on puma in
/tmp/jp_xfzmn000.xfzmn.d11203.t092526.leave

Steven.

comment:7 Changed 8 years ago by simon

Hi,

Climate meaning is turned off in the job on puma but is turned on for
the run. Has the umui config been changed since the run was done?
There seems to be something very strange happening with the I/O
just before the crash. The climate meaning is set up for 10 day
dumping when the model has a thirty day dump. The crash after 3 years
is consistent with entering the 4th climate meaning period and somehow
failing when processing it.

I suggest rerunning making sure that climate meaning is turned off for the
atmos and ocean, or if it is required changing the dump frequency to 10 days
for both.
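
The consistency condition described above can be written down as a small check. This is only a sketch in Python under stated assumptions: `check_components` and the settings layout are hypothetical and are not how the umui or the UM stores these values. The atmosphere figures (30 day dumps, meaning set up for 10 day dumps) are the ones quoted above; the ocean entry is made up for illustration.

```python
def check_components(settings):
    """Report components whose climate meaning assumes a different dump period.

    settings maps a component name to a pair
    (model_dump_days, meaning_dump_days), where meaning_dump_days is None
    when climate meaning is switched off for that component.
    Returns a list of human-readable problems; empty means consistent.
    """
    problems = []
    for name, (model_dump_days, meaning_dump_days) in settings.items():
        if meaning_dump_days is None:
            continue  # meaning is off, so there is nothing to compare
        if model_dump_days != meaning_dump_days:
            problems.append(
                f"{name}: meaning assumes {meaning_dump_days} day dumps "
                f"but the model dumps every {model_dump_days} days"
            )
    return problems


# Atmosphere values as described above; the ocean entry is hypothetical.
print(check_components({"atmos": (30, 10), "ocean": (30, None)}))
```

Either remedy suggested above, turning meaning off or setting both dump periods to 10 days, would make this check return an empty list.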

Simon.

comment:8 Changed 8 years ago by pliojop

Hi,

I set the model to irregular dumping and no meaning for both atmos and ocean components, and the model now ran for 6.1 years (88008 timesteps) before falling over with a new error.
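
As a rough cross-check of those two figures: 88008 timesteps come to about 6.1 model years if one assumes 40 timesteps per day and a 360-day model year. Both values are assumptions here; the 40 steps per day is inferred from the numbers quoted, not read from the job. `timesteps_to_years` is a hypothetical helper.

```python
def timesteps_to_years(timesteps, steps_per_day, days_per_year=360):
    """Convert a timestep count to model years.

    steps_per_day and days_per_year are assumptions supplied by the
    caller; they are not taken from the job definition.
    """
    return timesteps / (steps_per_day * days_per_year)


# 88008 steps at an assumed 40 steps/day on a 360-day calendar
print(round(timesteps_to_years(88008, 40), 1))
```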

The .leave file for this can now be found in the /tmp directory:

xfzmn000.xfzmn.d11222.t101522.leave

Cheers

James

comment:9 Changed 8 years ago by pliojop

Hi

I have continued to work on this issue.

After the irregular dumping, I turned all dumping and meaning off; the model ran for 28.9 years before dying from some numerical instability of unknown origin. With no dumps, however, it was impossible to resubmit it from 28 years.

I then turned to the above suggestion of equalising the dumping times, so I have run:

Dump frequency (ocean and atmosphere)   Died after   Error
360 day dumps                           3.6 years    .pipe error
720 timestep dumps                      0.3 years    .pipe error
720 day dumps                           12 years     .pipe error
30 day dumps                            0.3 years    .pipe error

I have resubmitted the 720 day dump simulation, using the 10 year dumps. It has run past its 2nd year (the 12th overall for the simulation). I am, however, expecting it to die at about 7pm tonight, at year 12 of that run (22 in total).

I have put all the .leave files on PUMA in /home/pliojop/leave, with each run's .leave file identified as above.

Any further suggestions would be greatly appreciated.

Many Thanks

James

comment:10 Changed 8 years ago by simon

Hi,

I'm afraid I will be away until next week. I will look at this then.

Simon.

comment:11 Changed 7 years ago by ros

  • Platform set to <select platform>
  • Resolution set to fixed
  • Status changed from assigned to closed