Opened 9 years ago

Closed 9 years ago

#771 closed help (wontfix)

Error reading dump headers

Reported by: mx020105
Owned by: willie
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: 6.6.3

Description

Hi helpdesk,

I have set up a job (xgvka) that was originally copied across from MONSooN. The job compiles successfully, but when I try to run it I get the following error message (see /home/n02/n02/mx020105/um/umui_out/xgvka000.xgvka.d12011.t095741.leave):

CLOSE: File /work/n02/n02/mx020105/xgvka/xgvka.astart Closed on Unit 21
gc_abort (Processor 225 ): Error reading dump header

which is then repeated for several processors.

I thought this might be a problem with the start dumps, but I have checked that both are 64-bit byte-swapped IEEE files, so I'm not sure that this is the cause.

Any ideas you have would be very helpful.

Many thanks,
Amanda

Change History (11)

comment:1 Changed 9 years ago by mx020105

Hi helpdesk,

Sorry to pester you, but has anyone had the chance to look at this? I haven't managed to make any further progress on the issue.

Thanks in advance,
Amanda

comment:2 Changed 9 years ago by willie

Hi Amanda,

Could you give me read permission on the core file please?

Regards,

Willie

comment:3 Changed 9 years ago by mx020105

Hi Willie,

Sorry to be dense, but what do you mean by core file?

Thanks
Amanda

comment:4 Changed 9 years ago by willie

Hi Amanda,

In your work directory for this job there is a file called 'core', which is (sometimes) produced when the program crashes. It can contain useful debugging information. I'll look at it with a debugger called TotalView:

totalview <path to exec> core
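
For someone else to open the core file like this, it has to be readable by other users. A minimal sketch of granting that access, assuming the core file sits directly in the job's work directory and that the parent directories can already be traversed; the function name share_core is hypothetical:

```shell
# Make the core file in a job's work directory readable by other users.
# Assumes the file is named 'core', as described above.
share_core() {
    local core="$1/core"
    if [ ! -f "$core" ]; then
        echo "no core file in $1" >&2
        return 1
    fi
    chmod a+r "$core"
}

# Example call (work directory taken from the log excerpt in the description):
# share_core /work/n02/n02/mx020105/xgvka
```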

Willie

comment:5 Changed 9 years ago by mx020105

Ah OK, I see. Sorry, I'm not up to date with the new system since the changeover to FCM, phase 3, etc.

You should have permissions now.

Thanks,
Amanda

comment:6 Changed 9 years ago by willie

Hi Amanda,

Yes, that was helpful: your ocean start dump appears to be corrupt. It crashes xconv when I try to read it. You'll need to re-create this, or copy it from its original source. You can do a checksum on a file by typing

sum -r <filename>

This can help verify the copy.
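
A minimal sketch of that comparison, assuming both the original and the copy are reachable from the same machine; the function name same_checksum is hypothetical. It reads each file on standard input so that only the checksum and block count are compared, not the file names:

```shell
# Compare the 'sum -r' checksum of an original file with that of its copy.
# Returns 0 if they match, 1 if they differ, 2 if a file cannot be read.
same_checksum() {
    local a b
    a=$(sum -r < "$1") || return 2
    b=$(sum -r < "$2") || return 2
    [ "$a" = "$b" ]
}

# Example call (<original> and <copy> stand for the two file paths):
# same_checksum <original> <copy> && echo "checksums match"
```

A matching checksum only shows that the copy is faithful; it says nothing about whether the original itself is a valid dump.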

Regards,

Willie

comment:7 Changed 9 years ago by willie

Hi Amanda,

I was wrong about the ocean dump. If you

export MALLOC_CHECK_=0

and then run xconv, you can see that the ocean dump is readable. The model run is certainly complaining about the ocean dump, but the dump itself seems OK. You could try reducing the number of processors to 8 EW x 12 NS.
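
The xconv step above can also be done without exporting the variable for the whole session. A minimal sketch, assuming a glibc-based system where MALLOC_CHECK_ is honoured; the wrapper name run_unchecked is hypothetical:

```shell
# Run a single command with MALLOC_CHECK_=0 in its environment.
# The variable is set only for that command, not for the calling shell.
run_unchecked() {
    env MALLOC_CHECK_=0 "$@"
}

# Example call (<ocean dump> stands for the path to the dump file):
# run_unchecked xconv <ocean dump>
```

Scoping the setting to one command avoids leaving it active for later model runs started from the same shell.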

There is also a UMUI check setup error: the start year needs to be 2101 in the HFC134A panel.

Regards,

Willie

comment:8 Changed 9 years ago by mx020105

Hi Willie,

I've made the changes you suggested and recompiled and re-run the code.

This time I get an error:

sys-2 : UNRECOVERABLE error on system request

No such file or directory

Encountered during an OPEN of unit 56
Fortran unit 56 is not connected

which repeats a couple of dozen times.

I've searched for this in the old helpdesk tickets, but it doesn't seem to have been noted there.
Any advice you have would be gratefully received.

Thanks,
Amanda


comment:9 Changed 9 years ago by willie

Hi Amanda,

We made a small change to the processor configuration and this has radically changed the type of error. Check setup is now complaining about broken codes in the user STASH epflux606, where it didn't before. I think it would be a good idea to review the differences between this job and Dann's xgpba, which is presumably working on HECToR. You can see the differences by using the UMUI's job difference facility: enter Dann's user name and yours separated by a comma (no spaces), highlight both jobs, and compare.

Unit 56 might be (!) associated with FLUXCORR via the file clfhist.h. (The error is repeated once for each processor.) This may be a clue.

If you do another run, it would be a good idea to delete the core file first, so that the old one does not prevent a fresh one from being written.
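
A minimal sketch of that clean-up step, assuming the core file is named 'core' and sits directly in the job's work directory; the function name clear_old_core is hypothetical:

```shell
# Delete a leftover core file from a job's work directory before resubmitting.
# Succeeds whether or not a core file was present.
clear_old_core() {
    local core="$1/core"
    if [ -f "$core" ]; then
        rm -f -- "$core" && echo "removed $core"
    fi
}

# Example call (work directory taken from the log excerpt in the description):
# clear_old_core /work/n02/n02/mx020105/xgvka
```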

I hope that helps.

Regards,

Willie

comment:10 Changed 9 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to assigned

comment:11 Changed 9 years ago by willie

  • Resolution set to wontfix
  • Status changed from assigned to closed