Opened 10 years ago

Closed 10 years ago

#580 closed help (fixed)

interim history file deleted due to failure writing partial sum files

Reported by: jonathan
Owned by: lois
Component: UM Model
Keywords:
Cc:
Platform:
UM Version:

Description

Dear helpdesk

Do you know what could have caused my run xdnyp to crash with errors like this on phase2a in the fort6 output:

 MEANCTL: RESTART AT PERIOD_ 0
 U_MODEL: interim history file deleted due to failure writing partial sum files
 *********************************************************************************
 Model aborted with error code -2   Routine and message:-
 READDUMP: BAD BUFFIN OF DATA
 *********************************************************************************

This happened on 29th Jan, and today it's happened again, at a different point in the run. I suspect it might be to do with disk space, but I don't appear to be near my quota. What else might it indicate?

Thanks

Jonathan

Change History (4)

comment:1 Changed 10 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to accepted

Hello Jonathan,

There don't appear to be any quota problems, either for the n02 group or for you personally, but I can only see the daily summary, and there could be occasions (when the large CASCADE run is running) when NCAS jobs are competing for disk space. If your job runs now with no changes then everything should be OK, but if you record the date and time of any further problems we can ask HECToR, who may have more detailed disk statistics, to investigate.

Lois

comment:2 Changed 10 years ago by jonathan

Dear Lois

Problems continue with this job. It was running fine for hundreds of FAMOUS years until 27 Jan, and since then it has been constantly going wrong. I've done NRUNs several times from progressively earlier points to repeat parts which had worked before, but it doesn't work. In the latest two crashes the job timed out and was killed. Looking at the output listing and the dates of the dumps, it appears to be running at about half the speed it should, and that's why it timed out.

I wonder if this is connected with the behaviour of the /work filesystem, which now seems incredibly slow. For instance, it just took 3 min 40 sec to do ls -l /work/n02/n02/gregoryj/xdnyp, which contains about 70 files. Apart from the model not working, this behaviour is making it very difficult to manage my files.
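
For reference, the timing above was recorded with something along these lines (a rough sketch; I note the date either side so there is a record to pass on to HECToR):

 $ date                                      # timestamp before the listing
 $ time ls -l /work/n02/n02/gregoryj/xdnyp   # wall-clock time for the listing
 $ date                                      # timestamp after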

Have I missed some important instructions about the rearrangement of HECToR nodes or disks? I believe I am logging onto phase2a and submitting the job to phase2a. Is that right?

I'd be very grateful for any advice. This is perplexing.

Jonathan

comment:3 Changed 10 years ago by lois

  • Status changed from accepted to assigned

It looks as though disk jitter, the variable performance when writing data, is becoming a serious issue again. Before all the downtime when HECToR merged the two Lustre file systems attached to phases 2a and 2b, this jitter was as large as 40%, which is totally unacceptable.

I will raise the issue with both HECToR and Cray as a matter of urgency as you are not the only person who is experiencing serious problems.

Lois

comment:4 Changed 10 years ago by lois

  • Resolution set to fixed
  • Status changed from assigned to closed

The good news is that both HECToR and Cray have responded promptly to our complaint.

They are now monitoring jobs that fail by running out of time, since such failures may eventually be attributable to something like heavy disk activity. They are also monitoring /work usage more closely.

They are exploring the option of isolating some /work disk space so that heavy usage could be separated from light usage, to stop everyone suffering when very data-intensive jobs are running. This is not implemented yet and plans are still at an early stage.

HECToR are removing 'colouring' as the default for the ls command (you can get the same effect now by using \ls), which will hopefully speed up everyone's use of /work.
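
If you want to try this before the default changes, something like the following works in a bash shell on the login nodes (the alias shown is an assumption about how colouring is currently switched on; the backslash and 'command' forms simply bypass whatever alias is defined):

 $ type ls                                      # typically reports: ls is aliased to `ls --color=auto'
 $ \ls -l /work/n02/n02/gregoryj/xdnyp          # backslash skips the alias, so no colouring
 $ command ls -l /work/n02/n02/gregoryj/xdnyp   # equivalent way to bypass the alias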

Hopefully the monitoring and the small tweaks will help us all.

I will close this query now, but do keep a note of when things get bad again and we will ask HECToR to analyse their monitoring records.

Lois
