Opened 9 years ago

Closed 9 years ago

#626 closed error (fixed)

Error in vn6.1 HiGEM job: ACUMPS: Data corruption during I/O

Reported by: swrshaff Owned by: um_support
Component: UM Model Keywords: ACUMPS
Cc: Platform:
UM Version: 6.1

Description

Hi,

I'm getting the following error in the job xgaqg

UM ERROR (Model aborting) :

Routine generating error: U_MODEL
Error code: 4
Error message:

ACUMPS: Data corruption during I/O


Looking at the NCAS-CMS website this error appears to be associated with problematic STASH diagnostics. I've just changed large sections of the STASH (as this is a CMIP5 job) so is there a way of working out which STASH diagnostic is causing the problem?

I'm running with the pum_full_6.1.mf77 modset but that doesn't seem to be giving me any information on where the STASH might be going wrong.

Any thoughts on how to further diagnose the problem?

Many thanks

Len

Change History (11)

comment:1 Changed 9 years ago by swrshaff

I can't add the leave file as an attachment (it is too large), but it can be found on HECToR at

/home/n02/n02/lcs/um/umui_out/xgaqg000.xgaqg.d11143.t111556.leave

Len

comment:2 Changed 9 years ago by willie

Hi Len,

One problem is that the user STASH master file doesn't exist:


~/umui_jobs/pst/ajurx/atmos/userstashmaster_cloud_rad_diags_on_timesteps

If you do "check setup" in the UMUI it complains about several time profiles: TTRIFphn and T24HRMRV.

Regards,

Willie

comment:3 Changed 9 years ago by willie

Hi Len

It's on PUMA in

~umui/hadgem2/userstash/userstashmaster_cloud_rad_diags_on_timesteps

Regards,

Willie

comment:4 Changed 9 years ago by swrshaff

Thanks Willie,

I'll implement those changes and rerun the job. I'll let you know whether the problem recurs.

Len

comment:5 Changed 9 years ago by swrshaff

Hi,

I'm afraid that despite changing the STASH (removing the missing STASHmaster file and altering the errant time profiles) the ACUMPS error still remains.

Is there any way of further diagnosing the problem with STASH?

The new .leave file is on HECToR at

/home/n02/n02/lcs/um/umui_out/xgaqg000.xgaqg.d11144.t115936.leave

Many thanks

Len

comment:6 Changed 9 years ago by willie

  • Keywords ACUMPS added

Hi Luke,

Error 4, ACUMPS, means that the partial sum files have become corrupt. This can be due to incorrect initial data or to overwriting.

There are several things you can do.

  1. In the output choices panel, switch on STASH messages.
  2. Add my modset /home/n02/n02/wmcginty/flush.mf77 - this will give some output when the run crashes.
  3. Obviously, check your initial data.
  4. Resubmit the run using a previous dump.

Regards,

Willie

comment:7 Changed 9 years ago by lois

Hello Len,

I was just about to say the same as Willie: it is the checksum process for the climate meaning restart files which is causing the problem, though I'm not quite sure why.

Looking at your disk space, it is a bit limited, so I have upped your allocation in case you are so near the limit that this is causing the file problems.

Willie's suggestion is sound: why not try restarting from the last dump and see if the problem is still there? I hope that it will go through.

Lois

comment:8 Changed 9 years ago by swrshaff

Willie, Lois,

Many thanks for the suggestions. I'll see if the additional output provides some info on what is going wrong.

I'll keep you informed.

Best wishes

Len

comment:9 Changed 9 years ago by swrshaff

Just a quick update. Including the STASH messages in the job was a very useful thing to do. In the output there is the following message

WRITING UNIFIED MODEL DUMP ON UNIT 22
#####################################

Data successfully written
231852543 words written to unit 22
(Observational data)
MEANCTL: * Called in ATMOSPHERIC mode *

MEAN_OFFSET( 1 )= 3

MEANCTL: Period_ 4 mean not activated because of staggered start in means production
Period_1 data read from unit number 23
Period_1 data written to unit number 24
WARNING: checksum detects a corruption
in STASH section 26
item number 1

So it looks like 26 001 is causing the problem. I'll try removing/modifying this and rerunning xgaqg
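
For anyone hitting the same error with a very large .leave file, a small script along these lines could pull out the flagged section and item automatically. This is only a sketch: the function name find_corrupt_stash is made up for illustration, and the pattern assumes the warning is laid out exactly as in the output quoted above, which may not hold for other UM versions.

```python
import re
from typing import List, Tuple

# Assumption: the warning spans three lines, as in the output quoted above.
# \s+ allows for the line breaks and any leading spaces in the leave file.
CHECKSUM_RE = re.compile(
    r"checksum detects a corruption\s+"
    r"in STASH section\s+(\d+)\s+"
    r"item number\s+(\d+)"
)


def find_corrupt_stash(text: str) -> List[Tuple[int, int]]:
    """Return unique (section, item) pairs flagged by checksum warnings,
    in order of first appearance."""
    flagged: List[Tuple[int, int]] = []
    for match in CHECKSUM_RE.finditer(text):
        pair = (int(match.group(1)), int(match.group(2)))
        if pair not in flagged:
            flagged.append(pair)
    return flagged


# Example using the message from this ticket.
sample = """WARNING: checksum detects a corruption
in STASH section 26
item number 1
"""
for section, item in find_corrupt_stash(sample):
    print(f"section {section:02d} item {item:03d}")
```

To use it on a real file, read the leave file into a string (for example with open(path, errors="replace").read()) and pass that to find_corrupt_stash.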

comment:10 Changed 9 years ago by swrshaff

Running with a modified time profile for the offending STASH diagnostic resolved the ACUMPS problem with corrupted partial sums.

For reference, the solution was to rerun the model with the STASH messaging turned on to find out which STASH diagnostic was causing the problem.

Many thanks for your help in sorting this problem out.

Best wishes

Len

comment:11 Changed 9 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed