Opened 12 years ago

Closed 12 years ago

#128 closed help (fixed)

UM ERROR in U_MODEL (error code 4)

Reported by: alexrap
Owned by: jeff
Component: UM Model
Keywords:
Cc: alex@…
Platform:
UM Version:

Description

I'm getting an error for my xcrwl job that I cannot understand. I don't know what "error code 4" refers to, so it's very hard to know where to start debugging.

The only modifications from another job that completed fine are the additions of some new variables in a model mod. I think that I declared all of them properly.

Below are some lines from the .leave file, which I am also attaching to this ticket.

*

Job started at : Sat Apr 12 23:06:41 BST 2008
Run started from UMUI
Running from control files in /hpcx/home/n02/n02/alexrap/umui_runs/xcrwl-102172340

xcrwh with IWM diagnosticated
This job is running on machine l7f409,
using UM directory /hpcx/home/n02/n02/umx,
and test directory /hpcx/home/n02/n02/umx/umtest.
*

Starting script : qsexecute
Starting time : Sat Apr 12 23:06:42 BST 2008

*

/hpcx/tmpchkpt/jtmp/l1f401.331373.0/tmp/modscr_xcrwl/qsexecute: Executing setup

/hpcx/home/n02/n02/umx/vn6.1/normal/scripts/qssetup: Job terminated normally

/hpcx/tmpchkpt/jtmp/l1f401.331373.0/tmp/modscr_xcrwl/qsexecute: Executing dump reconfiguration program /hpcx/devt/n02/n02-ncas/alexrap/xcrwl/agodc.recon

ATTENTION: 0031-408 16 tasks allocated by LoadLeveler, continuing…
xcrwl: Starting run
ATTENTION: 0031-408 16 tasks allocated by LoadLeveler, continuing…

*
UM ERROR (Model aborting) :
Routine generating error: U_MODEL
Error code: 4
Error message:

ACUMPS: Data corruption during I/O

*

ERROR: 0031-250 task 0: IOT/Abort trap
ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 1: Terminated
ERROR: 0031-250 task 3: Terminated
ERROR: 0031-250 task 4: Terminated
ERROR: 0031-250 task 5: Terminated
ERROR: 0031-250 task 6: Terminated
ERROR: 0031-250 task 7: Terminated
ERROR: 0031-250 task 8: Terminated
ERROR: 0031-250 task 9: Terminated
ERROR: 0031-250 task 10: Terminated
ERROR: 0031-250 task 11: Terminated
ERROR: 0031-250 task 12: Terminated
ERROR: 0031-250 task 13: Terminated
ERROR: 0031-250 task 14: Terminated
ERROR: 0031-250 task 15: Terminated
diff: /hpcx/tmpchkpt/jtmp/l1f401.331373.0/tmp/xcrwl.xhist: A file or directory in the path name does not exist.
qsexecute: Copying /hpcx/devt/n02/n02-ncas/alexrap/xcrwl/xcrwl.thist to backup thist file /hpcx/devt/n02/n02-ncas/alexrap/xcrwl/xcrwl.thist_keep
xcrwl: Run failed
*

Ending script : qsexecute
Completion code : 134
Completion time : Sun Apr 13 02:11:46 BST 2008

*

Change History (7)

comment:1 Changed 12 years ago by jeff

  • Owner changed from um_support to jeff
  • Status changed from new to assigned

Hi Alex

You haven't attached the .leave file to the ticket; perhaps doing that is what made the web browser freeze. Could you tell me where the file is, and make sure I have permission to read it?

Jeff.

comment:2 Changed 12 years ago by alexrap

Hi Jeff,

The .leave file is

/hpcx/devt/n02/n02-ncas/alexrap/um/umui_out/xcrwl000.xcrwl.d08102.t171827.leave

Alex.

comment:3 Changed 12 years ago by jeff

I don't have access to directory /hpcx/devt/n02/n02-ncas/alexrap

chmod -R g+rX /hpcx/devt/n02/n02-ncas/alexrap

should fix the problem.
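
For anyone unsure what g+rX changes, here is a rough Python sketch of the same idea: group read is added everywhere, and group execute is added only to directories and to files that already have some execute bit. The function name add_group_read is made up for illustration, and this is only an approximation of chmod's behaviour, not a replacement for the command.

```python
import os
import stat


def add_group_read(root):
    """Approximate `chmod -R g+rX root` (hypothetical helper)."""
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH

    # Collect the root itself plus everything beneath it.
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    for path in paths:
        if os.path.islink(path):
            continue  # leave symbolic links alone
        mode = stat.S_IMODE(os.lstat(path).st_mode)
        new_mode = mode | stat.S_IRGRP  # g+r on everything
        # Capital X: execute/search only for directories, or for files
        # that already have an execute bit set for someone.
        if os.path.isdir(path) or mode & any_exec:
            new_mode |= stat.S_IXGRP
        os.chmod(path, new_mode)
```

The capital X is what keeps ordinary data files such as .leave output from being marked executable while still letting group members enter the directories.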

Jeff.

comment:4 Changed 12 years ago by alexrap

Done it.

comment:5 Changed 12 years ago by jeff

Towards the end of your .leave file there is this error message:

WARNING: checksum detects a corruption
in STASH section 2
item number 327
MEANCTL: RESTART AT PERIOD_ 0
U_MODEL: interim history file deleted due to failure writing partial sum files
*
UM ERROR (Model aborting) :
Routine generating error: U_MODEL
Error code: 4
Error message:

ACUMPS: Data corruption during I/O

*
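
Since .leave files are long, a small script can pull out the STASH section and item named in a warning like the one quoted above. This is only a sketch in Python: the pattern is based on the three lines as they appear in this ticket, the exact spacing in a real .leave file may differ, and find_corrupt_stash_fields is a made-up name.

```python
import re

# Pattern assumed from the warning as quoted in this ticket; \s+ tolerates
# differences in spacing and line breaks.
WARNING_RE = re.compile(
    r"WARNING: checksum detects a corruption\s+"
    r"in STASH section\s+(\d+)\s+"
    r"item number\s+(\d+)"
)


def find_corrupt_stash_fields(leave_text):
    """Return (section, item) pairs named in checksum warnings."""
    return [(int(sec), int(item)) for sec, item in WARNING_RE.findall(leave_text)]


sample = """WARNING: checksum detects a corruption
in STASH section 2
item number 327
MEANCTL: RESTART AT PERIOD_ 0
"""
print(find_corrupt_stash_fields(sample))  # [(2, 327)]
```

To use it on a real job, read the .leave file into a string and pass it to the function in place of sample.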

So the error is caused by STASH code 2,327, which is the user STASH field "Saturation presure (t_contr)". Is this one you added? Looking at your dumps, this field seems to contain only missing data; is that right? The error arises because the field is stored on disk together with a checksum, and when the field is read back in and the checksum recalculated, it doesn't agree with the stored value. I'm not sure why this happens; I would have to run your model with added print statements to find out.
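
To illustrate the general idea of a stored checksum being compared on read, here is a minimal Python sketch. It is not the UM's actual algorithm, which is not shown in this ticket: CRC32 is used purely as a stand-in, and write_field and read_field are hypothetical names.

```python
import struct
import zlib


def write_field(values):
    """Pack a field as 8-byte floats, prefixed by a CRC32 of the payload."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return struct.pack("<I", zlib.crc32(payload)) + payload


def read_field(record):
    """Unpack a field, recomputing the checksum and comparing it to the stored one."""
    (stored,) = struct.unpack("<I", record[:4])
    payload = record[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum detects a corruption")
    return list(struct.unpack(f"<{len(payload) // 8}d", payload))


record = write_field([1.0, 2.5])
print(read_field(record))  # [1.0, 2.5]

# Flip one byte to simulate the data changing between write and read.
damaged = bytearray(record)
damaged[-1] ^= 0xFF
try:
    read_field(bytes(damaged))
except ValueError as err:
    print(err)  # checksum detects a corruption
```

The point is only that any disagreement between the stored and recomputed values is reported as corruption, whatever the underlying cause turns out to be.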

As this field doesn't contain anything useful, you could turn it off in STASH, which should solve the problem. I suspect there is something wrong in the calculation of this field.

If you want me to look into this further let me know.

Jeff.

comment:6 Changed 12 years ago by alexrap

Yes, the user STASH field 2,327 is one of the newly added ones. I don't want to switch it off, as I need it output as a diagnostic for this run.

I have made some modifications to the model mod that calculates it and have now resubmitted the job.

I will let you know the outcome.

comment:7 Changed 12 years ago by jeff

  • Resolution set to fixed
  • Status changed from assigned to closed