Opened 5 years ago

Closed 4 years ago

#1451 closed help (answered)

Model long run crashed

Reported by: fcentoni
Owned by: annette
Component: UKCA
Keywords:
Cc:
Platform: ARCHER
UM Version: 7.3

Description

Hi,

I have been running two jobs (xkzra, xkzrb), each with a 10-year run length, without any problems for the past 2 days.
At some point during one of the last runs, the model crashed with this error (see output files xkzra004.xkzra.d15027.t220300.leave and
xkzrb004.xkzrb.d15027.t220708.leave):

UM ERROR (Model aborting) :
 Routine generating error: U_MODEL
 Error code:  1
 Error message:
DUMPCTL : Fail to open output dump - may already exist

Could you help me fix this? There are only 2 years left to run.

Many thanks.
Federico

Change History (7)

comment:1 Changed 5 years ago by fcentoni

Hi,

I have fixed the problem.

Thank you.
Federico.

comment:2 Changed 5 years ago by annette

Hi Federico,

Do you know what the problem was or what fixed it?

Annette

comment:3 Changed 5 years ago by fcentoni

Hi Annette,

It turned out that the model had produced a huge number of dump files, which caused the run to crash.
So I manually deleted most of the dump files and restarted the run from the last start dump available. It then ran through to the end.
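
For anyone hitting the same problem, the manual clean-up described above could be sketched roughly as follows in Python. This is only an illustration, not the procedure actually used in this ticket: the function names, the data directory and the glob pattern for dump files are all hypothetical and must be adapted to the job in question. The sketch keeps the newest few dumps (by modification time) and defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path


def select_old_dumps(data_dir, pattern, keep=2):
    """Return files matching `pattern`, oldest first, excluding the `keep` newest."""
    # Sort by modification time so the most recent dumps come last.
    dumps = sorted(Path(data_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    if keep <= 0:
        return dumps
    return dumps[:-keep]


def prune_dumps(data_dir, pattern, keep=2, dry_run=True):
    """List (dry_run=True) or delete (dry_run=False) all but the newest `keep` dumps."""
    old = select_old_dumps(data_dir, pattern, keep)
    for path in old:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
    return old
```

A call such as prune_dumps("/path/to/job/data", "xkzraa.da*", keep=2) would first only print the candidates (both the path and the pattern here are assumed placeholders); passing dry_run=False performs the deletion. Check that the dump you intend to restart from is among those kept.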

However, I then tried to rerun the same base job (I made a copy of the original, so it is now called 'xkjti') with the same code and the same ancillaries, and the model crashes about 15 minutes into the run. It gives an error I am not familiar with at all. As you can see in the .leave file, the error is:

UM ERROR (Model aborting) :
 Routine generating error: U_MODEL
 Error code:  4
 Error message:
ACUMPS: Data corruption during I/O
 *********************************************************************************
Rank 0 [Sun Feb  1 21:32:49 2015] [c2-0c1s14n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
_pmiu_daemon(SIGCHLD): [NID 00507] [c2-0c1s14n3] [Sun Feb  1 21:32:49 2015] PE RANK 0 exit signal Aborted
[NID 00507] 2015-02-01 21:32:49 Apid 12830340: initiated application termination
diff: /work/n02/n02/fcentoni/tmp/tmp.mom3.17090/xkjti.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/fcentoni/um/xkjti/xkjti.thist to backup thist file /work/n02/n02/fcentoni/um/xkjti/xkjti.thist_keep
xkjti: Run failed

I need to get my base job running as soon as possible, so your feedback would be much appreciated.

Many thanks.
Federico.

comment:4 Changed 5 years ago by fcentoni

Sorry,

I meant that I tried to rerun the same job (10-year run length), now called 'xkjti'.

Thanks,
Federico.

comment:5 Changed 5 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

comment:6 Changed 5 years ago by annette

Federico,

I looked at this last week but there was no job 'xkjti' in the UMUI, either because it's not there or you've made it hidden.

Annette

comment:7 Changed 4 years ago by annette

  • Resolution set to answered
  • Status changed from assigned to closed

As this ticket has been inactive for some time, I'm going to close it, but you can re-open or post a new ticket if you have further issues.

Best regards,
Annette
