Opened 6 years ago
Closed 6 years ago
#1451 closed help (answered)
Model long run crashed
Reported by: | fcentoni | Owned by: | annette |
---|---|---|---|
Component: | UKCA | Keywords: | |
Cc: | Platform: | ARCHER | |
UM Version: | 7.3 |
Description
Hi,
I have been running two jobs (xkzra, xkzrb), each with a 10-year run length, without any problems for the past 2 days.
At some point during one of the last runs, the model crashed with this error (see output files xkzra004.xkzra.d15027.t220300.leave and xkzrb004.xkzrb.d15027.t220708.leave):
UM ERROR (Model aborting) : Routine generating error: U_MODEL Error code: 1 Error message: DUMPCTL : Fail to open output dump - may already exist
Could you help me fix this? There are only 2 years left to run.
Many thanks.
Federico
Change History (7)
comment:1 Changed 6 years ago by fcentoni
comment:2 Changed 6 years ago by annette
Hi Federico,
Do you know what the problem was or what fixed it?
Annette
comment:3 Changed 6 years ago by fcentoni
Hi Annette,
It turned out that the model had produced a huge number of dump files, which caused the run to crash.
So I manually deleted most of the dump files and restarted the run from the last available start dump. It then ran to the end.
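The manual cleanup described above could be scripted along these lines. This is only a minimal sketch, not the procedure actually used in this ticket: the function name `prune_dumps`, the glob pattern argument, and the choice to rank files by modification time are all assumptions, and the real dump file naming for a given job should be checked before deleting anything. By default it only reports what it would remove.

```python
from pathlib import Path


def prune_dumps(directory, pattern, keep=2, dry_run=True):
    """Find dump files older than the newest `keep` matches.

    `pattern` is a glob chosen by the caller (hypothetical; check the
    actual dump file names in the job's data directory first).
    Files are ranked by modification time, oldest first.
    Nothing is deleted unless dry_run is False.
    """
    files = sorted(
        (p for p in Path(directory).glob(pattern) if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    # Everything except the newest `keep` files is considered stale.
    stale = files[:-keep] if keep > 0 else files
    for path in stale:
        if not dry_run:
            path.unlink()
    return stale
```

Calling it first with `dry_run=True` and inspecting the returned list is the safer route, since the most recent dump is needed to restart the run.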
However, I tried to restart the same base job (I made a copy of the original, so it is now called 'xkjti') with the same code and the same ancillaries, and the model crashes about 15 minutes into the run. It gives an error I am not familiar with at all. As you can see in the .leave file, the error is:
UM ERROR (Model aborting) :
Routine generating error: U_MODEL
Error code: 4
Error message: ACUMPS: Data corruption during I/O
*********************************************************************************
Rank 0 [Sun Feb 1 21:32:49 2015] [c2-0c1s14n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
_pmiu_daemon(SIGCHLD): [NID 00507] [c2-0c1s14n3] [Sun Feb 1 21:32:49 2015] PE RANK 0 exit signal Aborted
[NID 00507] 2015-02-01 21:32:49 Apid 12830340: initiated application termination
diff: /work/n02/n02/fcentoni/tmp/tmp.mom3.17090/xkjti.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/fcentoni/um/xkjti/xkjti.thist to backup thist file /work/n02/n02/fcentoni/um/xkjti/xkjti.thist_keep
xkjti: Run failed
I need to get my base job running as soon as possible, so your feedback would be much appreciated.
Many thanks.
Federico.
comment:4 Changed 6 years ago by fcentoni
Sorry,
I meant that I tried to rerun the same job (10 years run length) now called 'xkjti'.
Thanks,
Federico.
comment:5 Changed 6 years ago by annette
- Owner changed from um_support to annette
- Status changed from new to assigned
comment:6 Changed 6 years ago by annette
Federico,
I looked at this last week but there was no job 'xkjti' in the UMUI, either because it's not there or you've made it hidden.
Annette
comment:7 Changed 6 years ago by annette
- Resolution set to answered
- Status changed from assigned to closed
As this ticket has been inactive for some time, I'm going to close it, but you can re-open or post a new ticket if you have further issues.
Best regards,
Annette
Hi,
I have fixed the problem.
Thank you.
Federico.