Opened 3 years ago
Closed 3 years ago
#2366 closed help (fixed)
Error in coupled HadGEM3.1 run
| Reported by: | ajd | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | |
| Cc: | | Platform: | Monsoon2 |
| UM Version: | 10.7 | | |
Description
Hello CMS,
One of my coupled runs just failed with the following error (suite-id u-at674 running on MonSooN):
????????????????????????????????????????????????????????????????????????????????
?????????????????????????????? WARNING ??????????????????????????????
? Warning code: -80
? Warning from routine: PRELIM
? Warning message:
? Field - Section:3, Item:353 ignored.
? Invalid pseudo-level type.
? Warning from processor: 0
? Warning number: 10
????????????????????????????????????????????????????????????????????????????????
Error [ template <typename T> CBufferIn& operator>>(CBufferIn& buffer, T& type)] : In file '/var/spool/jtmp/29721.xcs00.6tyRqP/XIOS/src/type/type_ref_impl.hpp', line 225 → Not enough data in buffer to unqueue the data.
Rank 1014 [Wed Jan 17 12:25:48 2018] [c9-2c2s9n0] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1014
Error [ template <typename T> CBufferIn& operator>>(CBufferIn& buffer, T& type)] : In file '/var/spool/jtmp/29721.xcs00.6tyRqP/XIOS/src/type/type_ref_impl.hpp', line 225 → Not enough data in buffer to unqueue the data.
Rank 1086 [Wed Jan 17 12:25:48 2018] [c9-2c2s11n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1086
Application 14239019 is crashing. ATP analysis proceeding…
????????????????????????????????????????????????????????????????????????????????
?????????????????????????????? WARNING ??????????????????????????????
? Warning code: -4
? Warning from routine: UKCA_SURFDDR
? Warning message: Surface resistance values not found for 4 species.
? Warning from processor: 0
? Warning number: 11
????????????????????????????????????????????????????????????????????????????????
atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 07279] [c9-2c2s11n3] [Wed Jan 17 12:30:50 2018] PE RANK 1086 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 07268] [c9-2c2s9n0] [Wed Jan 17 12:30:50 2018] PE RANK 1014 exit signal Aborted
[NID 07279] 2018-01-17 12:30:50 Apid 14239019: initiated application termination
[FAIL] run_model # return-code=137
2018-01-17T12:30:57Z CRITICAL - Task job script received signal EXIT
I am not really sure what to make of this error. All my other runs which have pretty much the same setup are currently running, so I don't think it's necessarily a problem with the suite.
Have you seen this before?
Many thanks,
Andrea
Change History (5)
comment:1 Changed 3 years ago by grenville
Andrea
That's not one I've seen before. It's a problem in XIOS, which writes NEMO data. Have you changed the NEMO output setup, and has the model successfully output NEMO data? If the answers are no and yes, then I suggest re-running the failed task.
Grenville
comment:2 Changed 3 years ago by ajd
Hi Grenville,
Thanks for your reply. No, I haven't changed anything in the NEMO output setup, and yes, the model has previously output NEMO data successfully (the other runs set up the same way don't have a problem either). I had to start an NRUN from 1867 following these instructions: https://code.metoffice.gov.uk/trac/ukcmip6/wiki/FailuresEncountered#Twocoupledtasksareeitherrunningretryingorfailed so I had to change the start dumps, but not the output setup.
I've already tried re-running the failed task, but it failed again straight away. Since the model has only run for two years since the NRUN, I will try starting it again from that point.
Thanks,
Andrea
comment:3 Changed 3 years ago by ajd
Hi Grenville,
I've tried several things, including re-running, but I still get this error. Is there anything else I can try?
Thanks,
Andrea
comment:4 Changed 3 years ago by ajd
Hello,
The problem seems to have been solved by starting an NRUN and running the model with the --new option, which required rebuilding the model.
Thanks,
Andrea
comment:5 Changed 3 years ago by willie
- Resolution set to fixed
- Status changed from new to closed
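For anyone triaging a similar failure, the short Python sketch below pulls the aborting MPI ranks and the XIOS source location out of job output like that pasted in the description. It is only an illustration: the function name `summarise_failure` and both regular expressions are assumptions based on the log lines quoted in this ticket, not part of the UM, XIOS, or Rose tooling, and other platforms may format these messages differently.

```python
import re

# Patterns are assumptions derived from the log lines quoted in this ticket.
ABORT_RE = re.compile(
    r"Rank (\d+) \[[^\]]*\] \[([^\]]+)\] application called MPI_Abort"
)
XIOS_RE = re.compile(r"In file '([^']*XIOS[^']*)', line (\d+)")


def summarise_failure(log_text):
    """Return the aborting (rank, node) pairs and XIOS (file, line) sites."""
    aborts = sorted(
        {(int(rank), node) for rank, node in ABORT_RE.findall(log_text)}
    )
    sites = sorted(
        {(path.rsplit("/", 1)[-1], int(line))
         for path, line in XIOS_RE.findall(log_text)}
    )
    return {"aborted_ranks": aborts, "xios_sites": sites}


# Sample input: three lines copied from the job output in the description.
sample = """\
Error [ template <typename T> CBufferIn& operator>>(CBufferIn& buffer, T& type)] : In file '/var/spool/jtmp/29721.xcs00.6tyRqP/XIOS/src/type/type_ref_impl.hpp', line 225 → Not enough data in buffer to unqueue the data.
Rank 1014 [Wed Jan 17 12:25:48 2018] [c9-2c2s9n0] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1014
Rank 1086 [Wed Jan 17 12:25:48 2018] [c9-2c2s11n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1086
"""

result = summarise_failure(sample)
print(result)
```

A summary like this only narrows down where the abort was raised (here, ranks reporting an XIOS buffer error); it does not diagnose the cause, which in this ticket was resolved by the NRUN and rebuild described in comment:4.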