Opened 4 months ago

Closed 3 months ago

#2366 closed help (fixed)

Error in coupled HadGEM3.1 run

Reported by: ajd
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 10.7

Description

Hello CMS,

One of my coupled runs just failed with the following error (suite-id u-at674 running on MonSooN):

????????????????????????????????????????????????????????????????????????????????
?????????????????????????????? WARNING ??????????????????????????????
? Warning code: -80
? Warning from routine: PRELIM
? Warning message:
? Field - Section:3, Item:353 ignored.
? Invalid pseudo-level type.
? Warning from processor: 0
? Warning number: 10
????????????????????????????????????????????????????????????????????????????????

Error [ template <typename T> CBufferIn& operator>>(CBufferIn& buffer, T& type)] : In file '/var/spool/jtmp/29721.xcs00.6tyRqP/XIOS/src/type/type_ref_impl.hpp', line 225 → Not enough data in buffer to unqueue the data.

Rank 1014 [Wed Jan 17 12:25:48 2018] [c9-2c2s9n0] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1014

Error [ template <typename T> CBufferIn& operator>>(CBufferIn& buffer, T& type)] : In file '/var/spool/jtmp/29721.xcs00.6tyRqP/XIOS/src/type/type_ref_impl.hpp', line 225 → Not enough data in buffer to unqueue the data.

Rank 1086 [Wed Jan 17 12:25:48 2018] [c9-2c2s11n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1086
Application 14239019 is crashing. ATP analysis proceeding…

????????????????????????????????????????????????????????????????????????????????
?????????????????????????????? WARNING ??????????????????????????????
? Warning code: -4
? Warning from routine: UKCA_SURFDDR
? Warning message: Surface resistance values not found for 4 species.
? Warning from processor: 0
? Warning number: 11
????????????????????????????????????????????????????????????????????????????????

atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 07279] [c9-2c2s11n3] [Wed Jan 17 12:30:50 2018] PE RANK 1086 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 07268] [c9-2c2s9n0] [Wed Jan 17 12:30:50 2018] PE RANK 1014 exit signal Aborted
[NID 07279] 2018-01-17 12:30:50 Apid 14239019: initiated application termination
[FAIL] run_model # return-code=137
2018-01-17T12:30:57Z CRITICAL - Task job script received signal EXIT

I am not really sure what to make of this error. All my other runs, which have pretty much the same setup, are currently running, so I don't think it's necessarily a problem with the suite.

Have you seen this before?

Many thanks,
Andrea

Change History (5)

comment:1 Changed 4 months ago by grenville

Andrea

That's not one I've seen before. The failure is in XIOS, which writes the NEMO data. Have you changed the NEMO output setup, and has the model previously output NEMO data successfully? If the answers are no and yes, then I suggest re-running the failed task.

Grenville
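The diagnosis above (the abort comes from XIOS rather than from the UM warnings) can be checked quickly on a long job log with a small triage script. This is only a minimal sketch: the function name `triage` and the regular expressions are assumptions based on the line formats in the excerpt pasted in this ticket, and other UM or XIOS versions may format their messages differently.

```python
import re

# Patterns modelled on the log excerpt in this ticket (an assumption, not an
# official log format specification).
WARNING_ROUTINE = re.compile(r"Warning from routine:\s*(\S+)")
XIOS_ERROR = re.compile(
    r"^Error \[.*?\] : In file '([^']+)', line (\d+)\s*(?:→|->)\s*(.*)$"
)
MPI_ABORT = re.compile(r"^Rank (\d+) .*application called MPI_Abort")


def triage(log_text):
    """Sort log lines into UM warning routines, XIOS errors and aborting ranks."""
    report = {"warning_routines": [], "xios_errors": [], "aborted_ranks": []}
    for raw_line in log_text.splitlines():
        line = raw_line.strip()

        match = WARNING_ROUTINE.search(line)
        if match:
            report["warning_routines"].append(match.group(1))
            continue

        match = XIOS_ERROR.match(line)
        if match:
            source_path, line_no, message = match.groups()
            # Keep only the file name; the build directory is machine specific.
            source_file = source_path.rsplit("/", 1)[-1]
            report["xios_errors"].append((source_file, int(line_no), message))
            continue

        match = MPI_ABORT.match(line)
        if match:
            report["aborted_ranks"].append(int(match.group(1)))

    return report


if __name__ == "__main__":
    import sys

    # Usage: python triage.py < job.out
    result = triage(sys.stdin.read())
    for key, values in result.items():
        print(key, values)
```

If the aborting ranks only appear alongside XIOS errors, that supports treating the UM warnings as unrelated to the crash.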

comment:2 Changed 4 months ago by ajd

Hi Grenville,

Thanks for your reply. No, I haven't changed anything in the NEMO output setup, and yes, the model has previously output NEMO data successfully (the other runs set up the same way also don't have a problem). I've had to start an NRUN from 1867 following these instructions: https://code.metoffice.gov.uk/trac/ukcmip6/wiki/FailuresEncountered#Twocoupledtasksareeitherrunningretryingorfailed so I had to change the start dumps, but not the output setup.

I've already tried re-running the failed task but it failed again straight away. Since the model has only run for two years since the NRUN I will try starting it again from that point.

Thanks,
Andrea


comment:3 Changed 4 months ago by ajd

Hi Grenville,

I've tried several things including re-running but I still get this error. Is there anything else I can try?

Thanks,
Andrea

comment:4 Changed 4 months ago by ajd

Hello,

The problem seems to have been solved with an NRUN and running the model with the --new option, which required re-building the model.

Thanks,
Andrea

comment:5 Changed 3 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed