Opened 5 years ago

Closed 5 years ago

#1304 closed error (answered)

Model failure in the second month of the run

Reported by: till
Owned by: um_support
Component: UM Model
Keywords: qsserver problem
Cc:
Platform: MONSooN
UM Version: 8.6

Description

Hi there,

I'm trying to run a coupled N96-ORCA1 configuration on MONSooN with updated NEMO settings (namelist and fpp keys file).

While the first month of XJPDL runs fine, it unfortunately fails at the beginning of the second month. The .leave file is /home/tkuhlb/output/xjpdl000.xjpdl.d14148.t130148.leave. It says:

qsserver: Waiting for command 10
ERROR: 0031-250 task 214: Segmentation fault
ERROR: 0031-250 task 213: Segmentation fault

followed by many more segmentation faults.
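To see how many tasks failed and which ones, the .leave file can be scanned for the error lines quoted above. The following is a minimal sketch, assuming every relevant line follows the same "ERROR: <code> task <n>: Segmentation fault" layout as the two lines shown; the function name `segfault_tasks` is hypothetical and not part of the UM scripts.

```python
import re

# Matches the message layout quoted in the ticket, e.g.
# "ERROR: 0031-250 task 214: Segmentation fault"
SEGFAULT_RE = re.compile(r"ERROR:\s+\S+\s+task\s+(\d+):\s+Segmentation fault")


def segfault_tasks(lines):
    """Return the task numbers that report a segfault, in order of first appearance."""
    seen = []
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            task = int(match.group(1))
            if task not in seen:
                seen.append(task)
    return seen


# The three lines quoted from the .leave file in this ticket.
sample = [
    "qsserver: Waiting for command 10",
    "ERROR: 0031-250 task 214: Segmentation fault",
    "ERROR: 0031-250 task 213: Segmentation fault",
]
print(segfault_tasks(sample))
```

For a real file, pass an open file handle instead of the sample list, for example `with open(path) as fh: segfault_tasks(fh)`.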

The purpose of qsserver is to archive data to another machine. It uses the files .next_command and $RUNID.requests. These contain:

.next_command:
%%% /projects/ukesm/tkuhlb/xjpdl/xjpdla.ph1978sep DELETE

$RUNID.requests:
WAKEUP
%%% xjpdla.pa1978sep ARCHIVE PPNOCHART
%%% xjpdla.pa1978sep DELETE
%%% xjpdla.pd1978sep ARCHIVE PPNOCHART
%%% xjpdla.pd1978sep DELETE
%%% xjpdla.pe1978sep ARCHIVE PPNOCHART
%%% xjpdla.pe1978sep DELETE
%%% xjpdla.ph1978sep ARCHIVE PPNOCHART
%%% xjpdla.ph1978sep DELETE

qsserver is called by qsatmos with the variable ${UM_COMMS_FILE}, which qsatmos sets as:
UM_COMMS_FILE=$UM_TMPDIR/$RUNID.comms.$$
Checking /scratch/tkuhlb/xjpdl/xjpdl.comms.4194882 shows that this file lists exactly the same files as $RUNID.requests above.

The strange thing is that /scratch/tkuhlb/xjpdl/xjpdl.comms.4194882 has nine lines, yet the segmentation faults appear while qsserver is waiting for command 10, which, as far as I understand, would be the 10th line of that file. Why doesn't qsserver realize that xjpdl.comms.4194882 has only 9 lines? Would you have a clue?
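The line count can be checked against the command number in the log message. The sketch below uses the $RUNID.requests contents quoted above and assumes that qsserver numbers commands from 1, one per line; that numbering is an assumption, not something confirmed from the qsserver source, and the helper names are hypothetical.

```python
import re


def parse_requests(lines):
    """Return (filename, action) pairs for lines starting with '%%%'; skip others such as WAKEUP."""
    requests = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "%%%":
            requests.append((parts[1], parts[2]))
    return requests


def waiting_command_number(log_line):
    """Extract N from 'qsserver: Waiting for command N', or None if the line does not match."""
    match = re.search(r"qsserver: Waiting for command (\d+)", log_line)
    return int(match.group(1)) if match else None


# Contents of $RUNID.requests as quoted in this ticket.
request_lines = [
    "WAKEUP",
    "%%% xjpdla.pa1978sep ARCHIVE PPNOCHART",
    "%%% xjpdla.pa1978sep DELETE",
    "%%% xjpdla.pd1978sep ARCHIVE PPNOCHART",
    "%%% xjpdla.pd1978sep DELETE",
    "%%% xjpdla.pe1978sep ARCHIVE PPNOCHART",
    "%%% xjpdla.pe1978sep DELETE",
    "%%% xjpdla.ph1978sep ARCHIVE PPNOCHART",
    "%%% xjpdla.ph1978sep DELETE",
]

total_lines = len(request_lines)
pending = waiting_command_number("qsserver: Waiting for command 10")
print(total_lines, len(parse_requests(request_lines)), pending)
# Under the numbering assumption, pending == total_lines + 1 means every
# existing line has been consumed and the server is idle, waiting for a new one.
print(pending == total_lines + 1)
```

Under that assumption, "Waiting for command 10" with a nine-line file would simply mean that all nine lines have been handled and qsserver is waiting for the model to write the next one, rather than reading past the end of the file.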

Some additional information that may help: at the very end of the ocean.output file (within the .leave file), I find:

dta_tsd_init : Temperature & Salinity data

Namelist namtsd
Initialisation of ocean T & S with T &S input data ln_tsd_init = T
damping of ocean T & S toward T &S input data ln_tsd_tradmp = F
===>>> : W A R N I N G
===============
dta_tsd_init: ocean restart and T & S data intialisation,
we keep the restart T & S values and set ln_tsd_init to FALSE

Many thanks for your help!

Change History (3)

comment:1 Changed 5 years ago by till

Hi there,
I just got a clue from Chris Harris regarding this error. It could be the NEMO namelist setting for the frequency of the restart files. I'm now trying a different setting for this frequency and hope this solves the problem. I'll update this ticket with the result.
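The ticket does not record which value was wrong or what it was changed to. As an illustration only, one way a restart frequency can be checked is to test whether it lines up with the end of each run segment. The sketch below assumes 30-day monthly segments (a 360-day calendar), an ocean time step of 3600 s, and a restart frequency given in time steps (in NEMO 3.x namelists this is commonly the `nn_stock` entry of `namrun`); all of these values and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def steps_per_segment(days_per_segment, timestep_seconds):
    """Number of ocean time steps in one run segment."""
    total_seconds = days_per_segment * SECONDS_PER_DAY
    if total_seconds % timestep_seconds != 0:
        raise ValueError("segment length is not a whole number of time steps")
    return total_seconds // timestep_seconds


def restart_at_segment_end(days_per_segment, timestep_seconds, restart_every_steps):
    """True if a restart file would be written exactly at the end of each segment."""
    n_steps = steps_per_segment(days_per_segment, timestep_seconds)
    return n_steps % restart_every_steps == 0


# Hypothetical values: 30-day segments with a 3600 s time step give 720 steps.
print(restart_at_segment_end(30, 3600, 720))   # restart every 720 steps
print(restart_at_segment_end(30, 3600, 1000))  # restart every 1000 steps
```

If the second month needs a restart file written at the end of the first, a frequency that does not divide the segment length evenly would leave nothing to restart from at that point.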

comment:2 Changed 5 years ago by till

Hi there,
this ticket can be closed now - Chris Harris' clue was the right one.
Cheers
Till

comment:3 Changed 5 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed