Opened 3 years ago

Closed 3 years ago

#2397 closed help (fixed)

Coupled model failing after DAC login problem

Reported by: mvguarino Owned by: um_support
Component: UM Model Keywords: RDF, BiCGSTAB, convergence
Cc: Platform: ARCHER
UM Version: 10.7



Yesterday morning I realised that my suite u-au022 stopped running last Friday (9 Feb) at cycle 18840401T0000Z.
On Friday, problems with the DAC login node were reported (indeed, I couldn't log into the RDF), and the rose suite is set to store its outputs on the RDF.

I restarted the suite, but the coupled task failed, reporting a problem with ./amos.exe, even though postproc_atmos ran successfully and I could find the corresponding model outputs in History.
I tried a warm restart from the cycle where the suite stopped:
rose suite-run --warm 18840401T0000Z
as that didn’t work, I then tried a warm restart from the penultimate cycle:
rose suite-run --warm 18840301T0000Z
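For reference, the penultimate cycle point used above can be derived from the failed one. A minimal sketch, assuming GNU date is available on the login node (the cycle names are the ones from this suite; the rose command itself is only shown as a comment):

```shell
# Sketch: derive the penultimate monthly cycle point for a warm restart.
# Assumes GNU date (as on typical Linux login nodes); cycle points are ISO 8601.
failed=18840401T0000Z
d=${failed%%T*}                                     # 18840401
prev=$(date -u -d "${d:0:4}-${d:4:2}-01 -1 month" +%Y%m01T0000Z)
echo "$prev"                                        # 18840301T0000Z
# rose suite-run --warm "$prev"                     # (not executed here)
```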

The coupled task still fails every time it runs. This is the job.err file:

Any idea why this is happening?
Could this indeed be linked to the DAC login node problems of last Friday?

Thank you,


Change History (13)

comment:1 Changed 3 years ago by grenville

Please change ARCHER permissions so we can read files

chmod -R g+rX /home/n02/n02/<your-username>
chmod -R g+rX /work/n02/n02/<your-username>

comment:2 Changed 3 years ago by mvguarino


You should now be able to read files in both home/ and work/.



comment:3 Changed 3 years ago by grenville

Vittoria - the job.err file explains the error: this is a numerical stability problem and quite difficult to track down. As a first step, try switching on extra diagnostic output (we did that in the training), then do a rose suite-run --restart and retrigger the failed task.

Error code: 1

? Error from routine: EG_BICGSTAB
? Error message: Convergence failure in BiCGstab, omg is NaN
? This is a common point for the model to fail if it
? has ingested or developed NaNs or infinities

comment:4 Changed 3 years ago by mvguarino

Hi Grenville,

I tried a couple of things in the meantime, and every time it failed for a different reason; now it is a NaNs problem, but at least it runs again until 18840401T0000Z, where it stopped the first time.
I was thinking of restarting the model after perturbing the latest UM restart dump, as described here:

However, I was wondering if I should restart from January (for climate meaning) or from the cycle where it stopped (18840401T0000Z). Could you advise?



comment:5 Changed 3 years ago by mvguarino

Update: I can't find the Mule script '' on ARCHER.
I am looking into /work/y07/y07/umshared/lib/python2.7/ , should I look somewhere else?



P.S. I also restarted the suite with extra diagnostic messages for the atmosphere, but job.err didn't change.

Last edited 3 years ago by mvguarino (previous) (diff)

comment:6 Changed 3 years ago by willie

  • Keywords RDF, BiCGSTAB, convergence added; RDF removed
  • UM Version changed from <select version> to 10.7

Hi Vittoria,

I have copied the moci tools to /home/n02/n02/wmcginty. I was using these for another BiCGSTAB problem (#2386), where they didn't work. Since your run is failing a good way through, I would advise halving the time step in the first instance, before trying the perturbation method.


comment:7 Changed 3 years ago by mvguarino

Hi Willie,

I have 72 atmosphere time steps per day and 32 ocean/sea-ice time steps per day.
Do you suggest halving both of them?


comment:8 Changed 3 years ago by mvguarino

Also, I don't seem to have permission to access /home/n02/n02/wmcginty.

comment:9 Changed 3 years ago by willie

Hi Vittoria,

Since the problem occurs in the atmosphere model, I would just halve the time step there.

I have changed permissions for you.
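For concreteness, halving the atmosphere time step doubles the step count per day. With the numbers quoted above (72 atmosphere steps/day), a quick sanity check in plain shell arithmetic (a sketch, not suite configuration):

```shell
# 72 atmosphere steps/day -> time step length in seconds; halving the
# time step doubles the number of steps per day.
secs_per_day=86400
steps=72
ts=$((secs_per_day / steps))            # 1200 s per step
half_ts=$((ts / 2))                     # 600 s per step
new_steps=$((secs_per_day / half_ts))   # 144 steps/day
echo "$ts $half_ts $new_steps"          # 1200 600 144
```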


comment:10 Changed 3 years ago by mvguarino

Thank you.
I went ahead with the perturbation method:

au022a.da18840401_00_orig --output ./au022a.da18840401_00

I also made sure the suite would restart from the start of the cycle by modifying the au022.xhist file etc. (see here:

I now get this error:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 2
?  Error from routine: U_MODEL_4A
?  Error message: ACUMPS1: Partial sum file inconsistent. See Output
?  Error from processor: 2
?  Error number: 210

However, I don't know what a 'partial sum file' is, or what to do about it.



comment:11 Changed 3 years ago by mvguarino


I solved the problem as follows:

Using the moci tool '' I perturbed the January restart dump of the year in which the cycle failed: au022a.da18840101_00_orig --output ./au022a.da18840101_00

I then restarted the model with an NRUN (rose suite-run) using consistent UM, CICE and NEMO restart dumps.

Nonetheless, I still got the "partial sum file inconsistent" error because, I found out, the previous partial sum files had not been overwritten by the new cycle (they were still the ones created on 9 February, when the simulation crashed the first time).

I manually deleted all the suiteID_s* and cycleID_suiteID_s* files in History_Data/ and restarted the suite again from January.
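The clean-up step above amounts to globbing and deleting the stale partial-sum files while leaving the restart dumps alone. A minimal sketch in a throwaway directory standing in for History_Data/ (file names here are illustrative, not the exact ones from this suite; always ls before rm):

```shell
# Demo of the partial-sum clean-up in a temporary directory standing in
# for History_Data/ (names illustrative; inspect with 'ls' before 'rm').
demo=$(mktemp -d)
touch "$demo/au022a_s1" "$demo/au022a_s2" \
      "$demo/18840101T0000Z_au022a_s1" \
      "$demo/au022a.da18840101_00"           # restart dump: must survive
rm -f "$demo"/au022a_s* "$demo"/*_au022a_s*  # delete partial-sum files only
ls "$demo"                                   # au022a.da18840101_00
```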

The suite is running now.



P.S. I am not sure the DAC login node problem had anything to do with it; it may have been just a coincidence. I tried to change the title of the ticket, but I can't.

Last edited 3 years ago by mvguarino (previous) (diff)

comment:12 Changed 3 years ago by grenville


That's great!


comment:13 Changed 3 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed