Opened 5 years ago

Closed 5 years ago

#1350 closed help (fixed)

Job failure

Reported by: vanniere Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 7.3

Description

Hi,

I have some problems running a simulation on Archer (xkhjb), which is almost identical to a simulation that was completed on Hector (xiypa). The former is a copy of the latter, in which I made all the necessary changes, following http://cms.ncas.ac.uk/wiki/Archer.

The run xkhjb stops with NaNs after roughly 9 hours at time step 132. I have increased the number of time steps (from 120 to 360 ts/day) but the job is still failing.

I can't figure out the cause of the failure.
Thanks for your help.
Regards,

Benoit

The .leave file is in /home/n02/n02/vanniere/output/
Output files for 360 ts/day are in /work/n02/n02/vanniere/xkhjb

Change History (10)

comment:1 Changed 5 years ago by willie

Hi Benoit,

The model becomes unstable at time step 131,

  GCR( 2 ) failed to converge in  200  iterations. 

This is sometimes resolved by halving the time step.
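For reference, the time step length and the model time reached at a given step follow directly from the number of time steps per day. A minimal Python sketch of that arithmetic (the function names are illustrative and not part of the UM):

```python
SECONDS_PER_DAY = 86400

def timestep_seconds(steps_per_day):
    """Length of one model time step in seconds."""
    return SECONDS_PER_DAY / steps_per_day

def elapsed_hours(step, steps_per_day):
    """Model time in hours reached after `step` time steps."""
    return step * timestep_seconds(steps_per_day) / 3600.0

# Halving the time step means doubling the number of steps per day.
for steps_per_day in (120, 240, 360):
    print(steps_per_day, "ts/day ->", timestep_seconds(steps_per_day), "s per step")

# Model time at step 131 for the 360 ts/day run.
print(round(elapsed_hours(131, 360), 2), "hours")
```

At 360 ts/day each step is 240 s, so step 131 falls at about 8.7 hours of model time, which matches the reported failure after roughly 9 hours.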

Regards

Willie

comment:2 Changed 5 years ago by vanniere

Hi Willie,

Thanks for your answer.
I have tried several time steps, from 120 ts/day to 480 ts/day, and the model becomes unstable after 9 hours in every case.

Well before the simulation fails, the downward LW and SW flux outputs don't look right. Here are two output files, from the beginning of the simulation and after one hour respectively. The first looks fine but the second does not:
xconv /work/n02/n02/vanniere/xkhjb/xkhjba_pa000
xconv /work/n02/n02/vanniere/xkhjb/xkhjba_pa001

Do you have any idea what could be the cause?
Kind regards,
Benoit


comment:3 Changed 5 years ago by grenville

Benoit

When you say "which is almost identical to a simulation that was completed on Hector (xiypa)" - what does that mean? In what way do they differ?

Grenville

comment:4 Changed 5 years ago by vanniere

Hi Grenville,

I think that it is easier to compare xkhjb (which is not running) with xjbjb than with xiypa. xjbjb ran well.
I copied xjbjb, renamed the copy xkhjb, and made the following changes:

  • I made the changes related to Archer (machine name, override paths, …)
  • I compiled a new executable
  • I modified the number of North-South processors: 16 -> 12
  • I tried various combinations of time-stepping, including 120 ts/day as in xjbjb
  • I created a new start dump: xjbja -> xkhja (the new start dump looks perfectly fine)
  • I modified some STASH.

Thanks for your help,
Benoit

comment:5 Changed 5 years ago by grenville

Hi Benoit

Sorry for the slow reply - waiting for inspiration.

I can get your job to run OK by reducing the compiler optimisation level to -O0 for all routines. The trick now is to find which routine in particular is causing the problem and to keep that low level of optimisation on that routine only. Running with low optimisation throughout will be very inefficient.
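One way to narrow the search is to bisect the list of routines. The sketch below is only an illustration in Python: `runs_ok_with_low_opt` is a hypothetical callback standing in for "rebuild with just these routines at -O0, run the job, and report whether it completes", the routine names are placeholders, and the approach assumes a single routine is responsible.

```python
def find_culprit(routines, runs_ok_with_low_opt):
    """Return the one routine that has to stay at -O0.

    `runs_ok_with_low_opt(subset)` is a hypothetical callback: it should
    build with only `subset` at -O0 (everything else at the normal
    optimisation level), run the job, and return True if it completes.
    Assumes exactly one routine is responsible for the failure.
    """
    candidates = list(routines)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if runs_ok_with_low_opt(half):
            # The job survives, so the culprit is among the -O0 routines.
            candidates = half
        else:
            # The job still fails, so the culprit is in the other half.
            candidates = candidates[len(half):]
    return candidates[0]


# Stand-in for real builds: placeholder names and a pretend culprit.
routines = ["routine_%02d" % i for i in range(16)]
builds = []

def fake_run(subset):
    builds.append(list(subset))
    return "routine_11" in subset

print(find_culprit(routines, fake_run), "after", len(builds), "builds")
```

Each test halves the candidate list, so the number of rebuilds grows with the logarithm of the number of routines rather than linearly. If more than one routine is involved, the single-culprit assumption breaks and the result would need checking by hand.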

My copy of your job is xinox.

Playing with optimisations can be quite time-consuming. We can advise, but would hope that you could run the tests yourself.

Grenville

comment:6 Changed 5 years ago by grenville

That should be umui job xinoz.

Grenville

comment:7 Changed 5 years ago by vanniere

Dear Grenville,

Thanks for your answer.

I see that the problem is still there in xinoz:

In /work/n02/n02/grenvill/xinoz/xinoza_pa000, NET DN SW and NET DN LW look fine, but in /work/n02/n02/grenvill/xinoz/xinoza_pa001 they do not.
It is particularly visible on the output map of NET DN LW, where stripes appear.

Do you think that it could be related to the failure?

Many thanks,
Benoit

comment:8 Changed 5 years ago by grenville

I think these fields are only valid on radiation time steps. In your case radiation is only called 8x/day, so if you load up more times, you'll see good data at 0.125, 0.25, … days.
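As a rough check of which output times should carry valid radiation fields, here is a minimal Python sketch. It assumes the radiation calls are evenly spaced and aligned with the start of the run; the function names are illustrative only.

```python
from fractions import Fraction

def radiation_interval_days(calls_per_day):
    """Spacing between radiation calls, in days (exact fraction)."""
    return Fraction(1, calls_per_day)

def is_radiation_time(time_in_hours, calls_per_day):
    """True if the given model time coincides with a radiation call,
    assuming evenly spaced calls aligned with the start of the run."""
    time_in_days = Fraction(time_in_hours) / 24
    return time_in_days % radiation_interval_days(calls_per_day) == 0

# First few radiation times in days for 8 calls per day.
print([float(k * radiation_interval_days(8)) for k in range(1, 5)])

# Whole hours in the first half day that coincide with a radiation call.
print([h for h in range(0, 13) if is_radiation_time(h, 8)])
```

With 8 calls per day the spacing is 3 hours, so output written after one hour does not fall on a radiation time.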

Grenville

comment:9 Changed 5 years ago by vanniere

Hi Grenville,

Thanks for your help. It is now running well.
I kept the optimisation level at -O0 for all routines, as the job is efficient enough that way.
I think you can now close this ticket.

Benoit

comment:10 Changed 5 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed