Opened 3 months ago

Closed 3 months ago

#3126 closed help (fixed)

Reconfiguration failure

Reported by: amenon
Owned by: um_support
Component: UM Model
Keywords: reconfiguration error, ancil file mismatch
Cc:
Platform: ARCHER
UM Version: 10.9

Description

Hi,

One of my suites (u-bq450 on ARCHER) is failing at the reconfiguration job. This is the warning I am getting:

??????????????????????????????      WARNING       ??????????????????????????????
?  Warning code: -10
?  Warning from routine: ANCIL_CHECK_GRID_STAGGER
?  Warning message: Ancil file mismatch in fixed header(9) grid stagger value
?          Model grid stagger = 6
?          Ancil file grid stagger = 2
?          Ancil file path = /work/y07/y07/umshared/ancil/atmos/n1280e/hydrol_lsh/hydro1k/v1/qrparm.hydtopsd
?          PLEASE READ - this warning will be converted to an error
?          in future. Please update ancil file to specify the correct
?          grid stagger value.
?  Warning from processor: 0
?  Warning number: 3
????????????????????????????????????????????????????????????????????????????????

I am running this suite for the year 2013, and I created the land ancillaries for it using the ancil generation suite. I have created ancils this way before and run simulations with them successfully. Compared with my earlier ancillaries, the new ones have the domain shifted north by 5 degrees. I found a similar ticket, #3013, on this issue, but I cannot work out the exact reason for my error. The job.out file is in /home/amenon/cylc-run/u-bq450/log/job/20130612T0000Z/glm_um_recon1/04/.

Thanks,
Arathy

Change History (7)

comment:1 Changed 3 months ago by grenville

Arathy

This is just a warning; the actual error is

OOM killer terminated this process.

which means it ran out of memory.

Did this exact job reconfigure successfully before you switched to the new land ancillaries?
Grenville

comment:2 Changed 3 months ago by amenon

Hi Grenville,

That's good to know. In this suite, the reconfiguration is failing in the first cycle itself. But this exact job reconfigured successfully in the suite u-bn032 (from which this suite is copied) with the old land ancillaries. The only difference in this suite compared to u-bn032 is the new land ancillaries and the different start/end dates.

Thanks,
Arathy

comment:3 Changed 3 months ago by grenville

Arathy

Am I correct in thinking that the driving global model is the same for u-bn032 and u-bq450, and that the new ancillaries are for the nested limited area only?

Please rerun with extra diagnostic messages on: set RCF_PRINTSTATUS to "Extra diagnostic messages".

Where do you specify the processor decomposition for the global reconfiguration in this suite?

Grenville

comment:4 Changed 3 months ago by amenon

Hi Grenville,

Yes, the driving global model is the same (n1280) for both u-bn032 and u-bq450. The new ancillaries are for the nested limited area only. In u-bn032, the central co-ordinates for LAM are (75,20) and for u-bq450, the central co-ordinates are (75,25).

The RCF_PRINTSTATUS is already set to Extra diagnostic messages in this suite.

The processor decomposition for the driving model is 16X32 in the driving model setup. I don't specify it separately for reconfiguration. Does reconfiguration use the same decomposition?

Arathy

comment:5 Changed 3 months ago by grenville

"The RCF_PRINTSTATUS is already set to Extra diagnostic messages in this suite."

Not so for the glm_um app in u-bq450: there it is set to PrStatus_Normal.

The reconfiguration processor decomposition is hard-wired to 4x3. Try increasing this to 8x6 by changing

{% set RCF_NPROCY = 4 %}
{% set RCF_NPROCX = 3 %}

in /home/amenon/roses/u-bq450/site/ncas-cray-xc30/suite-adds.rc
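
A sketch of the edited lines, assuming the suggested 8x6 maps onto the two variables in the same order as the existing 4x3 setting (that mapping is an assumption; the ticket does not record the exact values finally used):

```jinja
{# Assumed 8x6 reconfiguration decomposition: each value doubled from the 4x3 setting #}
{% set RCF_NPROCY = 8 %}
{% set RCF_NPROCX = 6 %}
```

With these values the reconfiguration would run on 48 processes instead of 12.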

Grenville

comment:6 Changed 3 months ago by amenon

Sorry Grenville, it was the RCF_Printstatus for UM that was set to Extra diagnostic messages. I got confused.

I increased the number of processors as you suggested and it worked. Many thanks. We can close this ticket now.

Arathy

comment:7 Changed 3 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Arathy

Memory usage must have been at the limit for one node (I assume).

Grenville
