Opened 3 years ago

Closed 3 years ago

#2428 closed help (answered)

Nesting suite problems

Reported by: amenon
Owned by: willie
Component: UM Model
Keywords: reconfiguration, domain size
Cc:
Platform: ARCHER
UM Version: 10.9


Posted from email:

I have a suite on ARCHER (u-av692), a trial suite that I am running for just one day to check that everything works. The LAM recon in that suite keeps failing with the error ‘wall clock limit exceeded’. As I mentioned to you earlier, I even tried a wall clock time of 3 hours and it still fails. I contacted Stu. He said it could be some other issue, and that it's better to sort this out with you.

Change History (4)

comment:1 Changed 3 years ago by willie

  • Keywords reconfiguration, GCOM collectives added
  • Owner changed from um_support to willie
  • Status changed from new to assigned
  • UM Version set to 10.9

Hi Arathy,

Try setting gcom_coll_limit to 1. It's in um → namelist → Top level .. → Parallel communications options. The reconfiguration shouldn't generally need more than 10 to 15 minutes.


comment:2 Changed 3 years ago by amenon

Hi Willie,

Thanks for the response. I tried setting gcom_coll_limit to 1. At the same time, I changed the wall clock limit back from 3 hours to 20 minutes. The suite still failed with the same error.


comment:3 Changed 3 years ago by willie

  • Keywords domain size added; GCOM collectives removed

Hi Arathy,

Your start dump is a whopping 1536x1152 points, but you have allocated only 12 processors to do the reconfiguration:

{% set RCF_NPROCY = 4 %}
{% set RCF_NPROCX = 3 %}

I would suggest increasing this so that there are about 80x80 points per processor. In suite-adds.rc, under Nested LAM initialisation, change the above to:

{% set RCF_NPROCY = 22 %}
{% set RCF_NPROCX = 12 %}

You may need to extend the wall clock limit to, say, 1 hour.
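
As a rough check of the arithmetic behind the two decompositions above, here is a minimal Python sketch. It assumes that 1536 is the east-west dimension (split by RCF_NPROCX) and 1152 the north-south dimension (split by RCF_NPROCY), and that the grid is divided as evenly as possible. The helper name `points_per_proc` is hypothetical, and the real UM decomposition (halos, uneven remainders) may differ in detail.

```python
import math


def points_per_proc(nx, ny, nprocx, nprocy):
    """Largest subdomain (columns, rows) when the grid is split as evenly as possible."""
    return math.ceil(nx / nprocx), math.ceil(ny / nprocy)


# Grid size quoted in the ticket; the X/Y orientation is an assumption.
NX, NY = 1536, 1152

for label, npx, npy in [("original", 3, 4), ("suggested", 12, 22)]:
    px, py = points_per_proc(NX, NY, npx, npy)
    print(f"{label}: {npx * npy} procs, up to {px} x {py} = {px * py} points each")
```

Under these assumptions the original 12 processors each hold up to 512 x 288 = 147456 points, while the suggested 264 processors each hold up to 128 x 53 = 6784 points, which is close to the 80x80 = 6400 target by area.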


comment:4 Changed 3 years ago by willie

  • Resolution set to answered
  • Status changed from assigned to closed