#2452 closed help (worksforme)

Reconfiguration problem with UM

Reported by: MJohnston1 Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version: 10.9

Description

Hello CMS Helpdesk,

I'm having a problem in which the UM_recon stage never completes. This stage previously completed in a few minutes, but it now fails with 'walltime exceeded' errors when given a walltime of 3600 seconds.

I have encountered this issue with both suites u-ax016 and u-ax359.

Can you please advise on what might be happening here, and any possible solution?

Best Regards,
Michael Johnston

Attachments (1)

diff_u-aw645_u-ax016.txt (12.4 KB) - added by MJohnston1 18 months ago.
difference between successful and failed suites


Change History (15)

comment:1 Changed 18 months ago by grenville

Michael

How does the failing suite differ from the successful one?

Grenville

Changed 18 months ago by MJohnston1

Attachment diff_u-aw645_u-ax016.txt added: difference between successful and failed suites

comment:2 Changed 18 months ago by grenville

Michael

Where does /work/n02/n02/xb899100/cylc-run/u-ax016/ancil/lsm come from? It has some very odd lat-long values.

The job.err file also indicates problems with the lsm file:

Warning message: Ancil file mismatch in fixed header(9) grid stagger value

? Model grid stagger = 6
? Ancil file grid stagger = 3
? Ancil file path = /work/n02/n02/xb899100/cylc-run/u-ax016/ancil/lsm
? PLEASE READ - this warning will be converted to an error
? in future. Please update ancil file to specify the correct
? grid stagger value.

Grenville
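
The warning above refers to word 9 of the file's fixed-length header. As a minimal sketch of how that value could be inspected directly, the following assumes the file stores its fixed-length header as 256 big-endian 64-bit integers; that layout is an assumption here, and a file written in 32-bit or little-endian form would need different arguments. The function name read_grid_stagger is hypothetical and not part of any UM tool.

```python
import struct

FIXED_HEADER_WORDS = 256  # assumed length of the fixed-length header, in words
GRID_STAGGER_WORD = 9     # "fixed header(9)" in the warning, counted from 1


def read_grid_stagger(path, word_bytes=8, byte_order=">"):
    """Return word 9 of the fixed-length header of a UM-style file.

    Assumes integer header words; defaults to 64-bit big-endian.
    """
    fmt_char = {8: "q", 4: "i"}[word_bytes]
    size = FIXED_HEADER_WORDS * word_bytes
    with open(path, "rb") as handle:
        raw = handle.read(size)
    if len(raw) < size:
        raise ValueError("file too short to hold a fixed-length header")
    header = struct.unpack(f"{byte_order}{FIXED_HEADER_WORDS}{fmt_char}", raw)
    # Convert the 1-based word number to a 0-based tuple index.
    return header[GRID_STAGGER_WORD - 1]
```

Calling read_grid_stagger on the ancil path quoted in the warning would be expected to return the "Ancil file grid stagger" value reported there, provided the layout assumption holds.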

comment:3 Changed 18 months ago by MJohnston1

Hi Grenville,

Thanks for picking up this problem. I've used the same ancil with this suite without issue before, but I will check to make sure that this is not the issue.

Best Regards,
Michael

comment:4 Changed 18 months ago by MJohnston1

I made this ancil file myself. I took the land-sea mask from the end dump of a simulation with this domain setup that contained all sea points. I then converted it to NetCDF, edited it, and used xancil to create the ancil file.

I can't seem to find where to edit the ancil file grid stagger. Is this in xancil?

comment:5 Changed 18 months ago by grenville

What is the dump referred to above?

comment:6 Changed 18 months ago by MJohnston1

I don't think I saved the original dump, but it looks like this, except without any land points:

/work/n02/n02/xb899100/cylc-run/u-ax016/share/data/history/exp1_island_on_wrong_side_lol/ax016a_da10000101_06

comment:7 Changed 18 months ago by grenville

Michael

The ancil header mismatch is a red herring, I think.

I'm not familiar with the idealised setup at later versions - what are RES_DX and RES_DY?

Have you tried increasing the number of processors for reconfiguration (suite info → Reconfiguration processor decomposition) to, say, 6x4 or more?

Grenville

comment:8 Changed 18 months ago by MJohnston1

It looks like RES_DX and RES_DY are the horizontal east-west and north-south resolutions for a Cartesian grid. But as far as I can tell, they are ignored in favour of UM → namelist → UM Science Settings → Idealised → Initialisation → Horizontal Grid "delta_xi1" and "delta_xi2".

I will try increasing the number of processors for reconfiguration. It looks like it is currently on 1x4; I will try 6x4.

comment:9 Changed 18 months ago by MJohnston1

For suite u-ax016, in Reconfiguration Processor Decomposition, I have set "Reconfiguration: Processes East-West" = 6, and I have left "Reconfiguration: Processes North-South" = 4.

This again fails. In the job.err I'm seeing:

=>> PBS: job killed: walltime 1213 exceeded limit 1200
aprun: Apid 30688840: Caught signal Terminated, sending to application
Terminated
Received signal TERM
_pmiu_daemon(SIGCHLD): [NID 03567] [c2-2c1s11n3] [Tue May  1 13:43:07 2018] PE RANK 10 exit signal Terminated
/work/n02/n02/xb899100/cylc-run/u-ax016/share/fcm_make/build-recon/bin/um-recon: line 118: 15937 Terminated              rose mpi-launch -v $COMMAND
cylc (scheduler - 2018-05-01T13:43:08Z): CRITICAL Task job script received signal TERM at 2018-05-01T13:43:08Z
cylc (scheduler - 2018-05-01T13:43:08Z): CRITICAL failed at 2018-05-01T13:43:08Z

In the job.out file it seems that the last thing it output was:

Set to user const  18 ( Section   0 ) ( Stashcode  26 ) ROUGHNESS LENGTH AFTER TIMESTEP

I think I noticed that it failed at the same point when I allocated 86400s walltime. Is it worth attempting to run the suite without prescribing my roughness length?
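
As a rough sketch of the check described above, the snippet below pulls the walltime figures out of job.err text and reports the last non-blank line of job.out, so that runs with different walltime limits can be compared to see whether they stall at the same point. The function names are hypothetical, and the pattern is based only on the PBS message quoted in this comment.

```python
import re

# Matches the PBS message seen in job.err, e.g.
# "=>> PBS: job killed: walltime 1213 exceeded limit 1200"
WALLTIME_RE = re.compile(r"PBS: job killed: walltime (\d+) exceeded limit (\d+)")


def find_walltime_kill(job_err_text):
    """Return (used_seconds, limit_seconds) if PBS killed the job, else None."""
    match = WALLTIME_RE.search(job_err_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def last_nonblank_line(job_out_text):
    """Return the last non-empty line of job.out text, or None if there is none."""
    lines = [line.strip() for line in job_out_text.splitlines() if line.strip()]
    return lines[-1] if lines else None
```

If last_nonblank_line gives the same result for a 1200 s run and an 86400 s run, that points to the job hanging at that step rather than simply needing more time.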

comment:10 Changed 18 months ago by MJohnston1

Switching off the prescribed surface roughness length seems to have allowed it to run successfully.

comment:11 Changed 18 months ago by grenville

Great - is there any good documentation on how the idealised model works now? Who knew that the grid spacing is now in meters, for example?

comment:12 Changed 18 months ago by MJohnston1

I have been working with a combination of the documentation for vn10.7, 10.8, and 10.9. Together, they seem to give the most complete set of documentation that I am aware of for the idealised model at vn10.9.

comment:13 Changed 18 months ago by grenville

Thanks for this.

Grenville

comment:14 Changed 18 months ago by MJohnston1

  • Resolution set to worksforme
  • Status changed from new to closed

No problem! I think that somehow a branch that I am using is interfering with the l_spec_z0 idealised option. I will need to revisit my branch to figure out a fix, but I think we can close the ticket.

Thank you for your help,

Michael
