Opened 2 years ago

Closed 2 years ago

#2452 closed help (worksforme)

Reconfiguration problem with UM

Reported by: MJohnston1
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: 10.9

Description

Hello CMS Helpdesk,

The UM_recon stage in my suites no longer finishes, and I can't work out why. This stage previously completed in a few minutes, but it now fails with 'walltime exceeded' errors even when given a walltime of 3600 seconds.

I have encountered this issue with both suites u-ax016 and u-ax359.

Can you please advise on what might be happening here, and any possible solution?

Best Regards,
Michael Johnston

Attachments (1)

diff_u-aw645_u-ax016.txt (12.4 KB) - added by MJohnston1 2 years ago.
difference between successful and failed suites


Change History (15)

comment:1 Changed 2 years ago by grenville

Michael

How does the failing suite differ from the successful one?

Grenville

Changed 2 years ago by MJohnston1

difference between successful and failed suites

comment:2 Changed 2 years ago by grenville

Michael

Where does /work/n02/n02/xb899100/cylc-run/u-ax016/ancil/lsm come from? It has some very odd lat-long values.

The job.err file also indicates problems with lsm:

Warning message: Ancil file mismatch in fixed header(9) grid stagger value

? Model grid stagger = 6
? Ancil file grid stagger = 3
? Ancil file path = /work/n02/n02/xb899100/cylc-run/u-ax016/ancil/lsm
? PLEASE READ - this warning will be converted to an error
? in future. Please update ancil file to specify the correct
? grid stagger value.

Grenville
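
The warning above refers to word 9 of the ancil file's fixed-length header. As a rough way to inspect that value directly, the sketch below reads the header and returns word 9. It assumes the file stores a 256-word fixed-length header as 64-bit big-endian integers; the function name read_grid_stagger is hypothetical and is not part of any UM tool.

```python
import struct

# Assumptions (not confirmed in this ticket): the fixed-length header is
# 256 words long and each word is a 64-bit big-endian signed integer.
FIXED_HEADER_WORDS = 256
BYTES_PER_WORD = 8


def read_grid_stagger(path):
    """Return word 9 (1-based) of the fixed-length header, the grid stagger."""
    n_bytes = FIXED_HEADER_WORDS * BYTES_PER_WORD
    with open(path, "rb") as f:
        raw = f.read(n_bytes)
    if len(raw) < n_bytes:
        raise ValueError("file is too short to hold a fixed-length header")
    header = struct.unpack(">%dq" % FIXED_HEADER_WORDS, raw)
    # The warning uses 1-based indexing, so header(9) is Python index 8.
    return header[9 - 1]
```

Calling read_grid_stagger on the lsm ancil path quoted in the warning would be expected to return 3 if the assumptions about the file layout hold, which could then be compared against the model value of 6.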

comment:3 Changed 2 years ago by MJohnston1

Hi Grenville,

Thanks for picking up this problem. I've used the same ancil with this suite without issue before, but I will check to make sure that this is not the issue.

Best Regards,
Michael

comment:4 Changed 2 years ago by MJohnston1

I made this ancil file myself. I took the land-sea mask from the end dump of a simulation with this domain setup that contained all sea points. I then converted it to NetCDF, edited it, and used xancil to create the ancil file.

I can't seem to find where to edit the ancil file grid stagger. Is this in xancil?

comment:5 Changed 2 years ago by grenville

What is the dump referred to above?

comment:6 Changed 2 years ago by MJohnston1

I don't think I saved the original dump, but it looks like this, except without any land points:

/work/n02/n02/xb899100/cylc-run/u-ax016/share/data/history/exp1_island_on_wrong_side_lol/ax016a_da10000101_06

comment:7 Changed 2 years ago by grenville

Michael

The ancil header mismatch is a red herring I think.

I'm not familiar with the idealised setup at later versions - what are RES_DX and RES_DY?

Have you tried increasing the number of processors for reconfiguration (suite info → Reconfiguration processor decomposition) to, say, 6x4 or more?

Grenville

comment:8 Changed 2 years ago by MJohnston1

It looks like RES_DX and RES_DY are the horizontal east-west and north-south resolutions for a Cartesian grid. But as far as I can tell, they are ignored in favour of UM → namelist → UM Science Settings → Idealised → Initialisation → Horizontal Grid "delta_xi1" and "delta_xi2".

I will try increasing the number of processors for reconfiguration. It looks like it is currently on 1x4; I will try 6x4.

comment:9 Changed 2 years ago by MJohnston1

For suite u-ax016, in Reconfiguration Processor Decomposition, I have set "Reconfiguration: Processes East-West" = 6, and I have left "Reconfiguration: Processes North-South" = 4.

This fails again. In the job.err I'm seeing:

=>> PBS: job killed: walltime 1213 exceeded limit 1200
aprun: Apid 30688840: Caught signal Terminated, sending to application
Terminated
Received signal TERM
_pmiu_daemon(SIGCHLD): [NID 03567] [c2-2c1s11n3] [Tue May  1 13:43:07 2018] PE RANK 10 exit signal Terminated
/work/n02/n02/xb899100/cylc-run/u-ax016/share/fcm_make/build-recon/bin/um-recon: line 118: 15937 Terminated              rose mpi-launch -v $COMMAND
cylc (scheduler - 2018-05-01T13:43:08Z): CRITICAL Task job script received signal TERM at 2018-05-01T13:43:08Z
cylc (scheduler - 2018-05-01T13:43:08Z): CRITICAL failed at 2018-05-01T13:43:08Z

In the job.out file it seems that the last thing it output was:

Set to user const  18 ( Section   0 ) ( Stashcode  26 ) ROUGHNESS LENGTH AFTER TIMESTEP

I think I noticed that it failed at the same point when I allocated 86400s walltime. Is it worth attempting to run the suite without prescribing my roughness length?
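
When comparing runs with different walltime limits, as above, it can help to pull the used and allowed walltime out of job.err automatically. The following is a minimal sketch that matches the PBS line format shown in the excerpt above; the function name find_walltime_kill is hypothetical, and other PBS versions may word the message differently.

```python
import re

# Pattern based only on the PBS line quoted in this ticket's job.err excerpt.
PBS_WALLTIME = re.compile(
    r"=>> PBS: job killed: walltime (\d+) exceeded limit (\d+)"
)


def find_walltime_kill(log_text):
    """Return (used_seconds, limit_seconds) if PBS killed the job, else None."""
    match = PBS_WALLTIME.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "=>> PBS: job killed: walltime 1213 exceeded limit 1200"
print(find_walltime_kill(sample))  # prints (1213, 1200)
```

A return value of None means no walltime kill was found in the text, so the failure would need to be investigated elsewhere in the log.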

comment:10 Changed 2 years ago by MJohnston1

Switching off the prescribed surface roughness length seems to have allowed it to run successfully.

comment:11 Changed 2 years ago by grenville

Great - is there any good documentation on how the idealised model works now? Who knew that the grid spacing is now in metres, for example?

comment:12 Changed 2 years ago by MJohnston1

I have been working with a combination of the documentation for vn10.7, 10.8, and 10.9. Together, they seem to give the most complete set of documentation that I am aware of for the idealised model at vn10.9.

comment:13 Changed 2 years ago by grenville

Thanks for this.

Grenville

comment:14 Changed 2 years ago by MJohnston1

  • Resolution set to worksforme
  • Status changed from new to closed

No problem! I think that somehow a branch that I am using is interfering with the l_spec_z0 idealised option. I will need to revisit my branch to figure out a fix, but I think we can close the ticket.

Thank you for your help,

Michael
