Opened 3 years ago
Closed 3 years ago
#2452 closed help (worksforme)
Reconfiguration problem with UM
| Reported by: | MJohnston1 | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | |
| Cc: | | Platform: | |
| UM Version: | 10.9 | | |
Description
Hello CMS Helpdesk,
The UM_recon stage of my suites no longer completes. It previously finished in a few minutes, but it now fails with 'walltime exceeded' errors even when given a walltime of 3600 seconds.
I have encountered this issue with both suites u-ax016 and u-ax359.
Can you please advise on what might be happening here, and any possible solution?
Best Regards,
Michael Johnston
Attachments (1)
Change History (15)
comment:1 Changed 3 years ago by grenville
comment:2 Changed 3 years ago by grenville
Michael
Where does /work/n02/n02/xb899100/cylc-run/u-ax016/ancil/lsm come from? It has some very odd lat-long values.
The job.err file also indicates problems with lsm
Warning message: Ancil file mismatch in fixed header(9) grid stagger value
? Model grid stagger = 6
? Ancil file grid stagger = 3
? Ancil file path = /work/n02/n02/xb899100/cylc-run/u-ax016/ancil/lsm
? PLEASE READ - this warning will be converted to an error
? in future. Please update ancil file to specify the correct
? grid stagger value.
Grenville
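The grid stagger value named in the warning can be inspected directly from the file. The following is a minimal Python sketch, not an official tool: it assumes the ancil file is a 64-bit big-endian UM-format file whose fixed-length header is 256 words long, and that "fixed header(9)" in the warning is a 1-based index into that header. The function name read_grid_stagger is hypothetical.

```python
import struct

FIXED_HEADER_WORDS = 256  # assumed length of the fixed-length header, in 64-bit words
GRID_STAGGER_WORD = 9     # 1-based index quoted in the warning: fixed header(9)


def read_grid_stagger(path):
    """Return word 9 of the fixed-length header of a UM-format file.

    Assumes 64-bit big-endian integer words; 32-bit or little-endian
    files would need a different format string.
    """
    nbytes = FIXED_HEADER_WORDS * 8
    with open(path, "rb") as f:
        raw = f.read(nbytes)
    if len(raw) < nbytes:
        raise ValueError("file too short to contain a fixed-length header")
    header = struct.unpack(">%dq" % FIXED_HEADER_WORDS, raw)
    # Convert the 1-based header index to a 0-based tuple index.
    return header[GRID_STAGGER_WORD - 1]
```

Calling read_grid_stagger on the lsm ancil would be expected to return 3 if the warning is accurate, against a model grid stagger of 6.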
comment:3 Changed 3 years ago by MJohnston1
Hi Grenville,
Thanks for picking up this problem. I've used the same ancil with this suite without issue before, but I will check to make sure that this is not the issue.
Best Regards,
Michael
comment:4 Changed 3 years ago by MJohnston1
I made this ancil file myself. I took the land-sea mask from the end dump of a simulation with this domain setup that contained all sea points. I then converted it to netcdf, edited it, and used xancil to create the ancil file.
I can't seem to find where to edit the ancil file grid stagger. Is this in xancil?
comment:5 Changed 3 years ago by grenville
What is the dump referred to above?
comment:6 Changed 3 years ago by MJohnston1
I don't think I saved the original dump, but it looks like this, except without any land points:
/work/n02/n02/xb899100/cylc-run/u-ax016/share/data/history/exp1_island_on_wrong_side_lol/ax016a_da10000101_06
comment:7 Changed 3 years ago by grenville
Michael
The ancil header mismatch is a red herring I think.
I'm not familiar with the idealised setup at later versions - what are RES_DX and RES_DY?
Have you tried increasing the number of processors for reconfiguration (suite info → Reconfiguration processor decomposition) to say 6x4 or more?
Grenville
comment:8 Changed 3 years ago by MJohnston1
It looks like RES_DX and RES_DY are the horizontal east-west and north-south resolutions for a cartesian grid. But as far as I can tell, they are ignored in favour of UM → namelist → UM Science Settings → Idealised → Initialisation → Horizontal Grid "delta_xi1" and "delta_xi2".
I will try increasing the number of processors for reconfiguration. It is currently set to 1x4; I will try 6x4.
comment:9 Changed 3 years ago by MJohnston1
For suite u-ax016, in Reconfiguration Processor Decomposition, I have set "Reconfiguration: Processes East-West" = 6, and I have left "Reconfiguration: Processes North-South" = 4.
This again fails, in the job.err I'm seeing:
=>> PBS: job killed: walltime 1213 exceeded limit 1200
aprun: Apid 30688840: Caught signal Terminated, sending to application
Terminated
Received signal TERM
_pmiu_daemon(SIGCHLD): [NID 03567] [c2-2c1s11n3] [Tue May 1 13:43:07 2018] PE RANK 10 exit signal Terminated
/work/n02/n02/xb899100/cylc-run/u-ax016/share/fcm_make/build-recon/bin/um-recon: line 118: 15937 Terminated rose mpi-launch -v $COMMAND
cylc (scheduler - 2018-05-01T13:43:08Z): CRITICAL Task job script received signal TERM at 2018-05-01T13:43:08Z
cylc (scheduler - 2018-05-01T13:43:08Z): CRITICAL failed at 2018-05-01T13:43:08Z
In the job.out file it seems that the last thing it output was:
Set to user const 18 ( Section 0 ) ( Stashcode 26 ) ROUGHNESS LENGTH AFTER TIMESTEP
I think I noticed that it failed at the same point when I allocated 86400s walltime. Is it worth attempting to run the suite without prescribing my roughness length?
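When comparing several failed attempts, it can help to pull the walltime used and the limit out of each job.err. The following is a minimal Python sketch that assumes the PBS kill message has the format seen in the job.err above; the function name walltime_overrun is hypothetical.

```python
import re

# Matches the PBS kill line as it appears in the job.err quoted in this ticket.
PBS_WALLTIME = re.compile(r"PBS: job killed: walltime (\d+) exceeded limit (\d+)")


def walltime_overrun(job_err_text):
    """Return (seconds_used, seconds_limit) from a job.err, or None if absent."""
    match = PBS_WALLTIME.search(job_err_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A job that is killed just past its limit however large the limit is made, as appears to be the case here, points to a hang rather than a slow reconfiguration.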
comment:10 Changed 3 years ago by MJohnston1
Switching off the prescribed surface roughness length seems to have allowed it to run successfully.
comment:11 Changed 3 years ago by grenville
Great - is there any good documentation on how the idealised model works now? Who knew that the grid spacing is now in meters, for example?
comment:12 Changed 3 years ago by MJohnston1
comment:13 Changed 3 years ago by grenville
Thanks for this.
Grenville
comment:14 Changed 3 years ago by MJohnston1
- Resolution set to worksforme
- Status changed from new to closed
No problem! I think that somehow a branch that I am using is interfering with the l_spec_z0 idealised option. I will need to revisit my branch to figure out a fix, but I think we can close the ticket.
Thank you for your help,
Michael
Michael
How does the failing suite differ from the successful one?
Grenville