#2246 closed help (fixed)

N216 reconfiguration to Global

Reported by: sam89 Owned by: willie
Component: UM Model Keywords:
Cc: Platform: Monsoon2
UM Version: 8.2

Description (last modified by willie)

Hello

I have an N216 resolution start dump that I am trying to reconfigure to Global resolution. The reconfiguration itself worked, but I now need to run the resulting .astart file in a model run, since I need the output at Global resolution.
In the .leave file there are a number of MPI_Abort signals, followed by this error:

ATP Stack walkback for Rank 170 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:2370
  u_model_@u_model.f90:3720
  atm_step_@atm_step.f90:11009
  atmos_physics2_@atmos_physics2.f90:4336
  ni_conv_ctl_@ni_conv_ctl.f90:2305
  glue_conv$glue_conv_mod_@glue_conv-gconv4a.f90:2369
  ereport64$ereport_mod_@ereport_mod.f90:102
  gc_abort_@gc_abort.F90:136
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort@0x190606c
  MPI_Abort@0x192f884
  MPID_Abort@0x1959521
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 170 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 170, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

I looked in the pe_output and it points to glue_conv, which I believe indicates some sort of instability in the model. The only thing I had to change to get it to run with the N216 start dump was switching aerosol usage off, so I am wondering if this is why it is causing an error. Other than that I didn't need to change anything from the typical standard Global setup, so I am unsure how to fix the error.

All I am trying to do is reconfigure my N216 start dump at 03 hours to Global resolution so I can then run it as a Global job for 18 hours.

I am due to submit my thesis in just 3 months, so I am working to a really tight time frame; it would be great if you could help me.

Thanks
Sam
This is the .leave file:
xnojf000.xnojf.d17222.t133449.leave

Change History (20)

comment:1 Changed 22 months ago by willie

  • Description modified (diff)
  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Sam,

The problem is occurring in the first time step, and the root cause is "mid conv went to the top of the model", which indicates a problem in convection.

I have tried changing the ancillary files and the convection scheme to no avail: it still fails in the same way.

There are still some tests I could do, but it is looking difficult at the moment …

Regards
Willie

comment:2 Changed 22 months ago by willie

Hi Sam,

I now have it running for 81 time steps. This is my job xnowb. I am unable to say what I did to achieve this. I notice that you modified xnojy and xnojf this morning; this hasn't helped the analysis.

Regards
Willie

comment:3 Changed 22 months ago by sam89

Hi Willie

I realised I was not using many ancillaries for my xnojy reconfiguration job, so I added these in and it then seemed to run. I then ran xnojf and it seemed to run (I think). I got Sue to check it and she thinks it went OK too. I then used a dump from that run at 18 UTC to try to rerun the job for 5 days, as I am doing an ensemble run starting at 18 UTC and running for 5 days. I appear to be getting the same problem as above for this run, though, and I have no idea what is causing it, as I have diffed this job against xnojf and they seem the same.
xnpbc000.xnpbc.d17231.t181958.leave

I am trying to repeat this job 6 times for different ensemble members, and each of the jobs seems to be unstable, although the reconfigurations of the 18 UTC start dumps seem OK. It is all the UM jobs in my experiment xnpb.

I attempted to get two of the runs to work (xnpbc and xnpbj) and both fail with the same instability issue this ticket was started with.

Please can you check whether both xnojy and xnojf ran properly, and if so, why I now cannot get the xnpbc and xnpbj jobs to run (from the reconfiguration jobs xnpbb and xnpbe).

I thought I had fixed the instability, but I guess I hadn't, as it is still occurring for this run.

Thanks

Sam

comment:4 Changed 22 months ago by sam89

Also, I should add that I have been doing some Euro4km runs initialised from the output of xnojf, and those seem fine; it is just these resulting Global runs, which start at 18 UTC from the dump out at 18 UTC from xnojf (which I reconfigured to create a .astart file, then ran).

comment:5 Changed 22 months ago by sam89

I just looked at the pe_output and I cannot see anything obvious about this instability, except that every output file seems to end with

Atm Step: Lexpand_ozone T

STASH: Item 201 Section 4 required by stash
STASH: Item 202 Section 4 required by stash
STASH: Item 201 Section 5 required by stash
STASH: Item 202 Section 5 required by stash

Q_POS: unable to conserve on 29 levels

I don't know if this is relevant though.
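(For anyone wanting to search for these messages across all processors at once, here is a minimal Python sketch; the pe_output location and naming are assumptions, so adjust the glob pattern to wherever your job writes its output.)

import glob

# Messages that usually signal a developing instability in the UM output
PATTERNS = ("Q_POS", "failed to converge", "Mid conv went to the top")

for path in sorted(glob.glob("pe_output/*")):  # assumed layout
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if any(p in line for p in PATTERNS):
                print(f"{path}:{lineno}: {line.rstrip()}")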

comment:6 Changed 22 months ago by sam89

Had another look at the .leave file and it appears this is the issue:

==============================================
initial Absolute Norm : 65657009.208471425
GCR( 2 ) failed to converge in 500 iterations.
Final Absolute Norm : 0.2058444645717126
==============================================

Q_POS: unable to conserve on 29 levels

I tried copying the xnojy and xnojf jobs and rerunning them, changing just the start dump (to the 18 UTC dump from xnojf) and the start time to 18 UTC, and this same error occurs as above.

xnoji000.xnoji.d17232.t082932.leave (reconfiguration is xnojh).

I imagine it must be an issue with the .astart file, which means it is either a problem in the reconfiguration of the start dump xnojfa_da015 putting something wrong into the .astart file, or a problem with the start dump itself. However, from what I can see xnojf seems to have run fine, and xnojfa_da015 looks OK when I open it up. It also seems to run fine when used as the start dump for the Euro4km job.

Sorry if this all sounds really confusing!

comment:7 Changed 22 months ago by sam89

I reran xnoji with 120 timesteps instead of 144 (so 5 every hour instead of 6), and now the .leave file says

targeted diffusion in 0 columns

==============================================
initial Absolute Norm : 137387.58441983326
GCR( 2 ) converged in 151 iterations.
Final Absolute Norm : 6.92044189133555684E-3
==============================================

Q_POS: unable to conserve on 29 levels

It still only appears to get to the 5th timestep, though, I think. It no longer says the solver does not converge, but it still reports "Q_POS: unable to conserve on 29 levels", and I am not sure what this means.
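(For reference, here is what the various steps-per-day settings in this ticket imply for the timestep length; just a quick arithmetic sketch in Python, nothing UM-specific.)

SECONDS_PER_DAY = 24 * 3600

for steps_per_day in (144, 120, 288):
    dt = SECONDS_PER_DAY / steps_per_day
    print(f"{steps_per_day} steps/day -> {dt:.0f} s timestep")

# 144 -> 600 s, 120 -> 720 s, 288 -> 300 s. Going from 144 to 120 steps
# per day lengthens the timestep; a shorter timestep (more steps per day)
# is the usual first response to a CFL-type instability, not a longer one.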

comment:8 Changed 22 months ago by willie

Hi Sam,

I copied your xnoji, changed it from 144 to 288 time steps per day, and
switched on dumping every timestep. This failed at time step 8. The
dump file showed huge vertical winds at 20 m and surface temperatures
exceeding 6000 K. I don't know what is causing this.
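If you want to sanity-check a dump yourself, something along these lines should work (a minimal sketch, assuming the Met Office mule Python library is available on Monsoon; the filename and the STASH item codes, 24 for surface temperature and 150 for w, are illustrative assumptions for your model version):

import mule

# Hypothetical dump filename; substitute the dump you want to inspect.
dump = mule.DumpFile.from_file("xnojia_da008")

for field in dump.fields:
    # lbuser4 holds the STASH code; 24 = surface temperature, 150 = w wind
    # (treat these item numbers as assumptions for your setup).
    if field.lbuser4 in (24, 150):
        data = field.get_data()
        print(f"STASH {field.lbuser4} level {field.lblev}: "
              f"min={data.min():.2f} max={data.max():.2f}")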

Your data flow is very complex, but has the following chain at its core:

  1. Run PS31 MOGREPS
    1. RCF (xnoeb)
    2. Run (xnoed), dumping at 3 hours
  2. Run PS30 Global for one day
    1. RCF (xnojy)
    2. Run (xnojf)
  3. Run PS30 Global for five days
    1. RCF (xnpbb)
    2. Run (xnpbc)

This chain has not worked. We had problems with xnojf and now with
xnpbc, which fails just like xnoji. xnoji is just a variant of xnojf,
so it is likely to suffer the same problems. It is important to get the
core chain working reliably before cloning it.

In the MOGREPS run you are introducing IAU increments right at the
start of the chain. But the problem of using increments was
unresolved in ticket #2234, so this could be the cause of the current
issue. You could test this by repeating the above chain with no
increments added.

The processing chain could be simplified I think. Step three could be
avoided by running step two for five days. This will be important
when you replicate the chain six or twelve times. If a five day run
is too long for the Monsoon queue, then you need to split the job into
chunks and do an NRUN followed by a CRUN.

That's all I have at the moment.

Regards
Willie

comment:9 Changed 22 months ago by sam89

Hi Willie

Thanks for that. As far as I was aware, though, the MOGREPS run now worked, as there was no error in the .leave file. Unfortunately I need to add the perturbations, as it's an ensemble run. I also cannot cut the middle step out, as I need to restart the Global run for 5 days from the 18 UTC start dump.

The only other thing I can think to try is if you could show me where there is a standard N216 job on Monsoon. I currently used a standard N400 job, but the initial start dump used to initialise that run, along with the perturbation files, was created from an N216 run. I could not find a standard N216 job, though, and when I tried to modify a Global job to this resolution it failed. When I used the N400 job and changed the ancillaries to N216 it seemed to run fine, but I guess not, if it's not working. Sue and I both checked the output, though, and it seemed fine, along with the .leave file.

Let me know what you suggest.

Sam

comment:10 Changed 22 months ago by sam89

I should say that the N400 job worked, hence I closed the other ticket about the MOGREPS run; I did indeed give up on it at the time, but I did get it to work.

comment:11 Changed 22 months ago by sam89

As I said, the xnojf and xnojy jobs did appear to work eventually; neither Sue nor I could see anything in the .leave files to suggest otherwise, and the output seemed fine. The xnpbc run just fails with this instability, so I thought it might be a problem with the previous runs, but there is nothing in those runs to suggest why they would cause this issue now.

I'm kind of stuck, as I have to get this running to be able to submit my PhD in December.

Something is obviously going wrong in the runs before xnpbc to make it so unstable that it fails to run, but as they all now seem to run fine it's hard to work out why this one is not working. I have to run the job like this, too, as it's my specific method, so unfortunately I can't cut any steps out.

comment:12 Changed 22 months ago by sam89

Also, I forgot to say that I do have a control run with no perturbations added at the start, and this one appears to have the same issue when run as a Global job for 5 days.

comment:13 Changed 22 months ago by sam89

Hi Willie

Sorry for all the messages.

I just realised you said you copied xnoji and that with more timesteps it fails. Does this mean that although it runs fine for 144 timesteps it still isn't actually working properly? It seems to run without error for 144 timesteps. Sorry, I think I got confused and thought you said you had copied the Global 5-day run job.
As I said, I'm a bit confused by it all, as everything seemed to run without error except xnpbc.

Sorry I know it's a really confusing method!

Sam

comment:14 Changed 22 months ago by sam89

Hi,

I have been trying these things:

  1. Re-run the job xnpbc but with the start dump xnojfa_da015 directly, instead of reconfiguring this start dump and using the resulting .astart file.
  2. Re-run xnojf for 5 days instead of stopping after 24 hours
  3. Run the MOGREPS N400 job for 3 days instead of stopping at 03 hrs. My MOGREPS experiment is xnoe.
  4. Run job xnpbd (which is the same as xnpbc but with the IAU file switched on), with the old IAU file added at 18 hrs, to see if it runs or causes an issue.
  5. Looked at the Euro4km job.

The results I have are as follows:

  1. The NRUN seemed to work fine from what I can establish from the .leave file (xnpbc000.xnpbc.d17236.t151705.leave); if you could check it, though, that would be great. This file also shows the sw/lw warning I was talking about.

The CRUN for the job is xnpbc000.xnpbc.d17236.t154611.leave
This appears to get through the second day, so to timestep 288, but then for some reason does not resubmit.
I don't see an error, so I am not sure why it does not resubmit. The CRUN mechanism is obviously working, since it managed to run for a second day at all. I have checked, and I have it set up to run for 5 days, to resubmit every day, and to dump out every 24 hours.
It seems to be creating the 48-hour dump (and the dump itself looks OK and has sensible values for the fields), so I am not sure why it would then not resubmit and run for a third day. Could there be a reason why it won't resubmit?
Again, I cannot see anything obvious in the .leave file to suggest it failed or had an error. This 5-day job was a copy of my 5-day job that has previously worked, just resubmitting every 24 hours, so I don't think there is anything wrong with the actual job setup, but I could be mistaken.

  2. This run seems to run fine for the NRUN: xnojf000.xnojf.d17236.t152204.leave.

The CRUN is xnojf000.xnojf.d17236.t155107.leave
The same thing happens for this run: it gets to timestep 288 and then does not resubmit, but again I cannot see any reason for this in the .leave file.

  3. The MOGREPS N400 job at N216 resolution seems to run OK for the full 3 days. It gets through all 432 timesteps, but I did notice that for each timestep it says something along these lines:

Atm_Step: Timestep 432 Model time: 2012-07-08 00:00:00
MPPIO: Open: xnoeka_pd069 on unit 63

==============================================
initial Absolute Norm : 604.93618652871203
GCR( 2 ) converged in 29 iterations.
Final Absolute Norm : 2.64989693338755963E-3
==============================================

Q_POS: unable to conserve on 31 levels

Minimum theta level 1 for timestep 432

                 This timestep                          This run
Min theta1   proc   position                     Min theta1   timestep
226.32       5      178.3deg E  -79.4deg S       224.76       268

Largest negative delta theta1 at minimum theta1
This timestep = -2.03K. At min for run = -6.55K


From what I can see, though, this does not cause the job to fail: it says it for every timestep yet carries on through to the last one (see xnoek000.xnoek.d17236.t153737.leave). But I assume this is why we then get the error in job xnpbc. I am unsure whether this is the root of all the problems, though, or how to fix it if so. I also noticed that xnpbc and xnojf report the same messages (a small script for pulling these solver lines out of a leave file is sketched after this list).

  4. This again seemed to run fine for the NRUN: xnpbd000.xnpbd.d17236.t155603.leave. I don't know whether it would fail once I have an IAU file created from the output of these runs, as I eventually will.

The CRUN again fails after 2 days.

  5. The Euro4km job I am running should run fine, as I am only running it for 12 hours, whereas previously I ran it for 24 hours and it ran fine and did not run out of wall time; so I am not sure why it is not running properly. I am fairly certain I have the wall time set to the maximum as well. I don't know why it would run out of wall time when running for half the time. See job xnotn.

From what I can see it seems it was an issue with reconfiguring the output dump from xnojf, but I may be missing something in the .leave file which suggests it is still not running properly. There must still be an issue, since it will run for only 2 days rather than 5, but it has got further. The main thing is that I think the MOGREPS job runs fine, and both the original Global job and the one starting at 18 hrs run for the same amount of time, so there may be an issue with the reconfiguration to Global resolution from MOGREPS, unless there is an easy fix to get the job to run for 5 days instead of 2 (which I can't work out, as there doesn't appear to be anything obvious in the .leave file as to why it will not resubmit). If you copy across any of the jobs to try to get it to run for 5 days instead of 2, can you try xnpbc, as this is the one I would prefer to get working, since it then matches my previous method of starting at 18 hrs.

  6. As a test I used the 48-hr dump to restart the job on 7th July 2012 at 18 hrs (instead of the 5th) to see if this would run (xnpba).

xnpba000.xnpba.d17236.t172339.leave
This seems to run fine as an NRUN, so the 48-hr dump must not be the reason the job (xnpbc) will not resubmit, and now I am really confused! It will not be ideal if I have to do a separate run for each 2-day chunk, as I already need to run it 6 times for the 6 different ensemble members, so that would be 18 runs! So if you are able to work out why it will not just automatically run for 5 days, that would be great!
Hopefully it is just an issue with the resubmission itself, since the 48-hr start dump seems to work fine.
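As mentioned under item 3, here is a small script to pull the GCR solver lines out of a leave file, so the iteration count can be watched over the run (a minimal Python sketch; the filename is just one of the leave files quoted above):

import re

# Match "GCR( 2 ) converged in N iterations" lines plus outright failures;
# a steadily climbing iteration count often precedes an instability.
pattern = re.compile(
    r"GCR\(\s*2\s*\) (converged|failed to converge) in (\d+) iterations")

with open("xnoek000.xnoek.d17236.t153737.leave", errors="replace") as f:
    for lineno, line in enumerate(f, 1):
        m = pattern.search(line)
        if m:
            status = "FAIL" if m.group(1).startswith("failed") else "ok"
            print(f"line {lineno}: {status:4s} {m.group(2)} iterations")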

Let me know your thoughts!

Thanks,

Sam

comment:15 Changed 22 months ago by willie

Hi Sam,

I have created an N216 forecast job (Build, RCF and Forecast) for Monsoon2. This is xnpe. It is derived from an old vn8.2+ nesting suite job and has been well tested; I ran it for 10 days successfully. Maybe you could use this to drive your MOGREPS-G model.

I think that the main problem here is that the chain xnoeb/xnoed/xnojy/xnojf/xnojh/xnoji is not working, and this is shown by the failure in xnoji after 4 or 5 time steps. I think that the latter part of the chain (after xnoeb/xnoed) is part of your standard data flow, that you have had it working for some time, and that all you ever change is the input data. The current problems occur when you drive it with MOGREPS-G (xnoeb/xnoed). Let me know if you concur with this summary; if not, could you provide a concise summary of the problem.

Regards
Willie

comment:16 Changed 22 months ago by willie

Hi Sam,

Regarding the CRUN issue in xnpbc000.xnpbc.d17236.t154611.leave, you're getting

qsub: error: [PBSInvalidProject] 'diamet' is not valid for unknown trustzone on XCS

This was a Monsoon issue which has now been solved. Just resubmit the job again.
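If you want to check all the ensemble members' leave files for this at once, something like the following would do it (a minimal Python sketch; the glob pattern is an assumption about where your leave files sit):

import glob

# Flag any leave file whose resubmission step hit a qsub/PBS error,
# like the PBSInvalidProject failure above.
for path in sorted(glob.glob("*.leave")):  # assumed location
    with open(path, errors="replace") as f:
        hits = [line.rstrip() for line in f if "qsub: error" in line]
    if hits:
        print(path)
        for hit in hits:
            print("   ", hit)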

I think we should keep this ticket to the core issue. If you have further problems please create a new ticket.

Regards
Willie

comment:17 Changed 22 months ago by willie

Hi Sam,

The last few leave files you mentioned:

xnojf000.xnojf.d17236.t155107.leave - this is the Monsoon problem again
xnpbd000.xnpbd.d17236.t155603.leave - this completed 144 time steps w/o error
xnpba000.xnpba.d17236.t172339.leave - this completed 144 time steps w/o error

So this focusses the problem down to the chain ending in xnoji.

I also noticed that in your MOGREPS-G (xnoed) job you have switched off the section 35 stochastic physics scheme. This is different from the standard N400 MOGREPS-G (xkffd). Is that important?

Regards
Willie

comment:18 Changed 22 months ago by sam89

I am trying to get the N216 job to run. It seems to be failing in the run job (xnpic), though, in qatmos. I am sure it is failing due to something simple, but I cannot work out what it is.
Could you take a look at it for me? I think it may be because you built the N216 job using the Global start dump and not the N216 start dump, but it could equally just be me doing something silly or forgetting to change something!

xnpic000.xnpic.d17243.t070618.leave

The chain works now; it was just failing in the CRUN, and if that was just a Monsoon issue then it is OK now. I will still get the N216 job running and run on from this instead of the N400, though, as that seems more sensible given I am starting from N216 resolution.

Thanks,

Sam

comment:19 Changed 22 months ago by sam89

Perhaps there may be an issue with the reconfiguration job xnpib? I checked the output and it looks fine, though, and from what I can see in the .leave file it seemed OK. I could be wrong, though:

xnpib000.xnpib.d17243.t064022.rcf.leave

comment:20 Changed 22 months ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed

Closed, as Sam says the chain now runs. The xnpic issue is now dealt with in #2258.
