Opened 4 years ago

Closed 4 years ago

#1688 closed help (answered)

reconfiguration failing

Reported by: anmcr
Owned by: um_support
Component: UM Mesoscale
Keywords: reconfiguration
Cc:
Platform: ARCHER
UM Version: 8.2

Description

Hi Willie,

I still haven't managed to run the 12 km Antarctic job I raised many weeks ago in ticket #1612. I have been doing other things since, but I now need to get this done ASAP, so I would be grateful for any help.

The job is XLTUG.

The error is:

############################################################
? Error in routine: mppio:buffin
? Error Code: 22
? Error Message: Error in buffin errorCode= 0. len= 326656 / 379904
? Error generated from processor: 0
? This run generated 87 warnings
############################################################

I'm pretty certain that this is related to the start dump. If you look at the output file for the reconfiguration job (xltug000.xltug.d15280.t165612.rcf.leave), it says that the run completed successfully, but there is hardly any information in it. Examination of the actual start dump shows several odd values in fields such as 'rain after timestep'. However, I'm unsure what I am doing wrong, as the job is an exact copy of your/Grenville's job.

Best wishes,

Andrew

Change History (10)

comment:1 Changed 4 years ago by willie

Hi Andrew,

xltug.start is corrupt - cumf fails when reading it. I suggest you regenerate it - you'll have to recompile xltug since the xltuf job is no longer in the UMUI.

Regards,

Willie

comment:2 Changed 4 years ago by anmcr

Hi Willie,

Thanks for the reply.

Can you please confirm that the build job should be based on your job XKZTJ?

Thanks,

Andrew

comment:3 Changed 4 years ago by anmcr

Dear Willie,

I deleted the executables I created from xltuf last night by accident.

My recollection is that I do the following steps

1) Take a copy of your build job xkztj
2) Switch off the 'hydrology' and 'aerosol' options in the physics
3) Compile the reconfiguration and model executables
4) Use these executables in my xltug job to reconfigure and run the model.

Thanks,

Andrew

comment:4 Changed 4 years ago by anmcr

Dear Willie,

I've made some progress but the model still doesn't run. See below.

1) I took a copy of your build job xkztj and remade the reconfiguration and run executables (this is my job xltuf).

2) I used my run job xltug to regenerate the start dump. I checked it with cumf -dOUT ~ xltug.astart xltug.astart. It didn't crash and was readable so I assume that it is ok.

3) However, when I ran the run executable it failed with the error:

???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: INITIALApplication 18217730 is crashing. ATP analysis proceeding…

? Error Code: 1
? Error Message:
? Error generated from processor: 0
? This run generated 87 warnings
????????????????????????????????????????????????????????????????????????????????

I searched previous tickets but couldn't see anything related to this. It looks as if it is related to reading in data.

I would be very grateful for some guidance.

Best wishes,

Andrew

comment:5 Changed 4 years ago by grenville

Andrew

I took a copy of xltug (my job xlwhz) and switched on use of lbcs - that was switched off in xltug, causing it to fail.

I think you have a problem with time series in your STASH setup - I switched off STASH and the model ran OK - see /home/n02/n02/grenvill/um/umui_out/xlwhz000.xlwhz.d15285.t103704.leave

I also reconfigured your dump (I'm not sure that was strictly necessary).

Grenville

comment:6 Changed 4 years ago by grenville

  • Status changed from new to pending

comment:7 Changed 4 years ago by anmcr

Dear Grenville,

Thank you very much for getting this to run. I have a couple of queries:

1) I took my job xltug and switched on lbcs and switched off stash, as you suggested, but the model still failed. I differenced xltug with your job xlwhz and there were a considerable number of additional (minor?) differences. In fact, I only got the model to run by taking a copy of xlwhz (my job xltuj). Did you make any further changes which you would label as important? It would be good for me to know them for my own troubleshooting.

2) Are you able to give some guidance on the optimum number of processors?

Using 576 timesteps per period, running the model for 1 hr took 3 m 43 s using 12x8 processors and 2 m 49 s using 24x8 processors. So it obviously does not scale linearly.
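
The two timings above can be turned into a speedup and a parallel efficiency. The following is a minimal sketch using only the figures quoted here; it assumes that 12x8 and 24x8 mean 96 and 192 processors, and the function names are hypothetical, chosen for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M m S s' wall-clock time to seconds."""
    return 60 * minutes + seconds


def speedup(t_base, t_new):
    """How many times faster the new run is than the base run."""
    return t_base / t_new


def parallel_efficiency(t_base, p_base, t_new, p_new):
    """Speedup divided by the factor by which the processor count grew."""
    return speedup(t_base, t_new) / (p_new / p_base)


# Timings quoted in the ticket for a 1-hour model run.
t_96 = to_seconds(3, 43)    # 12x8 decomposition, assumed to be 96 processors
t_192 = to_seconds(2, 49)   # 24x8 decomposition, assumed to be 192 processors

s = speedup(t_96, t_192)
e = parallel_efficiency(t_96, 12 * 8, t_192, 24 * 8)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")
```

Doubling the processor count gives a speedup of about 1.32 rather than 2, i.e. a parallel efficiency of roughly 66% relative to the 12x8 run.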

3) Are you able to give any guidance as to the number of vertical levels and timestep? Previously, running a LAM over Antarctica I have used 38 vertical levels and 288 timesteps per period. Is there any particular reason that I should be using 70 levels as opposed to 38? Is the choice of number of timesteps primarily based on ensuring convergence occurs at each timestep in the dynamics?

Best wishes,

Andrew

comment:8 Changed 4 years ago by grenville

Andrew

1) I'm pretty sure the changes were very minor. My main stumbling block was just the LBCS - I only guessed that after dumping at time step 1 and looking at the winds.

I played with the time step at one point but that wasn't relevant.

2) Only to do what you've done and plot a curve to find where it really goes bad - looks like you did get pretty decent scaling (it'll never be linear). More processors will always cost more AUs, but turnaround time is important too.
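
The trade-off between cost and turnaround can be sketched with the two timings reported in comment 7. This is only an illustration: it assumes the charge is proportional to processors multiplied by wall-clock time (the actual AU accounting on ARCHER is not described in this ticket), and the helper name is hypothetical.

```python
def core_seconds(processors, wall_seconds):
    """Processor count times wall-clock time; an assumed stand-in for AU cost."""
    return processors * wall_seconds


# (processors, wall-clock seconds) for the 1-hour run, from comment 7.
runs = {"12x8": (96, 223), "24x8": (192, 169)}

costs = {name: core_seconds(p, t) for name, (p, t) in runs.items()}
extra_cost = costs["24x8"] / costs["12x8"] - 1   # fractional increase in cost
time_saved = 1 - runs["24x8"][1] / runs["12x8"][1]  # fractional cut in wall time

for name, cost in costs.items():
    print(f"{name}: {cost} core-seconds")
print(f"extra cost: {extra_cost:.0%}, wall-clock time saved: {time_saved:.0%}")
```

Under that assumption, the 24x8 run costs about half as much again as the 12x8 run in exchange for roughly a quarter less wall-clock time.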

3) We ran the CASCADE 12km model with 288 steps/day - the SWAMMA model (this one) has a shorter step - I can't recall why precisely, but the model is generally more stable with a shorter step.
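
For reference, the step counts can be converted to timestep lengths. This is a small sketch that assumes the "period" in "timesteps per period" is one day (86400 s), which matches the "steps/day" wording above but is not stated explicitly for the 576-step job; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86400  # assumed length of one "period"


def timestep_seconds(steps_per_period, period_seconds=SECONDS_PER_DAY):
    """Length of one model timestep in seconds."""
    return period_seconds / steps_per_period


for label, steps in [("SWAMMA (this job)", 576), ("CASCADE", 288)]:
    print(f"{label}: {steps} steps/period -> {timestep_seconds(steps):.0f} s")
```

With a one-day period, 576 steps corresponds to a 150 s timestep and 288 steps to a 300 s timestep.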

The SWAMMA scientists wanted 70 levels - I think this is a science question - CASCADE ran fine with 38. 70 levels will take roughly 2x longer to run than 38 levels and potentially create 2x as much data.

Are you planning to run at UM 8.2? This version has the option to output netCDF directly, but some STASH features (climate meaning, time series) are not supported by netCDF.

Grenville

comment:9 Changed 4 years ago by anmcr

Hi Grenville,

Thanks for your further reply. Very helpful information.

You can now close this ticket.

Best wishes,

Andrew

comment:10 Changed 4 years ago by grenville

  • Resolution set to answered
  • Status changed from pending to closed