Opened 10 years ago

Closed 10 years ago

#584 closed help (fixed)

Global vn 7.1 job has started failing in dump reconfiguration

Reported by: cbirch Owned by: um_support
Component: UM Model Keywords: dump reconfiguration
Cc: Platform:
UM Version: 7.1

Description

Hi,

I have been using a vn7.1 global job to create LBCs for a LAM over Africa. I have used the same job many times and it has worked fine, but when I tried using it again recently it failed with this error:

aprun: file /work/n02/n02/cbirch/xfrja/bin/qxreconf not found
aprun: Exiting due to errors. Application aborted
/work/n02/n02/hum/vn7.1/pathscale_quad/scripts/qsexecute: Error in dump reconfiguration - see OUTPUT

I am not sure why this is the case. I have a feeling it is something to do with the recent changes to HECToR. As an example: xfrjy worked fine on 4th Jan on phase2a (xfrjy000.xfrjy.d11004.t155749.leave). I copied this job to xfrja and tried to run it on both phase2a (xfrja000.xfrja.d11049.t095355.leave) and phase2b (xfrja000.xfrja.d11048.t170212.leave) and got the error above.

Cheers,
Cathryn

Change History (37)

comment:1 Changed 10 years ago by grenville

Cathryn

The job failed because it couldn't find /work/n02/n02/cbirch/xfrja/bin/qxreconf - there is no bin directory under /work/n02/n02/cbirch/xfrja. This has happened to me before, but I'm not sure of the reason - I think it has something to do with the way you have set up your data directory in the umui. One way around it is to copy the bin directory under umbase into the top-level directory (/work/n02/n02/cbirch/xfrja), then copy qxreconf from umrecon/bin and xfrja.exe from ummodel/bin into that top-level bin. Then it should work. This is easier done than said.
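The copy steps can be sketched as a shell fragment. The umbase/umrecon/ummodel layout under the job's data directory is an assumption based on the description above, and the executables are created here as empty stand-ins so the sketch is self-contained; on HECToR you would point DATADIR at the real job directory and skip the mkdir/touch lines.

```shell
# Stand-in for /work/n02/n02/cbirch/xfrja (assumption; adjust DATADIR)
DATADIR=${DATADIR:-/tmp/xfrja_demo}
mkdir -p "$DATADIR"/umbase/bin "$DATADIR"/umrecon/bin "$DATADIR"/ummodel/bin
touch "$DATADIR"/umbase/bin/qsexecute    # placeholder for the umbase scripts
touch "$DATADIR"/umrecon/bin/qxreconf    # reconfiguration executable
touch "$DATADIR"/ummodel/bin/xfrja.exe   # model executable

# 1) copy the umbase bin directory into the top-level job directory
cp -r "$DATADIR"/umbase/bin "$DATADIR"/bin
# 2) add the reconfiguration and model executables to the top-level bin
cp "$DATADIR"/umrecon/bin/qxreconf "$DATADIR"/ummodel/bin/xfrja.exe "$DATADIR"/bin/
ls "$DATADIR"/bin
```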

Grenville

comment:2 Changed 10 years ago by cbirch

Hi Grenville,

That seems to work for the global job - thanks! I'm currently testing the 1.5km job.

Not sure if I should have started another ticket - I have just tried to run the same global job but for a different start date (xftdx). The new start date is in December 2010, so uses a more recent start dump than the 2006 UM start dumps in your AMMA_start_dumps directory. I get the following error (xftdx000.xftdx.d11052.t102420.leave):

*
ERROR!!! in reconfiguration in routine Rcf_Exppx
Error Code:- 2
Error Message:- Cant find required STASH item 490 section 0 model 1 in STASHmaster
Error generated from processor 0
*

This STASH item is 'decoupled screen temp on tiles' and it is indeed not included in the vn7.1 STASHmaster files in /work/n02/n02/hum/vn7.1/ctldata/STASHmaster/. It is included in the vn7.8 files, though. I guess this error didn't occur with the 2006 start dumps because the item does not appear in them.

I'm not sure how to fix this. Can I just use the vn7.8 STASHmaster file? I'm not sure how to specify this in the UMUI.

Cheers,
Cathryn

comment:3 Changed 10 years ago by willie

Hi Cathryn,

I am currently working on a solution to this problem - I think it involves creating a user STASHmaster file to eliminate STASH 490 from the reconfigured start dump.

Regards,

Willie

comment:4 Changed 10 years ago by willie

  • Type changed from error to help

Hi Cathryn,

I have a new user STASH master file, STASH_7.3_7.5 in my home directory on PUMA. This should remove STASH 490 from the reconfigured start dump. To use it go to Atmos > STASH > User STASH and add it to the table at the top.

I created this file from the STASHmasters from later versions of the UM. The trick with STASH 490 was to set the space code to 10.

Regards,

Willie

comment:5 Changed 10 years ago by cbirch

Hi Grenville/Willie,

Thanks Willie, that STASHmaster file got rid of the STASH 490 error.

The job still won't run though. The latest error is (xftdx000.xftdx.d11052.t131517.leave):

/work/n02/n02/cbirch/xftdx/bin/qsmaster: Failed in qsexecute in model xftdx
*

Starting script : qsfinal
Starting time : Mon Feb 21 13:32:56 GMT 2011

*

qsfinal: thist file copied to /work/n02/n02/cbirch/xftdx/xftdx.thist.10628
/work/n02/n02/cbirch/xftdx/bin/qsfinal: Error in exit processing after model run
Failed in model executable

I'm not sure if this is related to the original issue? All the executables seem to be in xftdx/bin, though.

Cheers,

Cathryn

comment:6 Changed 10 years ago by willie

Hi Cathryn,

This is a 'top level' error - it just means that something has gone wrong later on. Later in the file you will see it has a segmentation fault and although it has processed several time steps, at one point we have "RHS zero so GCR( 2 ) not needed". This indicates that the model is unstable. Usually this is resolved by reducing the time step. If you go to Atmos > Scientific > Time stepping and change 96 steps per period to 192, say, that should allow further progress.
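As a quick sanity check on what the change does, assuming the period here is one model day (86400 s, the usual convention): doubling the number of steps per period halves the timestep length.

```shell
# Timestep length for a given number of steps per period
# (period assumed to be one model day, 86400 s)
period=86400
for steps in 96 192; do
  echo "$steps steps per period -> $((period / steps)) s timestep"
done
# -> 96 steps per period gives a 900 s timestep, 192 gives 450 s
```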

Regards,

Willie

comment:7 Changed 10 years ago by cbirch

Hi Willie,

I changed the timestep as you suggested but I still get that "RHS zero so GCR( 2 ) not needed" error (xftdx000.xftdx.d11053.t113838.leave).

This error also appeared:

*
UM ERROR (Model aborting) :
Routine generating error: Bi_linear_h
Error code: 10
Error message:

over-writing due to dim_e_out size

*

I saw this on a previous ticket and you solved the problem by configuring the land-sea mask. I didn't know if this would help, but thought it wouldn't hurt to try so I tried doing this but I still get the same RHS zero error (xftdx000.xftdx.d11054.t094126.leave).

I'm not sure what else to try.

Cathryn

comment:8 Changed 10 years ago by willie

Hi Cathryn,

Hmmm! You have some MPI errors too. Could you change your processor configuration from 8 EW x 4 NS to 4 EW by 12 NS and try again, please. You do this on the target machine page.

Regards,

Willie

comment:9 Changed 10 years ago by cbirch

Hi,

I've tried changing the processor configuration as you suggested and the errors look the same as before (xftdx000.xftdx.d11054.t111134.leave).

Cheers,
Cathryn

comment:10 Changed 10 years ago by grenville

Cathryn

There are a couple of things I'd try:

  • go back to the model settings that worked for xftdy and try using a start file for a different time
  • try writing a dump on the timestep just before the model crashes, then run again with that dump reconfigured

Grenville

comment:11 Changed 10 years ago by willie

Hi Cathryn,

It did, however, get rid of the MPI errors. If you now reduce the timestep (again!) to 384 steps per period, it should get further.

Regards,

Willie

comment:12 Changed 10 years ago by cbirch

Hi Willie/Grenville,

I tried reducing the timestep of xftdx to 384, as Willie suggested, and it managed to run for 22 hours (xftdx000.xftdx.d11054.t171538.leave). I also tried what Grenville suggested: a different start time for xftdy, which resulted in the same error after a few hours. Reducing the timestep and changing the processor configuration of this job increased the run time by an hour or so. I also tried writing a dump and then running again with that dump reconfigured, but that also failed.

I need to run this job to get LBCs for a vn7.1 LAM. I will probably have to run it for several different days and for up to 3 days at a time, so I need something robust, which the vn7.1 job does not seem to be with the 2010 start dumps. Previously I have also used a vn6.1 global job to produce LBCs for the vn7.1 LAM, which has proven more reliable. I have tried this job with the 2010 start dumps but I get STASH item errors (xftdz000.xftdz.d11054.t115128.leave):

*
ERROR!!! in reconfiguration in routine Rcf_Exppx
Error Code:- 2
Error Message:- Cant find STASH item 21 section 0 model 1 in STASHmaster
Error generated from processor 0
*

Does this need a new user STASHmaster file, like Willie created for the vn7.1 global job?

Cheers,
Cathryn

comment:13 Changed 10 years ago by lois

Hello Cathryn

We will sit down on Monday (Grenville, Willie and I) and work out a strategy for sorting this out. You really need a basic vn7.1 job set up to take a range of different start files (even if you do have to change the STASHmaster file to adjust to the changes introduced in the newer start files) and to run for 3 days.

Lois

comment:14 Changed 10 years ago by lois

Hello Cathryn,

Having reviewed all your problems, Cathryn, we have come up with a plan. There are basically two things you need:
1) a stable global model which runs for at least 3 days, from which you can create your LBCs for different start times
2) different start dumps and the appropriate STASHmaster file to use in the stable global model

We are going to label vn7.1 (PS20), which is under the userid umui in the User Interface, as our stable global model. When these example jobs are installed and tested they are generally run with a start file created for that model version (this explains why the 2006 start files are OK but the 2010 ones have had problems), and only for a few timesteps. Nor do we test these jobs for changes from the standard configuration. So we will now test this global standard run with old and new start files and check that it is stable for 3 days.

We will then endeavour, when new start files are requested, to check that these work in this 3 day stable global model and provide an appropriate STASHmaster file to cope with the changes that have been introduced in the model since this 7.1 release.

We should have this up and ready by the end of the week, although HECToR is down on Wednesday.

On another note, you have been asking about the model response to different start data from either the Met Office or ECMWF. The model is always going to be sensitive to different initial conditions; a more detailed discussion of this is in a technical report which may be a bit old but is still relevant (see pages 16-18):
http://ncas-cms.nerc.ac.uk/index.php/um-documentation/technical-reports

You can create boundary conditions from ECMWF analyses using makebc but these would only be 6 hourly as the analyses are only 6 hourly. We have the scripts and the means of doing this so we could do this for you.

More news in a few days.

Lois

comment:15 Changed 10 years ago by cbirch

Hi Lois,

That all sounds great. Thanks a lot for your help.

Thank you also for pointing me to that document on start data.

Cathryn

comment:16 Changed 10 years ago by cbirch

Hi,

I was wondering if you had made any progress with the global vn7.1 job?

Cheers,
Cathryn

comment:17 Changed 10 years ago by willie

Hi Cathryn,

After some extensive testing, we have found that the UM7.3 model (UMUI jobs 'xeri') is much more stable. I have run the global, NAE and 4km jobs for run lengths of three days on both XT4 and XE6 and they are stable all the way to the end.

Is it possible for you to switch to this model? We can provide help to make the switch.

Regards,

Willie

comment:18 Changed 10 years ago by cbirch

Hi Willie,

Switching to vn7.3 sounds fine, especially because the runs I want to do using the more recent (2011) start dumps are for a new project so it doesn't really matter which version I use, as long as it works.

I notice that there is also a lower resolution job in 'xeri' (UKV). Does this also work? I was intending to run the vn7.1 jobs down to 1.5km resolution.

For the work I am doing for 2006 (using Grenville's cascade vn7.1 jobs) I'd rather stick with vn7.1 because I have already done a lot of work using it. I assume you would still support vn7.1 in this case? Obviously if major problems develop in the future I could consider switching.

Should I take the 'xeri' jobs and try to set up the domains over Africa that I need and if they don't work get back in touch?

Cheers,
Cathryn

comment:19 Changed 10 years ago by willie

Hi Cathryn,

The UKV run works and is stable on the XT4 but not on the XE6 - I am hoping to get a solution/workaround soon.

By all means, take a copy of 'xeri' and let me know how it goes.

Regards,

Willie

comment:20 Changed 10 years ago by cbirch

Hi Willie,

I have got the global copy of 'xeri' to produce LBCs for my African domain, and the 12km domain also works fine, including when using my ancillary files. Thank you for your good work with these!

The 4km domain fails in the dump reconfiguration (xfyhc000.xfyhc.d11095.t132408.leave). I have set up the umui to use ancillary files that match my African domain but the job is still trying to use UK ancillary files. It says this in the .leave file:

Sourcing ancil files for UK 4km

::::::::::::::
/work/n02/n02/hum/ancil/ancil_versions/m4_288360_uk/ps22
::::::::::::::
#########################################################
# Ancil Version File for m4_288360_uk
#########################################################

# File containing filenames version file
UM_ANCIL_FILENAMES=${UM_ANCIL_FILENAMES:-$UMDIR/ancil/ancil_versions/filenames/v1}

I can't work out how to turn this off so that it uses my ancillary files specified in the umui?

Cheers,
Cathryn

comment:21 Changed 10 years ago by grenville

Cathryn

It looks like the environment variable UM_ANCIL_FILENAMES is not set - this could be the same problem as with NLSPATH. As a quick work around, you can specify the full path for the ancillary files in the place where you specify ancillaries and where you currently have the entry $CLASSIC_ANCILS.
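The ancil version file quoted above sets UM_ANCIL_FILENAMES with a `${VAR:-default}` expansion, so when the variable is missing from the environment the central version file silently wins. A minimal sketch of that behaviour (the UMDIR value and the user path below are illustrative, not the real PS22 settings):

```shell
UMDIR=/work/n02/n02/hum/vn7.3            # assumption: illustrative UMDIR
unset UM_ANCIL_FILENAMES
# With the variable unset, the :- expansion falls back to the central file:
echo "${UM_ANCIL_FILENAMES:-$UMDIR/ancil/ancil_versions/filenames/v1}"
# With it exported (e.g. from a UMUI script insert), the user's file is used:
export UM_ANCIL_FILENAMES=/work/n02/n02/cbirch/my_ancil_filenames
echo "${UM_ANCIL_FILENAMES:-$UMDIR/ancil/ancil_versions/filenames/v1}"
```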

Grenville

comment:22 Changed 10 years ago by cbirch

Hi,

I don't really understand what you mean. UM_ANCIL_FILENAMES does not appear in the umui under Atmosphere -> Ancillary and input files, so it shouldn't matter that it is not set as an environment variable. I get the impression a script is specified somewhere in the umui which tells the model to use these UK4 ancillaries instead of the CLASSIC ones that I have specified (using $CLASSIC_ANCILS) under Atmosphere -> Ancillary and input files. I am not sure how to stop it doing this. Or am I completely missing something?

Cathryn

comment:23 Changed 10 years ago by willie

Hi Cathryn,

In the Script inserts and modifications page on the UMUI, there is a top script defined. This sets up a series of environment variables for the ancillary file names and another series for the directories in which the files are located. This helps insulate the UM parallel suite from changes to the ancillaries.

The simplest thing would be for you to copy the script and modify it to point to your own ancillaries. Keep the same environment variable names, otherwise you'll have to edit lots of ancillary pages.
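As a hypothetical sketch of that approach: copy the top script, then repoint only the directory values while leaving the variable names untouched. The variable names and every path below are illustrative stand-ins, not the real PS22 ones, and the script is written to a temporary file so the demo is self-contained.

```shell
# Illustrative stand-in for a copied top script (names/paths are not real)
cat > /tmp/my_top_script <<'EOF'
UM_ANCIL_MASK_DIR=$UMDIR/ancil/atmos/masks
UM_ANCIL_OROG_DIR=$UMDIR/ancil/atmos/orography
EOF
# Repoint the directories at your own ancillaries, keeping variable names:
sed -i 's|=\$UMDIR/ancil/atmos|=/work/n02/n02/cbirch/ancils/africa|' /tmp/my_top_script
cat /tmp/my_top_script
```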

Regards,

Willie

comment:24 Changed 10 years ago by cbirch

Hi,

I have got the 4km model (xfyhc) to use my ancillaries now. It produces the .astart file but then fails with this error message (xfyhc000.xfyhc.d11097.t104215.leave):

UM ERROR (Model aborting) :
Routine generating error: READSIZE
Error code: 10
Error message:

Row length is larger than maximum defined in AMAXSIZE

Do you know exactly what this is referring to? I have checked that my ancillaries and the LBCs produced by the 12km run are of the correct dimensions and they all seem fine, so I am not sure what else to try. I saw some old tickets (hspcx) which said a mod was required to solve this?

Cheers,
Cathryn

comment:25 Changed 10 years ago by grenville

Cathryn

It is complaining that your domain is bigger than its hard-coded limit - I do not know of a branch that fixes this; I have usually made the change in my working copy (it's only one number). Do you have a working copy of the code, i.e. one that you have obtained with "fcm checkout"?

Grenville

comment:26 Changed 10 years ago by cbirch

Hi Grenville,

I don't think I have a copy of the code. I have never modified UM code using FCM so I don't know how it all works. When I copied your vn7.1 4km job last year this error never occurred.

Cathryn

comment:27 Changed 10 years ago by grenville

Cathryn

Include the following in FCM Configuration→FCM options for atmosphere and reconfig in the User Modifications panel

fcm:um_br/dev/grenville/VN7.3_AMAXSIZE/src

The problem with row length should go away.

grenville

comment:28 Changed 10 years ago by cbirch

Hi,

The above fixed the row length problem. The job now compiles and runs (at least for a few hours) but the diagnostics are all NaNs. I had this problem before with the vn7.1 jobs when using the CLASSIC ancillary files. I got the vn7.1 12km job to produce proper data just by using ancillaries that were ALL made by CAP at the same time. The vn7.1 4km job also produces NaNs in all the diagnostics if the (4km) CLASSIC ancillaries are configured. I eventually got it to produce proper data by using just the 4km orography and land-sea mask ancillary files and the 12km .astart file. (These jobs are in experiment xfxu.)

With the vn7.3 jobs the 12km domain runs fine with the CLASSIC ancillaries, but I can't get the 4km job to produce any data that are not NaNs. I have tried using the 12km .astart file and the global UM analysis, with various combinations of ancillary files. In all the tests the diagnostics are all NaNs. In these runs the surface temperature is zero degrees K over almost the whole domain in the .astart file(!), no matter what combination of ancillaries or start dump I use (see /work/n02/n02/cbirch/xfyhc/xfyhc.astart as an example). I have no idea what else might cause this. In the vn7.1 4km job, although the model produces NaNs in the diagnostics, the 4km .astart files always looked OK. (The vn7.3 jobs are experiment xfyh.)

My priority is to get the vn7.3 jobs working. I only really give details of the vn7.1 jobs for information.

Cheers,
Cathryn

comment:29 Changed 10 years ago by grenville

Cathryn

I'm looking at this and will keep you informed of progress.

Grenville

comment:30 Changed 10 years ago by cbirch

Hi,

Thanks for looking into the 4km job for me.

I am starting to run the vn7.3 global and 12km jobs with April 2011 start dumps (in /work/n02/n02/cbirch/start_files/Fennec). The global job fails with this error (xfzfa000.xfzfa.d11108.t162053.leave):

*
ERROR!!! in reconfiguration in routine Rcf_Exppx
Error Code:- 2
Error Message:- Cant find required STASH item 376 section 0 model 1 in STASHmaster
Error generated from processor 0
*

I think I require a STASHmaster file similar to /home/willie/STASH_7.3_7.5 but for 7.3_7.7.

Could you do that for me?

Thanks,
Cathryn

comment:31 Changed 10 years ago by willie

Hi Cathryn,

The new file is called STASH_376_493. Just add this to your user STASH table in the UMUI.

It is in the same place as STASH_7.3_7.5.

Regards

Willie

comment:32 Changed 10 years ago by grenville

Cathryn

The problem with the surface temperature is a bug in the reconfiguration - I ran it on monsoon and the surface temperature came out properly. I am trying to get the model running there for now. Several problems with the reconfiguration (on Hector) have surfaced and we are investigating them in minute detail. We ran the reconfiguration on monsoon for the big cascade runs and then transferred the start file to hector and ran with no problems, so there is a work around for this issue.

Grenville

comment:33 Changed 10 years ago by grenville

Cathryn

The start dump created on monsoon has successfully run your 4km vn7.3 model on Hector for 24 hrs (see the output in /work/n02…/xfzqa). I used a copy of xfyhc to create the start dump; however, I noticed that the soil albedo ancillary in this job is corrupted, so I chose not to reconfigure it into the dump (running with the corrupt soil albedo caused a floating point exception on monsoon).

You can use the start dump - or we can get others via monsoon until the reconfiguration problem on Hector is fixed.

I noticed also that the 4km job isn't using the subgrid turbulence scheme (section 13), nor does it use the fully interpolating theta advection (section 12). We run with these schemes - fully interpolating theta prevents some odd behaviour in w at the top of the atmosphere, and I think Smagorinsky has better mixing than the old diffusion scheme.

Grenville

comment:34 Changed 10 years ago by cbirch

Hi Grenville,

I tried running the 4km job (xfyhd) with xfzqa.astart and it seems to work fine.

I have also noticed that there is something wrong with the soil albedo data in the 4km ancillary file when I plot it in xconv. When I plot this data in matlab it looks fine and similar to the 12km ancillary file - with sensible numbers over land and -1.0737e+09 over water. I'm not sure what is wrong with it or what the difference between the 4km and 12km files is. I got both the 4km and 12km ancillary file from http://ncas-cms.nerc.ac.uk/~grenville/CAP_INTERFACE/cap_general.php using the CLASSIC albedos. I got a new set of ancillaries from there today to see if this problem would go away but the problem remains.

In xfyhb I tried to make the changes to sections 12 and 13 like you suggested. I get this error (xfyhb000.xfyhb.d11123.t120441.leave):

Programmed tracer, soil temperature, soil moisture, b.l. levels = 1 4 4 69
File tracer, soil temperature, soil moisture, b.l. levels = 0 4 4 30
*
Failure in call to INITDUMP

*


UM ERROR (Model aborting) :
Routine generating error: INITIAL
Error code: 4
Error message:

PR_INHDA: Consistency check


I had to change the number of BL levels for the subgrid turbulence scheme so I guess this error is caused by xfzqa start dump being inconsistent with the new settings in the 4km job?

I was running the Dec 2010 start date as a test. The actual date I am interested in is 00Z on 7th April 2011 (run for 24 hours). The global and 12km runs work fine and are in experiment xfzf. Could you produce another start dump for this date using xfzfq? I would prefer it to include the 4km soil ancillary file and also the changes to sections 12 and 13. I have added these changes to xfzfq. Hopefully the changes to sections 12 and 13 will work with the new start dump, though obviously it won't work until the soil ancillary problem is fixed.

Cheers,
Cathryn

comment:35 Changed 10 years ago by grenville

Cathryn

The problem with the 4km soil albedo is around the coast of Madeira - there are points which the land-sea mask designates as land (eg 342.78E, 12.796N) but which have been assigned a bad albedo value (-1.4041e+08 at this point). As far as I can see, this is the only place where the error has happened. I think the most expedient thing will be to fix the bad values (only 8 of them) and use xancil to recreate the soil ancillary file. The source albedo data is at 0.05x0.05 deg resolution, so you will get some extra detail at 4km compared to 12km, but probably not much, in which case reconfiguring the 12km soil albedo might be good enough. I will try to get a new start file for April on Monsoon when it comes back so that you can at least test the run.

Grenville

comment:36 Changed 10 years ago by cbirch

Hi Grenville,

Ok, I should be able to fix that. I'll create a new ancillary (will be next week because I'm away tomorrow).

Thanks for your help.

Cathryn

comment:37 Changed 10 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Cathryn

I made a start file for the April 2011 case (with reconfigured 12km soil albedo) for the 4km model with fully interpolating theta advection/Smagorinsky (see xfzqd.astart) - the model started OK.

Grenville
