Opened 9 years ago

Closed 9 years ago

#731 closed help (fixed)

Errors in Reconfiguration Stage

Reported by: SimonDriscoll
Owned by: um_support
Component: UM Model
Keywords: volcanoes, aerosol, MONSooN
Cc:
Platform:
UM Version: 6.6.3

Description

Hi, I am trying to perform a number of short volcanic simulations with a control run (i.e. volcano eruption, then run for a few years in a control environment and then repeat with different start dumps) with and without slight changes to the aerosol scheme.

I can perform these runs with no change to the scheme (e.g. xfzog). However, when I change the scheme slightly (with a new branch), e.g. xgfel, I get errors at the reconfiguration stage after a successful compilation, such as:

"gc_abort (Processor 40 ): over-writing due to dim_e_out size
gc_abort (Processor 80 ): over-writing due to dim_e_out size
gc_abort (Processor 48 ): over-writing due to dim_e_out size
gc_abort (Processor 72 ): over-writing due to dim_e_out size"

and

"UM ERROR (Model aborting) :
Routine generating error: Bi_linear_h
Error code: 10
Error message:
over-writing due to dim_e_out size"

I have seen this error mentioned at http://cms.ncas.ac.uk/trac/UMHelpdesk/ticket/546 and noted that one suggestion was "You could go from 8x8 to 16 North-South x 8 East West." So I tried this with a copy of xgfel (xgfem), but the reconfiguration stage again failed after a successful compilation, this time with:

" Traceback:
Offset 0x00000010 in procedure xltrbk_
gc_abort (Processor 95 ): Job Aborted from Ereport
gc_abort (Processor 118 ): Job Aborted from Ereport"

and

"*
ERROR!!! in reconfiguration in routine Rcf_Initialise
Error Code:- 50
Error Message:- Total number of processors does not fit EW/NS LPG
Error generated from processor 87
*"

I have seen similar problems in this ticket http://cms.ncas.ac.uk/trac/UMHelpdesk/ticket/286 but have no reason to suspect that it may be the land-sea mask that is causing the problem.
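The "does not fit EW/NS LPG" message suggests that the East-West and North-South processor counts must multiply out to the total number of processors requested. Below is a minimal Python sketch of that consistency check. It assumes the test in Rcf_Initialise amounts to nothing more than this product rule, which is not confirmed here, and the function names are hypothetical rather than UM routines.

```python
from typing import List, Tuple


def fits_decomposition(total_procs: int, nproc_ew: int, nproc_ns: int) -> bool:
    """True if an EW x NS processor grid uses exactly total_procs processors."""
    return nproc_ew > 0 and nproc_ns > 0 and nproc_ew * nproc_ns == total_procs


def possible_decompositions(total_procs: int) -> List[Tuple[int, int]]:
    """All (EW, NS) pairs whose product equals total_procs."""
    return [(ew, total_procs // ew)
            for ew in range(1, total_procs + 1)
            if total_procs % ew == 0]


# 8 EW x 8 NS needs 64 processors; 8 EW x 16 NS needs 128.
print(fits_decomposition(64, 8, 8))
print(fits_decomposition(64, 8, 16))
print(fits_decomposition(128, 8, 16))
```

Under this assumption, going from 8x8 to 16 North-South x 8 East-West would also require the total processor request to change from 64 to 128.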

Specifically the files I have modified are the following: glue_rad-rad_ctl3c.F90, r2_lwrad3c.F90, r2_swrad3c.F90, r2_set_aerosol_field-fill3a.F90, r2_set_aero_clim_hadcm3-fill3a.F90

(I take latitudinal information from glue_rad and pass it down through r2_lwrad and r2_swrad to r2_set_aerosol_field and then to r2_set_aero_clim, in order to make the volcanic aerosol latitudinally dependent rather than a constant value over latitude.)

Best regards,

Simon

Change History (11)

comment:1 Changed 9 years ago by willie

  • Keywords aerosol, MONSooN added; aerosol removed

Hi Simon,

The job runs for 1928 time steps before becoming unstable ("RHS zero so GCR(2) not needed"). NaNs then appear in the output. This occurs after the warning message about volcanic forcing from routine NI_rad_ctl3c.

So the problem is not one of setting up, but is likely to be a science issue.

I hope that helps.

Regards,

Willie

comment:2 Changed 9 years ago by SimonDriscoll

Hi Willie,

I'm confident that none of the volcanic related adjustments have been done in ni_rad_ctl_3c.F90.

I believe this warning to be a more general statement associated with the runs in general. Above this is written:

"Time varying volcanic forcing has been selected but

the check that the run is global has been disabled.
Only use this option for global simulations.
*
UM WARNING :
Routine generating warning: NI_rad_ctl3c
Warning code: -1
Warning message:

WARNING: VOLCANIC FORCING

*"

I believe that Ros has told me this is just a warning related to this, but not an error. Indeed, it present in the runs (where no aerosol changes have been made and a normal branch of the UM is run, e.g. run xfzog).

Is there something you know of that would cause "RHS zero so GCR(2) not needed" and NaNs?

In particular, I am aware that some of the calls made to subroutines may not be quite correct, as I have been learning Fortran alongside this. The computations are easy, just 'employing' an equation, and in fact I only do this once. The rest is passing information down from glue_rad to r2_set_aero_clim_hadcm3-fill3a.F90, as I mentioned (there, I add in one equation). Is there something, such as the call statements, that may be causing the NaNs?

Best regards and thanks for helping me on this problem,

Simon

comment:3 Changed 9 years ago by SimonDriscoll

"Indeed, it present in the runs (where no aerosol changes have been made and a normal branch of the UM is run, e.g. run xfzog)." —> "Indeed, it present in the runs where no aerosol changes have been made and a normal branch of the UM is run (e.g. run xfzog)."

comment:4 Changed 9 years ago by willie

Hi Simon,

The only difference between xfzog (working) and xgfel (non-working) is that you've replaced the cmip5control branch with the Improved_Volc branch.

I note that you have allocated an array with an extra dimension tr_vars. Note that allocate is a request for memory which may not be satisfied: you should add the 'stat = ' clause and check the value. If the allocate fails, then the situation is irrecoverable, and you should stop the program with a friendly message to say so.

You have assumed that tr_vars is at least 7 in the code. If it is less, then you will be accessing illegal memory. It is often a good idea to replace magic numbers like 7 with a parameter (e.g. integer, parameter :: max_tr_vars = 7).

I would also check that when passing values into a subroutine sensible values have arrived - failure here is often a source of difficult errors.
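To illustrate the last two points, here is a minimal sketch written in Python rather than Fortran, purely to show the logic. MAX_TR_VARS, check_tr_vars and check_field are hypothetical names, and the latitude range assumes degrees, which may not match what the routines actually receive.

```python
import math
from typing import Iterable

MAX_TR_VARS = 7  # named constant instead of a bare 7 (value from the discussion above)


def check_tr_vars(tr_vars: int) -> None:
    """Stop early if the code would index beyond the tr_vars dimension."""
    if tr_vars < MAX_TR_VARS:
        raise ValueError(
            f"tr_vars = {tr_vars}, but the code indexes up to {MAX_TR_VARS}")


def check_field(name: str, values: Iterable[float], lo: float, hi: float) -> None:
    """Stop early if an incoming field holds NaN/Inf or values outside [lo, hi]."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            raise ValueError(f"{name}[{i}] is not finite: {v}")
        if not lo <= v <= hi:
            raise ValueError(f"{name}[{i}] = {v} is outside [{lo}, {hi}]")


# Assumption: latitudes arrive in degrees, so they should lie within [-90, 90].
check_tr_vars(7)
check_field("latitude", [-90.0, 0.0, 45.0, 90.0], -90.0, 90.0)
```

Failing at the point where a bad value arrives tends to be far easier to diagnose than NaNs that surface many time steps later.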

Beyond carefully checking the code, I cannot comment further on the meteorology/physics involved.

Regards,

Willie

comment:5 Changed 9 years ago by willie

Hi again Simon,

Just how big is row_length x rows x tr_levels x tr_vars, at 8 bytes per entry? If it is more than 32 Gbytes then you will be in trouble.
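The size can be estimated with a few lines of arithmetic. The Python sketch below uses placeholder dimensions that are NOT taken from the job and should be replaced with the real values; it also assumes a Gbyte means 2^30 bytes. The function names are hypothetical.

```python
BYTES_PER_ENTRY = 8  # 8 bytes per entry, as stated above


def array_bytes(row_length: int, rows: int, tr_levels: int, tr_vars: int) -> int:
    """Memory needed for a row_length x rows x tr_levels x tr_vars array."""
    return row_length * rows * tr_levels * tr_vars * BYTES_PER_ENTRY


def exceeds_limit(n_bytes: int, limit_gbytes: float = 32.0) -> bool:
    """True if n_bytes is above the limit (assumption: 1 Gbyte = 2**30 bytes)."""
    return n_bytes > limit_gbytes * 1024**3


# Placeholder dimensions, NOT read from the job: substitute the real values.
size = array_bytes(row_length=192, rows=145, tr_levels=38, tr_vars=7)
print(size, "bytes; exceeds 32 Gbytes:", exceeds_limit(size))
```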

Willie

comment:6 Changed 9 years ago by SimonDriscoll

Hi Willie,

Thanks for your advice. Indeed, I am using a new branch. However, I do not know about tr_vars. Could you be more specific about where tr_vars is coming from? I have searched for tr_vars in all the subroutines that I have modified, and the search fails for every file (although the search command works perfectly when I search for a variable I know is in there).

The new branch was a merger of my branch and Steve Hardiman's branch, which I was told works correctly. To incorporate my changes I needed to merge the two, so that the model understood how to read the differences in the code between my branch and Steve's. I was then told that, following the merger, I can turn off Steve's branch and activate mine, and that this should enable mine correctly without conflict within the model. Is it possible that the version of Steve's branch that I was told would run doesn't? That's all I can think of if there's an error outside the subroutines I've modified. If you could tell me where tr_vars is, then I can ask Steve whether he has also had an issue with it.

Best regards and thanks once again, much appreciated,

Simon

comment:7 Changed 9 years ago by willie

Hi Simon,

If you use the code browser in the Unified model wiki, you can see all the changes since you started the branch. Go to http://puma.nerc.ac.uk/trac/UM/browser/UM/branches/dev/SimonDriscoll, and click on the revision number 7917. Then click view changes. This shows all the changes you have made to the branch since it was copied from HadGEM2-ES.

The addition of the tr_vars occurs in atm_step.F90, the first bit of code.

It is always a good idea to check that the original code works before modifying it.

Regards,

Willie

comment:8 Changed 9 years ago by SimonDriscoll

Hi Willie,

Again, I've never heard of atm_step.F90, and I can't see where I've 'added' anything. Certainly I have no interest in modifying anything to do with atm_step.F90, whose description is "! Description: Perform a 1-timestep integration of the Atmosphere Model, including assimilation, physics and dynamics processing." In other words, I am utterly certain that any change to this code was not made wittingly.

Is there something that could have occurred during the merger by accident?

"The addition of the tr_vars occurs in atm_step.F90, the first bit of code." Is there a way to tell what exactly changed in this code 4 months ago? This is interesting though. I'm not completely convinced that the changes I made to the code explicitly as described above to glue_rad-rad_ctl3c.F90, r2_lwrad3c.F90, r2_swrad3c.F90, r2_set_aerosol_field-fill3a.F90 and r2_set_aero_clim_hadcm3-fill3a.F90 are wrong.

I believe we can rule out the warning on volcanic forcing as a possible error. It seems that

"UM ERROR (Model aborting) :
Routine generating error: Bi_linear_h
Error code: 10
Error message:
over-writing due to dim_e_out size"

and possible changes involving tr_vars (they could also be related) may be the problem.

Do you know if the apparent changes to tr_vars could be causing the model to abort with such an error?

Thinking about it, I recall that I have definitely run a version of the model with Steve's branch alone turned on and without my branch activated. So it must be something to do with changes post-merger.

Thanks Willie,

Simon

comment:9 Changed 9 years ago by willie

Hi Simon,
Looking back at the history of the Improved_Volc branch we have

  • it was created at rev 6395 from HadGEM2-ES@2580
  • it was modified
  • at rev 6932 the changes from Steve Hardiman's cmip5control branch at rev 4660 and HadGEM2-ES@2580 (again) were merged in
  • further changes were made to result in revision 7917

I think it is important to determine at exactly what revision this branch was working and then proceed from there. Generally, the principle should be small changes and frequent testing.
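One way to pin down the revision is a bisection over the branch history. The following Python sketch assumes the first revision in the list works, the last one fails, and the branch stays broken once it breaks. The works test is hypothetical: in practice it would mean building and running the job at that revision.

```python
from typing import Callable, Sequence


def first_failing_revision(revisions: Sequence[int],
                           works: Callable[[int], bool]) -> int:
    """Return the earliest revision for which works(rev) is False.

    Assumes revisions is sorted ascending, revisions[0] works,
    revisions[-1] fails, and a broken branch stays broken.
    """
    lo, hi = 0, len(revisions) - 1  # revisions[lo] works, revisions[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(revisions[mid]):
            lo = mid
        else:
            hi = mid
    return revisions[hi]


# Revisions named in the history above. The predicate is purely illustrative:
# the thread does not establish which revision first failed.
revisions = [6395, 6932, 7917]
print(first_failing_revision(revisions, works=lambda rev: rev < 6932))
```

Each call to works costs one build and run, so bisection keeps the number of test runs roughly logarithmic in the number of candidate revisions.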

When you do "view changes" the bits that you added are shown in green, whether you typed them yourself, or merged them in from another branch.

Regards,

Willie

comment:10 Changed 9 years ago by SimonDriscoll

Hi Willie,

Aha. OK, it seems something has been unwittingly changed during the merger. I'll look into this using http://puma.nerc.ac.uk/trac/UM/browser/UM/branches/dev/SimonDriscoll (thanks for the link). I'll get back to you in a short while if something related to this crops up.

Cheers,

Simon

comment:11 Changed 9 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed

OK Simon, I'll close the ticket then.

Willie
