Opened 5 years ago

Closed 5 years ago

#1331 closed help (fixed)

cant find required STASH item 1 section 39 model 1 in STASHmaster

Reported by: cwright Owned by: um_support
Component: UM Model Keywords: monsoon, stashmaster
Cc: Platform: MONSooN
UM Version: 6.6.3

Description (last modified by ros)

Hi,

I'm still working to move a job from Archer to Monsoon (see http://cms.ncas.ac.uk/ticket/1310). I think I've located all the ancillaries now, but I'm getting a new error when I try to run the model, as shown below.

The job is xkbrb. It's possible (probable) that the error is due to a change I made while moving the job between machines. If so, most of the changes between the Archer version and this one are in the edit history of xjzia. (I'm not sure why I created xkbrb separately, since xjzia has never worked on Monsoon either! I'll consolidate them once it's working.) In particular, a few of the ancillaries didn't have the same names or paths on the two machines. I made educated guesses for most of them, but there's no guarantee I got them right.


 Error Code:-  2
 Error Message:-  Cant find required STASH item  1  section  39  model  1  in STASHmaster
 Error generated from processor  0
/nerc/clpredic/cwright/um/xkbrb/bin/qsexecute: Error in dump reconfiguration - see OUTPUT
 Error Code:-  2
 Error Message:-  Cant find required STASH item  1  section  39  model  1  in STASHmaster
 Error generated from processor  0
 ERROR!!! in reconfiguration in routine Rcf_Exppx
 Error Code:-  2
 Error Message:-  Cant find required STASH item  1  section  39  model  1  in STASHmaster
 Error generated from processor  0
/nerc/clpredic/cwright/um/xkbrb/bin/qsexecute: Error in dump reconfiguration - see OUTPUT
/nerc/clpredic/cwright/um/xkbrb/bin/qsfinal: Model xkbrb - Error: No history files
 ERROR!!! in reconfiguration in routine Rcf_Exppx
 Error Code:-  2
 Error Message:-  Cant find required STASH item  1  section  39  model  1  in STASHmaster
 Error generated from processor  0

Change History (8)

comment:1 Changed 5 years ago by grenville

Corwin

Sorry for the late reply - is this still a problem?

Best

Grenville

comment:2 Changed 5 years ago by cwright

Hi Grenville

I've partially resolved it by copying Guiying Yang's job xkdgb, which is very similar and which Ros had previously helped her get working on Monsoon. My copy of that compiles and runs absolutely fine, so I think we've definitely broken the back of the problem.

However, the next step, getting it to run in nudged mode, is proving more problematic. Specifically, when I try to run the nudged model instead of the "normal" 6.6.3 (job xkhzb for nudged versus job xkhzb for "normal"), it doesn't compile. There are two .comp.leave files from my attempts (on Monsoon):

~cwright/output/xkhzb000.xkhzb.d14231.t145527.comp.leave
~cwright/output/xkhzb000.xkhzb.d14238.t171332.comp.leave

In both, the main problem appears to be:

Base build: OK
poserror === End of Compilation 1 ===
ioerror === End of Compilation 1 ===
gmake: *** [nudging_netcdf_dimreader.o] Error 1
gmake: *** [nudging_netcdf_dimreader.o] Error 1
1501-511 Compilation failed for file nudging_netcdf_dimreader.f90.
fcm_internal compile failed (256)
gmake -f /home/cwright/xkhzb/ummodel/Makefile -j 6 all failed (2) at /projects/um1/fcm/bin/../lib/Fcm/Build.pm line 611

For reference, the nudged run should be identical to the original, except that the hand edit in ~cwright/umui_jobs/hand_edits/nudge_ecmwf.ed (on Puma) is applied and the FCM user working copy modifications in ~cwright/nudge_hirdls/hg6.6.3_nudge_hirdls (also on Puma) are included.

nudging_netcdf_dimreader.F90 is present in ~cwright/nudge_hirdls/hg6.6.3_nudge_hirdls/src/atmosphere/nudging on Puma, and I don't think there are any significant differences from the version that worked on Archer (xiwxb). Should the file also be somewhere on Monsoon?

N.B. Ros also emailed me about this a few days ago, as we discussed it briefly when I was in Reading a few weeks ago; I've just replied to her with essentially the same content as this post.

Corwin

comment:3 Changed 5 years ago by ros

Hi Corwin,

A few lines above the error you posted, there is another message saying that the compiler cannot find the NetCDF library:

"/home/cwright/xkhzb/ummodel/ppsrc/UM/atmosphere/nudging/nudging_netcdf_dimreader.f90", line 38.11: 1514-219 (S) Unable to access module symbol file for module netcdf. Check path and file permissions of file. Use association not done for this module.

In the window Submodel-independent → Compilations and modifications → User Override files, please include my machine override file ~ros/umui_jobs/overrides/monsoon/2c_netcdf_6.6.3.ovr, which adds the paths to the NetCDF library on MONSooN.

You should find the job will then compile successfully.

Cheers,
Ros.

comment:4 Changed 5 years ago by ros

  • Description modified

comment:5 Changed 5 years ago by cwright

Hi Ros,

That fixes the compile error, and the model now runs for a few timesteps. However, it crashes at (I think) timestep 144 with the following error (from xkhzb000.xkhzb.d14239.t104313.leave):

*
UM ERROR (Model aborting) :
Routine generating error: Bi_linear_h
Error code: 10
Error message:

over-writing due to dim_e_out size

*

There are several NaNs on the lines immediately above that:

Atm_Step: Timestep 144
Before resetting tracers (AtmStep):
Tracer1: 1.00000000000000000 0.000000000000000000E+00 NaNQ
Tracer2: 1.00000000000000000 0.281015045551440691E-59 NaNQ
Tracer3: 1.00000000000000000 0.663204960434226440E-58 NaNQ
Tracer4: 0.214384518870391569E-21 0.000000000000000000E+00 NaNQ
Tracer5: 171602.882744588511 0.000000000000000000E+00 NaNQ
Tracer6: 0.201683183259067287E-168 0.000000000000000000E+00 NaNQ
Tracer7: 1.00000000000000000 0.000000000000000000E+00 NaNQ
After resetting tracers (AtmStep):
Tracer1: 1.00000000000000000 0.000000000000000000E+00 NaNQ
Tracer2: 1.00000000000000000 0.281015045551440691E-59 NaNQ
Tracer3: 1.00000000000000000 0.663204960434226440E-58 NaNQ
Tracer4: 0.214384518870391569E-21 0.000000000000000000E+00 NaNQ
Tracer5: 171602.882744588511 0.000000000000000000E+00 NaNQ
Tracer6: 0.201683183259067287E-168 0.000000000000000000E+00 NaNQ
Tracer7: 1.00000000000000000 0.000000000000000000E+00 NaNQ

Could these be related?
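Rather than reading through a long .leave file by eye, you can scan it for the first timestep at which a NaN is printed. The following is a minimal Python sketch. It assumes the log layout shown in the excerpt above: an "Atm_Step: Timestep N" line before each step's diagnostics, and NaNs written as "NaNQ"/"NaNS" (IBM XL Fortran style) or plain "NaN". The function name first_nan_timestep is hypothetical and not part of the UM. It also only sees NaNs that actually appear in the printed diagnostics.

```python
import re

# Hypothetical helper, not part of the UM. Assumed log layout:
#   "Atm_Step: Timestep N" precedes each step's diagnostics, and NaN values
#   are printed as "NaN", "NaNQ" or "NaNS".
TIMESTEP_RE = re.compile(r"Atm_Step:\s*Timestep\s+(\d+)")
NAN_RE = re.compile(r"\bNaN[QS]?\b")


def first_nan_timestep(lines):
    """Return (timestep, line) for the first line containing a NaN, or None.

    timestep is None if a NaN appears before any "Atm_Step: Timestep" line.
    """
    current = None
    for line in lines:
        match = TIMESTEP_RE.search(line)
        if match:
            # Remember which timestep the following diagnostics belong to
            current = int(match.group(1))
            continue
        if NAN_RE.search(line):
            return current, line.rstrip("\n")
    return None
```

For example, opening xkhzb000.xkhzb.d14239.t104313.leave and passing the file object to first_nan_timestep would return the earliest timestep whose printed diagnostics contain a NaN, together with the offending line.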

comment:6 Changed 5 years ago by ros

Hi Corwin,

The model has definitely become unstable at timestep 143. However, I notice that at timestep 1 there are already NaNs in the maximum vertical velocity at levels 19, 28 and 33, so I think there might be something wrong with your initial conditions. Try cumf'ing the start dump (and any non-standard ancillaries) against itself. For normal files the comparison should report no differences; for files containing NaNs, the differences are written to the summary file.
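Comparing a file with itself works as a NaN check because, under IEEE 754 floating-point rules, a NaN never compares equal to anything, including itself. Every ordinary value therefore matches its copy, and only the NaN points are reported as differences. The following Python sketch illustrates the principle on a plain list of values. It is not how cumf is implemented, and self_mismatches is a hypothetical name.

```python
import math


def self_mismatches(values):
    """Indices where a value does not compare equal to itself.

    Under IEEE 754, x != x is True only when x is NaN, so comparing a
    field against itself flags exactly the NaN points.
    """
    return [i for i, x in enumerate(values) if x != x]


field = [1.0, float("nan"), 0.5, float("nan"), math.inf]
print(self_mismatches(field))  # [1, 3]  (infinity compares equal to itself)
```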

Regards,
Ros.

comment:7 Changed 5 years ago by cwright

Hi Ros,

I suspected the problem wasn't in the model's initial conditions per se, as they were the same as for the unnudged runs, so I investigated the nudging data and the associated hand edits, since they were the only raw inputs that had changed. The data was fine, with no NaNs anywhere, but the problem turned out to be fixable by modifying the hand-edit file that controls the nudging.

Specifically, I was able to fix it by lowering the top level at which nudging is applied from level 54 to level 51. Level 54 was fine on Archer, so I'm a bit puzzled about why this works (maybe a subtle difference in how the Archer and Monsoon netCDF libraries handle edge cases? This is a blind guess!). However, it doesn't affect the scientific aim of the runs: the important part of the nudging is lower down anyway, and the top level was chosen fairly arbitrarily, so I'm happy with this fix.

I've now successfully run all three of the cases I need (unnudged, ECMWF-nudged, and HIRDLS-nudged) to the end of the period of interest. So I think I've finally got everything resolved, and *hopefully* it's now just a case of either Yang or me turning the handle on the runs themselves. Thanks for all the help :-)

Corwin

comment:8 Changed 5 years ago by ros

  • Resolution set to fixed
  • Status changed from new to closed

Hi Corwin,

Thanks for letting us know what the problem was; I'm glad you were able to track it down.

I shall close this ticket now. Please reopen or raise a new ticket if you have any further issues.

Regards,
Ros.
