Opened 9 months ago

Closed 7 months ago

#2387 closed help (fixed)

Difficulty generating timeseries in nesting suite

Reported by: nx902220 Owned by: willie
Priority: normal Component: UM Model
Keywords: IO server, time series Cc:
Platform: Monsoon2 UM Version: 10.5

Description (last modified by willie)

I have taken a copy of Willie McGinty's suite u-ar466, which I have called u-at199. In Willie's version of the suite, w timeseries are output in the 500 m, 100 m and 55 m nests. I am trying to output timeseries in the ukv and 300 m nests as well. I have created new time profiles and domains, and included them in optional .conf files. When I run the suite, the ukv.pp9 file which should contain the timeseries is empty.

Please can you help me with this?

Best wishes,

Lewis

Change History (21)

comment:1 Changed 8 months ago by willie

  • Description modified (diff)
  • Owner changed from um_support to willie
  • Platform set to Monsoon2
  • Status changed from new to accepted
  • UM Version changed from <select version> to 10.5

comment:2 Changed 8 months ago by willie

Hi Lewis,

The UKV run is the base configuration that the optional configurations override, so the 100m STASH must be off (i.e. not included) for the UKV run. However, you have introduced new UKV STASH, and this must therefore be included in the UKV run. So you don't need a 'ukv' package, but you do need to include that STASH item.

Regards
Willie

comment:3 Changed 8 months ago by nx902220

Hi Willie,

Thanks for your help. I'm still having difficulties. I have documented my efforts with the suite here (everything from comment 23 onwards is since I created this ticket with you):
https://code.metoffice.gov.uk/trac/roses-u/ticket/159#comment:27

In short:

I turned the UKV w timeseries STASH on in the main table, deleted the optional rose-app-ukv.conf file, and left the package section of the UKV w timeseries STASH request blank.

It fails when outputting the w timeseries after the 60th time step in ukv_um_fcst. From job.out:

GENERAL_GATHER_FIELD : Cannot have time series in async dumps using IOS
WRITEDUMP: Call to GENERAL_GATHER_FIELD failed
Return code was 1
Error message was GENERAL_GATHER_FIELD : timeseries field with IOS not allowed
Field number 150
Dimensions 63 x 1080
Grid type 1
Field was not written out

????????????????????????????????????????????????????????????????????????????????

???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 300
? Error from routine: UM_WRITDUMP
? Error message: Failure to gather field
? Error from processor: 0
? Error number: 21
????????????????????????????????????????????????????????????????????????????????

Please can you help with this?

Best wishes,

Lewis

comment:4 Changed 8 months ago by willie

Hi Lewis,

If you make changes to STASH you need to run the STASH transform macros. In this case the index of the w wind time series, 00150_2c8f434, is incorrect.

Regards
Willie

comment:5 Changed 8 months ago by nx902220

Hi Willie,

Thanks, and apologies: I ran the macro after I added the STASH, but later, when I made changes to the STASH, I forgot that saving in the GUI would not update rose-app.conf.

I ran the tidy STASH transform macro and can now see that the indexing is consistent between the main table in the GUI and rose-app.conf.

However, after running it overnight it has failed at the same place in ukv_um_fcst with the same error message.

Best wishes,

Lewis

comment:6 Changed 8 months ago by willie

Hi Lewis,

Could you try switching off IOS in the UKV run? You've been outputting time series in the nested LAMs and these have worked.

Originally in u-ar466, IOS was used for the 300m and 1km models; the latter was switched off, and the 300m model did not use time series. Now you have introduced time series for both the UKV and the 300m model.

I think the key is

GENERAL_GATHER_FIELD : Cannot have time series in async dumps using IOS

though I am not sure why.

So in the suite.rc change the UKV model settings to

UM_ATM_NPROCY = 36
FLUME_IOS_NPROC = 0

and the 300m model settings to

UM_ATM_NPROCY = 16
FLUME_IOS_NPROC = 0

This keeps the compute decomposition the same but switches IOS off.

Willie

Last edited 8 months ago by willie (previous) (diff)

comment:7 Changed 8 months ago by willie

  • Keywords IO server, time series added

Hi Lewis,

Another, simpler way is to switch off asynchronous dumping in IOS. This is under IO Server → Acceleration, ios_use_async_dumps.

Regards
Willie

comment:8 Changed 8 months ago by nx902220

Hi Willie,

Thank you. I followed your instructions in comment 7 and the suite has now run through the UKV, 500 m and 300 m nests.

When I check the .pp9 output for these 3 nests in xconv they all look right: the correct number of levels (under the heading dx) and time steps (under the heading nt). The ukv.pp9 and 300m.pp9 files are 21M and the 500m.pp9 is 29M.

I use xconv to convert the files to netCDF. This works for the 500m.pp9 file (I have plotted the timeseries in Python). However, the 300m.pp9 and ukv.pp9 do not convert; xconv gives the error: Error in xseek in get_data.

I have tried converting the fields files to pp:
/common/um1/vn10.6/xc40/utilities/um-convpp ukv.pp9 ukv_w_timeseries_u-at199.pp
The resulting output file is 272 bytes for the UKV and 300m but 2.1M for the 500m.

This is really frustrating; I thought we'd got there when the model ran. Please can you help me with this?

Thanks again,

Lewis


comment:9 Changed 8 months ago by willie

Hi Lewis,

I'm afraid the 300m and ukv files are corrupt. You can see this by using

mule-cumf 300m.pp9 300m.pp9

and comparing it with the 500m file. I suspect that using IO servers with the async dumps off as I suggested hasn't worked properly. If you could repeat the run using the method in comment 6 above, this should be more reliable.

Regards
Willie

comment:10 Changed 8 months ago by nx902220

Hi Willie,

In suite.rc I changed:

title = "[UM] Run ukv "

[[[environment]]]

ONLYTO9 = false
UM_ATM_NPROCX = 8
UM_ATM_NPROCY = 36
FLUME_IOS_NPROC = 36

to

title = "[UM] Run ukv "

[[[environment]]]

ONLYTO9 = false
UM_ATM_NPROCX = 8
UM_ATM_NPROCY = 36
FLUME_IOS_NPROC = 0

The job fails at ukv_um_fcst with error message:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 100
? Error from routine: UM_SHELL
? Error message: UM started on 324 PEs but 288 asked for. Please adjust decomposition
? Error from processor: 297
? Error number: 0
????????????????????????????????????????????????????????????????????????????????

So we are 36 processors short. I'm guessing we asked for 8x36 = 288 processors.

Could we do
UM_ATM_NPROCX = 9
UM_ATM_NPROCY = 36

so that 9x36=324?

Best wishes,

Lewis

comment:11 Changed 8 months ago by willie

Hi Lewis,

Yes, that's what I should've said. Remember to do the 300m model too.

Regards
Willie

comment:12 Changed 8 months ago by nx902220

Hi Willie,

I did UKV:
UM_ATM_NPROCX = 9
UM_ATM_NPROCY = 36
FLUME_IOS_NPROC = 0

it failed with error

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: DECOMPOSE_FULL
? Error message: Cannot run with an odd ( 9) number of processors in the East-West direction.
? Error from processor: 0
? Error number: 17
????????????????????????????????????????????????????????????????????????????????

Do I need 324 processors?
If so I could try
UM_ATM_NPROCX = 12
UM_ATM_NPROCY = 27
FLUME_IOS_NPROC = 0

What do you think? Cheers,

Lewis
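The decomposition errors in comments 10 and 12 come down to two constraints: compute plus IO-server processors must equal the PEs the job starts on, and the east-west processor count must be even. A minimal sketch of that check (the helper name check_decomposition is ours, not part of the UM, and these are assumed to be the only two constraints that matter here):

```python
def check_decomposition(nprocx, nprocy, ios_nproc, total_pes):
    """Check a UM processor decomposition against the two constraints
    seen in this ticket: the east-west count (NPROCX) must be even, and
    NPROCX * NPROCY + IOS processors must equal the PEs the job has."""
    if nprocx % 2 != 0:
        return False, "odd east-west processor count: %d" % nprocx
    used = nprocx * nprocy + ios_nproc
    if used != total_pes:
        return False, "asked for %d PEs but job has %d" % (used, total_pes)
    return True, "ok"

# 9 x 36 with IOS off is rejected: odd east-west count (comment 12)
print(check_decomposition(9, 36, 0, 324))
# 8 x 36 with IOS off only uses 288 of the 324 PEs (comment 10)
print(check_decomposition(8, 36, 0, 324))
# 12 x 27 with IOS off uses all 324 PEs (the proposal above)
print(check_decomposition(12, 27, 0, 324))
```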

comment:13 Changed 8 months ago by willie

Hi Lewis,
Yes, that should work. Or you could try 10x36 and change to

            TOTAL_MPI_TASKS = 360

But if you do that then you'll need TOTAL_MPI_TASKS/MPI_TASKS_PER_NODE = 360/9 = 40 nodes, so you would then need to change select=36 to select=40 in the directives section.

Willie
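Willie's node arithmetic above can be sketched as a one-line calculation (nodes_needed is a hypothetical helper; MPI_TASKS_PER_NODE = 9 and the select values are the figures quoted in this thread):

```python
import math

def nodes_needed(total_mpi_tasks, mpi_tasks_per_node):
    """Number of nodes to request in the directives section (the
    'select=' value), rounding up so a partial node is still allocated."""
    return math.ceil(total_mpi_tasks / mpi_tasks_per_node)

# 10 x 36 = 360 tasks at 9 tasks per node -> select=40
print(nodes_needed(360, 9))
```

The same arithmetic gives select=32 for a 288-task configuration (288 / 9 = 32).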

comment:14 Changed 8 months ago by nx902220

Hi Willie,

I'm trying what you suggested in comment 13. It is taking a long time because it keeps failing at ukv_um_fcst with the segmentation fault we saw before. I keep re-triggering; in the past it would work after 4 re-triggers, but I am now on 6. Each time I re-trigger it queues all day and does not run until the night, so it has taken me a week so far. Is there a way of making this quicker?

Best wishes,

Lewis

comment:15 Changed 8 months ago by willie

Hi Lewis,

Just to summarise the error, we're getting a segmentation fault at the 60th time step, after outputting some STASH:

[71] exceptions: An exception was raised:11 (Segmentation fault)
[71] exceptions: the exception reports the extra information: Address not mapped to object.
[71] exceptions: whilst in a serial region
[71] exceptions: Task had pid=1791 on host nid00877
[71] exceptions: Program is "/home/d04/lblunn/cylc-run/u-at199/share/fcm_make/build-atmos/bin/um-atmos.exe"
[71] exceptions: calling registered handler @ 0x20019d80
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[71] exceptions: Done callbacks
[71] exceptions: *** GLIBC ***
[71] exceptions: Data address (si_addr): 0x10013a01000; rip: 0x24630940
[71] exceptions: [backtrace]: has   6 elements:
[71] exceptions: [backtrace]: (  1) : Address: [0x24630940] 
[71] exceptions: [backtrace]: (  1) : __cray_dcopy_HSW (* Cannot Locate *)
[71] exceptions: [backtrace]: (  2) : Address: [0x2001c5ca] 
[71] exceptions: [backtrace]: (  2) : signal_do_backtrace_linux in file /home/d04/lblunn/cylc-run/u-at199/share/fcm_make/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions-platform/exceptions-linux.c line 78
[71] exceptions: [backtrace]: (  3) : Address: [0x2001a73b] 
[71] exceptions: [backtrace]: (  3) : signal_do_backtrace in file /home/d04/lblunn/cylc-run/u-at199/share/fcm_make/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions.c line 270
[71] exceptions: [backtrace]: (  4) : Address: [0x2001ae37] 
[71] exceptions: [backtrace]: (  4) : signal_handler in file /home/d04/lblunn/cylc-run/u-at199/share/fcm_make/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions.c line 672
[71] exceptions: [backtrace]: (  5) : Address: [0x23646e70] 
[71] exceptions: [backtrace]: (  5) : __restore_rt in file sigaction.c line 672
[71] exceptions: [backtrace]: (  6) : Address: [0x24630940] 
[71] exceptions: [backtrace]: (  6) : __cray_dcopy_HSW (* Cannot Locate *)

This has been occurring intermittently, and we've got round it simply by re-triggering the ukv_um_fcst task.

I had hoped that by giving the task more processors (we've gone from a total of 288 to 360) this type of error would become less likely, but that clearly has not happened.

I will try a run at reduced optimisation to see if that makes any difference.

Regards
Willie

comment:16 Changed 8 months ago by willie

Hi Lewis,

You could revert to 8x36 = 288 processors, with FLUME_IOS_NPROC=0 and select=32, in case that's any better. We just need to switch IOS off to get the time series out.

Willie

comment:17 Changed 7 months ago by nx902220

Hi Willie,

I did as you said in comment 16. It failed over the weekend with a segmentation fault, but when I re-triggered at 9 AM this morning the UKV ran fine, and the suite is now on the 500 m nest.

Unfortunately the ukv.pp9 file is still corrupt.

Best wishes,

Lewis

comment:18 Changed 7 months ago by willie

Hi Lewis,

I have created another ticket, #2429, for the intermittent failure problem.

I looked at the ukv.pp9 in mule-cumf

frmy@xcs-c$ mule-cumf ukv.pp9 ukv.pp9
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
* (CUMF-II) Module Information *
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
mule       : /projects/um1/lib/python2.7/mule/__init__.py (version 2017.06.1)
um_utils   : /projects/um1/lib/python2.7/um_utils/__init__.py (version 2017.06.1)
um_packing : /projects/um1/lib/python2.7/um_packing/__init__.py (version 2017.06.1) (packing lib from SHUMlib: 2017061)


/projects/um1/lib/python2.7/mule/validators.py:182: UserWarning: 
File: ukv.pp9
Field validation failures:
  Fields (0)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
* CUMF-II Comparison Report *
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

File 1: ukv.pp9
File 2: ukv.pp9
Files compare
  * 0 differences in fixed_length_header (with 7 ignored indices)
  * 0 field differences, of which 0 are in data

Compared 1/1 fields, with 1 matches

This basically says that mule-cumf was able to read the file without a problem apart from the irregular lbcode (see UMDP F03). So it looks good.

Regards
Willie

comment:19 Changed 7 months ago by nx902220

Hi Willie,

Yes, thank you. I got confused when trying to look at the data in xconv; I have now viewed it in Python.

There was a segmentation fault in the 300 m nest, but I re-triggered and it is now half way through the 300 m forecast.

It looks like we have got there!

Thank you very much Willie.

Best wishes,

Lewis

comment:20 Changed 7 months ago by willie

  • Status changed from accepted to new

comment:21 Changed 7 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed