Opened 4 weeks ago

Last modified 3 weeks ago

#2277 accepted help

u-ap276 MONSooN nesting suite UM_WRITDUMP error

Reported by: nx902220
Owned by: willie
Priority: normal
Component: UM Model
Keywords: STASH indices
Cc:
Platform: Monsoon2
UM Version: 10.5

Description

Hi,

I am running a nesting suite u-ap276 which is being modified to allow tracer release through UM Science Settings > Section 33.

The suite runs successfully through the UKV, 500m and 300m nests until the 100m nest where it fails with job.err:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 300
? Error from routine: UM_WRITDUMP
? Error message: Failure to gather field
? Error from processor: 1000
? Error number: 22
????????????????????????????????????????????????????????????????????????????????

It cannot make the first dump file dymeea_da001.
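
For comparing failures between nests, the fields of an error banner like the one above can be pulled out of job.err with a short script. This is only a minimal sketch in Python, not part of the UM or Rose tooling: the function name parse_um_error is hypothetical, and the banner layout is assumed from the excerpt quoted above.

```python
import re

# Matches banner lines such as "? Error code: 300". The layout is assumed
# from the job.err excerpt in this ticket.
FIELD_RE = re.compile(r"^\?\s*(Error [^:]+):\s*(.*)$")


def parse_um_error(text):
    """Return a dict of the fields found in a UM error banner (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


# Sample text copied from the banner quoted in this ticket.
SAMPLE = """\
????????????????????????????????????????
???!!!???!!! ERROR ???!!!???!!!
? Error code: 300
? Error from routine: UM_WRITDUMP
? Error message: Failure to gather field
? Error from processor: 1000
? Error number: 22
????????????????????????????????????????
"""

if __name__ == "__main__":
    for key, value in parse_um_error(SAMPLE).items():
        print(f"{key}: {value}")
```

Running the same function over the job.err of each nest gives the routine and message side by side, which makes it easier to see whether two failures are the same problem.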

From the UM_WRITDUMP code:
https://code.metoffice.gov.uk/trac/um/browser/main/trunk/src/control/dump_io/um_writdump.F90

it seems to be some sort of processor error that I don't understand.

If I run the suite without tracers then I do not get the error.

I will be very grateful for your support.

Best wishes,

Lewis

Change History (6)

comment:1 Changed 4 weeks ago by willie

  • Keywords STASH indices added
  • Owner changed from um_support to willie
  • Platform set to Monsoon2
  • Status changed from new to accepted

Hi Lewis,

When you add new STASH you need to run the STASH macro (TidyStashTransform) to generate indices for any new requests or profiles. It's under Metadata > um in the Rose GUI.

Regards
Willie

comment:2 Changed 3 weeks ago by nx902220

Hi Willie,

Thanks for getting back to me.

I did as you said and then I ran the nesting suite from the start again. Build and UKV succeed but 500m fails in 500m_um_fcst_00. job.err:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: UM_SHELL
? Error message: ERROR,RDBASIS: LEVELS LIST HAS NO ENTRIES
? Error from processor: 0
? Error number: 17
????????????????????????????????????????????????????????????????????????????????

[0] exceptions: An non-exception application exit occured.
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=65280 on host nid06309
[0] exceptions: Program is "/home/d04/lblunn/cylc-run/u-ap276/share/fcm_make/build-atmos/bin/qxatmos.exe.PS38"
[0] exceptions: calling registered handler @ 0x20019d80
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[0] exceptions: Done callbacks
Rank 0 [Mon Sep 25 13:49:41 2017] [c4-2c2s9n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0

I cannot work out why I am getting this error.

I will be grateful for your help.

Best wishes,

Lewis

comment:3 Changed 3 weeks ago by willie

Hi Lewis,

There seem to be a number of problems with this suite. There are a lot of metadata warnings causing some functionality to be switched off, inconsistent template variables and STASH validate errors. Has this ever worked?

If you're looking for a nesting suite to begin developing, you could try the 2017 example on the Nesting Suite tutorial page - https://code.metoffice.gov.uk/trac/rmed/wiki/suites/nesting. This is a very modern suite and there are a number of people using it in their research.

Regards
Willie

comment:4 Changed 3 weeks ago by nx902220

Hi Willie,

I am taking a nesting suite from Humphrey Lean and trying to modify it to include tracer release.

I took Humphrey's suite u-am764 and copied it. The copy is called u-ao177, and I modified it to run successfully on MONSooN. However, it doesn't include tracer release.

My suite u-ap276 is based on u-ao177 and is my attempt to add tracer release, so I need to start working from u-ao177.

Best wishes,

Lewis

comment:5 Changed 3 weeks ago by willie

Hi Lewis,

I looked at u-ao177, and only the 55m_um tasks have run, plus some archiving. In u-ap276, however, you have the 500m tasks switched on, and it is failing in 500m_fcst_00, as you said, which is something you have not modified.

You also need to run the checker (Validate) macros to verify STASH items are available and are set up correctly - slide 46 in http://cms.ncas.ac.uk/documents/training/RoseMay2017/UM_conversion_presentations.pdf. There is currently one error that needs to be corrected. These STASH errors are also present in the original u-am764 job.

The traceback for the 500m_fcst_00 error is

ATP Stack walkback for Rank 0 starting:
  [empty]@0xffffffffffffffff
  um_main_@um_main.F90:19
  um_shell_@um_shell.F90:709
  ereport64$ereport_mod_@ereport_mod.F90:169
  gc_abort_@gc_abort.F90:136
  mpl_abort_@mpl_abort.F90:43
  pmpi_abort@0x242db19c
  PMPI_Abort@0x24305c34
  MPID_Abort@0x24333c71
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
Forcing core dump of rank 0

This shows that the model has not started to process any time steps and is still in the setup phase. This is almost certainly due to errors in the suite.
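
The reading of the walkback above can be sketched in Python: keep only the frames that point at Fortran source, then report the frame that sits just before the call into ereport. This is a rough illustration rather than a supported tool; the helper names fortran_frames and caller_of_ereport are hypothetical, and the frame format is assumed from the ATP output quoted above.

```python
import re

# A frame with source information looks like "symbol@file:line".
# Frames that only carry an address (e.g. "pmpi_abort@0x242db19c") do not match.
FRAME_RE = re.compile(r"^\s*(?P<symbol>\S+)@(?P<file>[^:@\s]+):(?P<line>\d+)\s*$")


def fortran_frames(walkback):
    """Return (symbol, file, line) for every frame that points at a .F90 file."""
    frames = []
    for raw in walkback.splitlines():
        match = FRAME_RE.match(raw)
        if match and match.group("file").endswith(".F90"):
            frames.append(
                (match.group("symbol"), match.group("file"), int(match.group("line")))
            )
    return frames


def caller_of_ereport(frames):
    """Return the frame listed immediately before the first ereport frame, or None."""
    for index, (_, filename, _) in enumerate(frames):
        if filename.startswith("ereport") and index > 0:
            return frames[index - 1]
    return None


# Sample copied from the walkback quoted in this ticket.
SAMPLE = """\
ATP Stack walkback for Rank 0 starting:
  [empty]@0xffffffffffffffff
  um_main_@um_main.F90:19
  um_shell_@um_shell.F90:709
  ereport64$ereport_mod_@ereport_mod.F90:169
  gc_abort_@gc_abort.F90:136
  mpl_abort_@mpl_abort.F90:43
  pmpi_abort@0x242db19c
  abort@abort.c:92
ATP Stack walkback for Rank 0 done
"""

if __name__ == "__main__":
    print(caller_of_ereport(fortran_frames(SAMPLE)))
```

For this walkback the frame before ereport is in um_shell.F90, which is consistent with the failure being raised directly from the top-level shell routine rather than from inside the time-stepping code.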

It is always a good idea to start from a standard suite that is known to run. Getting an arbitrarily chosen suite to run can involve many months of effort.

Regards
Willie

comment:6 Changed 3 weeks ago by nx902220

Hi Willie,

There is output from all of the nests in the u-ao177 suite since I ran them all. I used true and false in rose-suite.conf to run each nest individually.

In u-ap276, when I run TidyStashValidate, I get an "Identical sections: namelist…" error.
I removed one of the identical sections so I am hoping that will be OK now.
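
A check of this kind can be sketched outside the Rose GUI. The Python below is a minimal illustration only, under stated assumptions: the app configuration is treated as a simple INI-style file with [namelist:streq(...)] section headers, "identical" is assumed to mean two sections of the same namelist type with the same settings, and the function name find_identical_sections, the index values and the settings in the sample are made up for the example. It is not the logic of the Validate macro itself.

```python
from collections import defaultdict


def find_identical_sections(conf_text):
    """Group sections of the same namelist type that have identical settings.

    Hypothetical helper: a simplified INI-style reader, not the Rose parser.
    """
    sections = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            sections[current].append((key.strip(), value.strip()))

    groups = defaultdict(list)
    for name, items in sections.items():
        # Compare only sections of the same type, e.g. "namelist:streq".
        section_type = name.split("(")[0]
        groups[(section_type, tuple(sorted(items)))].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Made-up sample: the first two sections have the same settings.
SAMPLE = """\
[namelist:streq(aaaa1111)]
dom_name='DIAG'
isec=33
item=1

[namelist:streq(bbbb2222)]
dom_name='DIAG'
isec=33
item=1

[namelist:streq(cccc3333)]
dom_name='DIAG'
isec=33
item=2
"""

if __name__ == "__main__":
    for names in find_identical_sections(SAMPLE):
        print("Identical sections:", ", ".join(names))
```

Listing the duplicates first makes it easier to decide which copy to remove before re-running the macros.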

Unfortunately my PhD project is on modelling turbulence and pollution dispersion at high resolution over London so I need to use the u-am764 suite developed by Humphrey Lean, Kirsty Hanley and Sylvia Bohnestengel. I realise that what I am doing requires considerable effort but I have no choice. It took me a long time to get their suite running on MONSooN without tracer release.

I am not sure how to resolve the 500m run issue. The UKV nest works and produces output, so I think the error is specific to the 500m nest. I looked at
https://code.metoffice.gov.uk/trac/um/browser/main/trunk/src/control/top_level/um_shell.F90
but I cannot work it out from there.
Do you think it might be possible for me to sit down with you for 15 minutes? I'm on a steep learning curve, and any advice you can give would be much appreciated.

Best wishes,

Lewis
