Opened 6 months ago

Closed 2 weeks ago

#2443 closed help (wontfix)

u-aw776: problem including tracers in lateral boundary conditions of nesting suite

Reported by: nx902220 Owned by: willie
Priority: normal Component: UM Model
Keywords: craype-hugepages2M Cc:
Platform: Monsoon2 UM Version:

Description

Hi Willie,

I have copied u-at199 revision 74133 which runs all the way through without tracers. I then made the following changes:

https://code.metoffice.gov.uk/trac/roses-u/changeset?old_path=%2Fa%2Fw%2F7%2F7%2F6%2Ftrunk&old=74485&new_path=%2Fa%2Fw%2F7%2F7%2F6%2Ftrunk&new=74215

These changes include tracers and feed the UKV tracers into the lateral boundary conditions of the 500 m nest. I get the following error at the 500 m forecast:

Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked[989] exceptions: An exception was raised:11 (Segmentation fault)

I have just tried removing the tracer from the lateral boundary conditions and then tracers are output fine. So the problem must be associated with the lateral boundary conditions.

Please can you help me with this?

Best wishes,

Lewis

Attachments (1)

Note on Ticket 2443.pdf (54.4 KB) - added by willie 2 weeks ago.


Change History (21)

comment:1 Changed 6 months ago by willie

Hi Lewis,

Try looking through the error logs to see if there are any issues that need correcting. I always run:

cd ~/cylc-run/<suite-id>/log/job
find . -name job.err -exec ls -l {} \;

This gives a nice list of the error files and their sizes and you can easily look through them.
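
The same listing can be produced from Python when it is handier to sort or filter the results. This is only a minimal sketch and not part of the original advice: the function name `list_error_logs` is hypothetical, and it assumes the job logs sit under a cylc-run log directory laid out as in the commands above. It skips empty job.err files and puts the largest first.

```python
import os


def list_error_logs(root, name="job.err"):
    """Return (size_in_bytes, path) for every non-empty `name` under root, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > 0:
                found.append((size, path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # Suite id taken from the ticket title; replace with your own.
    log_root = os.path.expanduser("~/cylc-run/u-aw776/log/job")
    for size, path in list_error_logs(log_root):
        print(f"{size:>10}  {path}")
```

Sorting by size is a design choice rather than a rule: a large job.err is often, but not always, the one carrying the traceback.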

Regards
Willie

comment:2 Changed 6 months ago by nx902220

Hi Willie,

I ran the model again, this time in debug mode. The only error I can find is in the 500 m forecast stage (there are warnings in job.err from other processes, but these also appear in successful runs of the suite without tracers, so I don't think they are the cause of the problem). From job.out I can see it fails during the first time step.

The segmentation fault I am getting is the same as one you previously helped me with in an earlier version of the suite. The fix was to turn off smag s stash request (which I am still doing since I have taken the change forward in newer versions of the suite). Maybe there is a similar fix to this problem.
http://cms.ncas.ac.uk/ticket/2429

Maybe I am doing something silly but I have thought through what I'm doing and have discussed with more experienced UM users. I have checked the error messages but cannot make any progress with them.

If you have any advice I would be very grateful.

Best wishes,

Lewis

comment:3 Changed 6 months ago by grenville

Lewis

job.err has a back trace which shows that the problem is at line 149 in

/home/d04/lblunn/cylc-run/u-aw776/share/fcm_make/preprocess-atmos/src/um/src/control/stash/stash.F90

I'd try switching off stash systematically to see which one might be the problem. Alternatively, you could put in some simple write statements around line 194 to see what it's objecting to.
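
If each run is slow, a systematic switch-off can be organised as a bisection rather than one request at a time. The sketch below is a generic illustration, not a UM or Rose tool. `find_failing_request` and `run_fails` are hypothetical names, and in practice `run_fails` would mean editing the STASH requests, running the forecast and reporting whether it crashed. It assumes exactly one request is responsible and that the failure is reproducible; if the crash depends on a combination of requests, or comes and goes, the result can be misleading.

```python
def find_failing_request(requests, run_fails):
    """Bisect `requests` to find one item whose presence makes run_fails(subset) True.

    Assumes a single culprit and a reproducible failure.
    """
    candidates = list(requests)
    if not run_fails(candidates):
        return None  # nothing to find: the full set runs cleanly
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the failure.
        candidates = first if run_fails(first) else second
    return candidates[0]


if __name__ == "__main__":
    # Stand-in for real model runs: pretend request "req37" triggers the crash.
    names = [f"req{i}" for i in range(64)]
    print(find_failing_request(names, lambda subset: "req37" in subset))
```

With 64 requests this needs 7 runs (one to confirm the failure, then six halvings) instead of up to 64. Requests that must stay on, such as those feeding the next nest's boundary conditions, would have to be kept outside the list being bisected.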

Grenville

(Willie is out for a few days)

comment:4 Changed 5 months ago by nx902220

Hi Grenville,

I've tried turning off all stash requests other than the tracer stash request. I then get past the segmentation fault in the 500 m forecast. (I had to run UKV and 500m separately since UKV stash needs to be on to create the 500m initial and boundary condition files.) This suggests the problem is with a stash request other than the tracer one. For me to go through all of the stash systematically would take months, since one job run takes at least a day due to queuing.

I'm not sure how to do the write statement test you propose. If you still think that would be useful please can you tell me how to do it?

Best wishes,

Lewis

comment:5 Changed 5 months ago by willie

Hi Lewis,

Are you still getting the STASH error? The job seems to have carried on past the point in comment 4.

Regards
Willie

comment:6 Changed 5 months ago by nx902220

Hi Willie,

If you are referring to the model run which is running at the moment then yes it has got past the STASH error.

This is because I turned all stash requests off except for the tracer stash request. The STASH error must be associated with a stash request other than tracer.

I need to be able to run the suite with most stash requests on e.g. timeseries stash request and stash requests required to make LBCs for the next nest.

I suppose one test would be to turn all of those stash requests on and turn off those which are not important in the hope it is one of the unimportant stash requests which is the problem.

Cheers,

Lewis

comment:7 Changed 5 months ago by willie

Hi Lewis,

Can you back up to the point where the STASH error occurred and repeat the run to get the error again? Then I'll be able to have a more detailed look.

Regards
Willie

comment:8 Changed 5 months ago by nx902220

Hi Willie,

I have just run the 500 m nest with STASH for tracer and all of the prognostics used to make boundary conditions for the 300 m nest turned on without error.

I will try running it again with more stash turned on but only those that I know I need (some before were just on because they were when I picked up the suite).

If I get the STASH error when I do this I will let you know.

Cheers,

Lewis

comment:9 Changed 5 months ago by nx902220

Hi Willie,

The run I mentioned in comment 8 has failed with the STASH error. The previous run with less stash turned on did not fail.

A change set between these 2 runs is here:
https://code.metoffice.gov.uk/trac/roses-u/changeset?old_path=%2Fa%2Fw%2F7%2F7%2F6%2Ftrunk%2Fapp%2Fum&old=77900&new_path=%2Fa%2Fw%2F7%2F7%2F6%2Ftrunk%2Fapp%2Fum&new=78216

If you could help me find out what the problem is it would be much appreciated.

Cheers,

Lewis

comment:10 Changed 5 months ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Lewis,

I have made some progress with this. Firstly, it has nothing to do with STASH. I increased the process count for the 500m forecast job to 36x36=1296 cores and this changed the error to

?  Error code: 300
?  Error from routine: UM_WRITDUMP
?  Error message: Failure to gather field
?  Error from processor: 1079
?  Error number: 22

after 360 time steps (= one hour) when writing dymeca_da000.

Switching on flush buffers (IO System settings > print Manager control pmt_force_flush) reveals further information:

WRITEDUMP: Call to GENERAL_GATHER_FIELD failed
Return code was  10
Error message was GENERAL_GATHER_FIELD : Field type not recognized
Field number  2
Dimensions  600  x  600
Grid type  4617315517961601024
Field was not written out

I don't yet know what field 2 is. The problem is the grid type is wrong. There are two possibilities: either there is another error corrupting the grid type, or else it is something you have changed in the 500m forecast set up.
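
One quick check on a value like this is whether it is a real number sitting where an integer was expected. The sketch below is an illustration added to the ticket text, not something taken from the UM source, and the helper name `int64_bits_as_double` is hypothetical. It reinterprets the 64-bit integer's bit pattern as an IEEE 754 double.

```python
import struct


def int64_bits_as_double(value):
    """Reinterpret the bit pattern of a signed 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<q", value))[0]


grid_type = 4617315517961601024  # "Grid type" value printed in the log above
print(hex(grid_type))                   # 0x4014000000000000
print(int64_bits_as_double(grid_type))  # 5.0
```

The reported grid type has exactly the bit pattern of the double-precision number 5.0. That would be consistent with real-valued data having been written over an integer word, though on its own it does not show where or why that happens.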

Do you have an example of dymeca_da000 from a previous closely related run that we could compare with?

Regards
Willie

comment:11 Changed 5 months ago by nx902220

Hi Willie,

Thank you for exploring this.

I have output from a similar suite in the following directory:
~/cylc-run/u-aw764/share/cycle/20160504T0300Z/500m_um
Although there is not a dymeca_da000. The first output is dymeca_da001.

It is probably unrelated but I thought I should mention anyway that I have always had a warning that doesn't stop the run in job.err for ukv_start (it happens for all of my nesting suites):

[lblunn@exvmsrose:~/cylc-run/u-aw764/log/job/20160504T0300Z/ukv_start/01]$ cat job.err
/projects/um1/lib/python2.7/mule/validators.py:182: UserWarning:
File: /home/d04/lblunn/cylc-run/u-aw764/share/cycle/20160504T0300Z/ukv_frame
Field validation failures:

Fields (0,1,2,3,4, … 31213 total fields) Cannot validate field due to incompatible grid type:
File grid : 0
Field grid: 3
warnings.warn(msg)

Best wishes,

Lewis

comment:12 Changed 4 months ago by willie

  • Keywords craype-hugepages2M added
  • Platform set to Monsoon2

Hi Lewis,

In a wild and random move, I now have the 500m model working with tracers. See my u-ay522. The secret is to delete the huge pages module from the 500m environment in suite.rc. I have no idea why this works. So, nothing to do with STASH, processors or writing dumps. It's probably a good idea to make some of the changes I suggested in other notes to you, to make the model a bit easier to use.

regards
Willie

comment:13 Changed 4 months ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed

comment:14 Changed 4 months ago by nx902220

Hi Willie,

I took a copy of your suite u-ay522 and it is called u-az169. A change set between the 2 suites is below:
https://code.metoffice.gov.uk/trac/roses-u/changeset?old_path=%2Fa%2Fy%2F5%2F2%2F2&old=81620&new_path=%2Fa%2Fz%2F1%2F6%2F9&new=82951

The suite successfully runs 500m_fcst_00 and I have checked output.

The suite fails at 500m_fcst_01 with job.err:

doStderr: received 397 bytes ALPSSTDIO_MSGTEXT
?????????????????????????????? WARNING ??????????????????????????????
? Warning code: -20
? Warning from routine: CHECK_DUMP_PACKING
? Warning message:
? Packing codes in dump inconsistent with DUMP_PACKim.
? Packing codes updated.
? Warning from processor: 0
? Warning number: 20
????????????????????????????????????????????????????????????????????????????????

control_loop: numPoll 4, poll returned 1
control_loop: received message on control fd 14
processControlMsg: received stderr message
stderr: <?xml version="1.0"?><methodCall><methodName>stderr</methodName><params><param><value><struct><member><name>msgtext</name><value><base64>WzExMjldIGV4Y2VwdGlvbnM6IEFuIGV4Y2VwdGlvbiB3YXMgcmFpc2VkOjE.
.
.
.
BjYWxsaW5nIHJlZ2lzdGVyZWQgaGFuZGxlciBAIDB4MjAwMTlkODAK
</base64></value></member><member><name>nid</name><value><int>1012</int></value></member><member><name>fd</name><value><int>0</int></value></member></struct></value></param></params></methodCall>
doStderr: received 840 bytes ALPSSTDIO_MSGTEXT
[1129] exceptions: An exception was raised:11 (Segmentation fault)
[1131] exceptions: An exception was raised:11 (Segmentation fault)
[1129] exceptions: the exception reports the extra information: Sent by the kernel.
[1131] exceptions: the exception reports the extra information: Sent by the kernel.
[1129] exceptions: whilst in a serial region
[1131] exceptions: whilst in a serial region
[1129] exceptions: Task had pid=68390 on host nid01012
[1131] exceptions: Task had pid=68392 on host nid01012
[1129] exceptions: Program is "/home/d04/lblunn/cylc-run/u-az169/share/fcm_make/build-atmos/bin/um-atmos.exe"
[1131] exceptions: Program is "/home/d04/lblunn/cylc-run/u-az169/share/fcm_make/build-atmos/bin/um-atmos.exe"
[1129] exceptions: calling registered handler @ 0x20019d80
[1131] exceptions: calling registered handler @ 0x20019d80
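
The base64 payload inside the XML stderr message can be decoded to confirm that it carries the same text as the plain lines that follow it. This is a minimal sketch: `decode_fragment` is a hypothetical helper, and because the payload above is elided in the middle, only its leading fragment is decoded here; nothing is reconstructed.

```python
import base64


def decode_fragment(fragment):
    """Decode a possibly truncated base64 fragment, replacing undecodable bytes."""
    text = fragment.strip().rstrip(".")
    if len(text) % 4 == 1:
        text = text[:-1]  # a lone trailing character carries no complete byte
    text += "=" * (-len(text) % 4)  # pad to a multiple of four characters
    return base64.b64decode(text).decode("utf-8", errors="replace")


# Start of the <base64> payload shown above, without the trailing elision dot.
head = "WzExMjldIGV4Y2VwdGlvbnM6IEFuIGV4Y2VwdGlvbiB3YXMgcmFpc2VkOjE"
print(decode_fragment(head))  # [1129] exceptions: An exception was raised:1
```

The decoded text matches the start of the `[1129] exceptions:` line in the log, so the XML wrapper adds no new diagnostic information here.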

Best wishes,

Lewis

comment:15 Changed 4 months ago by willie

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi Lewis,
OK. I'll have a look - I'm not very hopeful though.

Regards
Willie

comment:16 Changed 4 months ago by willie

  • Status changed from reopened to accepted

comment:17 Changed 7 weeks ago by willie

Just a catch-up on off-ticket activities to 15th August 2018:

Well I think I've got it working! This is my u-az946. The change here is to remove ":coretype=broadwell" from the suite.rc file everywhere. This suite is derived from my u-ay522 which contains a number of beneficial changes:

  • Use CreateBC 11.1 code with 10.5 STASH master
  • Removed unused STASH master files in app/um/file and inserted 10.5 STASH master

This allows the 500m_um_fcst task to complete. I didn't look at the output data.

So I think you should make these changes to your model and try a full tracer run.

I still don't know the cause of this problem.

Willie

comment:18 Changed 7 weeks ago by willie

Hi Willie,

My copy of your suite u-az946@85909 fails at 500 m when I turn on tracer lateral boundary conditions, and remove murk and aerosol stash request.

Fails at 500 m forecast time step 360 (one hour) when dumping dymeca_da001:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 300
Rank 2 [Tue Aug 28 12:05:25 2018] [c8-2c1s15n0] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 2
? Error from routine: UM_WRITDUMP
? Error message: Failure to gather field
? Error from processor: 11
? Error number: 22
????????????????????????????????????????????????????????????????????????????????
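
When comparing failures across runs, it can help to pull the labelled fields out of a banner like the one above, ignoring lines from other processes that get interleaved with it (such as the MPI_Abort line). The following is a rough sketch based only on the banner layout shown in this ticket; `parse_um_report` is a hypothetical name and not a UM utility. It reads only values that sit on the same line as their label, so a message that continues on following lines comes back empty.

```python
import re

# Matches lines such as "? Error code: 300" or "?  Warning from routine: X".
_FIELD = re.compile(
    r"^\?\s+(Error|Warning) (code|from routine|message|from processor|number):\s*(.*)$"
)


def parse_um_report(text):
    """Collect the labelled fields of an error/warning banner into a dict."""
    report = {}
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            kind, label, value = match.groups()
            report["kind"] = kind
            report[label] = value.strip()
    return report
```

For the banner above this gives the routine `UM_WRITDUMP`, code `300` and message `Failure to gather field`, which makes it easy to see that it is the same failure as in comment:10.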

I have previously had this error in ticket #2443, comment:10

Change set between u-ba463@88557 (the suite which failed) and u-az946@85909 (your suite I copied):

https://code.metoffice.gov.uk/trac/roses-u/changeset?old_path=%2Fb%2Fa%2F4%2F6%2F3&old=88557&new_path=%2Fa%2Fz%2F9%2F4%2F6&new=85909

I do not think I have made any mistakes in my changes. The bug seems very arbitrary in where it appears. I would not have thought turning things off should cause problems! My suite u-ba463 has run to 55m_forecast_00 with lateral boundary conditions, murk and aerosol stash request on.

Do you have any more ideas I can test?

Cheers,
Lewis


comment:19 Changed 7 weeks ago by willie

Hi Lewis,

These changes seem very reasonable. There is some evidence (see the UM11.0 release notes) that having huge pages on can conflict with the optimisation. So, try deleting

module load craype-hugepages8M

everywhere it appears in the suite.rc file. You'll need to do a rebuild. Then try a run.
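
If the line appears in many places, the edit can be scripted. The sketch below is one possible way to do it, under the assumption that each `module load craype-hugepages...` command sits on a line of its own in suite.rc; `strip_hugepages` is a hypothetical helper. Lines that load other modules in the same command are deliberately left untouched so they can be edited by hand.

```python
import re

# Matches a whole line that only loads a craype-hugepages module, e.g. craype-hugepages8M.
_HUGEPAGES = re.compile(
    r"^[ \t]*module load craype-hugepages\S*[ \t]*$\n?", re.MULTILINE
)


def strip_hugepages(suite_rc_text):
    """Return the text with every stand-alone `module load craype-hugepages*` line removed."""
    return _HUGEPAGES.sub("", suite_rc_text)


if __name__ == "__main__":
    # Illustrative fragment only; the real suite.rc will look different.
    example = (
        "    pre-script = \"\"\"\n"
        "        module load craype-hugepages8M\n"
        "        ulimit -s unlimited\n"
        "    \"\"\"\n"
    )
    print(strip_hugepages(example))
```

It is worth keeping a copy of the original file and checking the difference before rebuilding.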

Regards
Willie

Changed 2 weeks ago by willie

comment:20 Changed 2 weeks ago by willie

  • Resolution set to wontfix
  • Status changed from accepted to closed

Just a summary before I formally abandon this ticket.

In a final email I said,

The error 300 problem (failure to gather field) was occurring in the 500 m model, so you seem to have made some progress getting down to the 100m model. But it is the same error message that keeps occurring. I wrote to Lewis with the attached, outlining my progress, or lack of it. I have emailed Stuart Whitehouse regarding the CreateBC problem and got a helpful reply which has been implemented. I emailed Roger Milton about using a later version of the compiler to see if that would help and found that I was already using the latest version available. I have also switched on Dr Hook, but that made the model run 30x slower and has shed no further light. I could find no evidence of a lack of memory.

The problem occurs when it tries to save the first of the hourly dumps. The error message is not correct anyway. The fundamental reason is that the dump has an incorrect grid type and this is flagged as error 300. But even this is not the true error. Something is overwriting the dump just before it is to be written out - there is no reason to modify the grid type. I don't know what that something is. It might be incorrect compiler optimisation, although I did try a run with no optimisation, to no benefit. It could be a bug in the UM 10.6 code. I did try to use the Rose upgrade macros to move to 10.7 but this introduced new errors. If there is a way the science setup could be moved to UM 11.1, that might shed further light on the issue.

Willie
