Opened 9 years ago

Closed 9 years ago

#639 closed help (fixed)

Error: no interpolation: but data field size/levels are different

Reported by: a.elvidge
Owned by: um_support
Component: UM Model
Keywords: tiles, reconfiguration
Cc:
Platform:
UM Version: 7.6

Description

Hi,

Job xfxkj. This is a 4km LAM job using a UM start dump. I am getting the same error as reported in ticket #252. The solution to this was to include the following hand edit:

/home/jeff/umui_jobs/hand_edits/RECONA_add_srce_4.txt

However, this hand edit no longer exists.

Cheers, Andy

Change History (22)

comment:1 Changed 9 years ago by willie

Hi Andy,

I have put a copy in ~willie/hand_edits on PUMA.

Regards

Willie

comment:2 Changed 9 years ago by a.elvidge

Hi Willie,

Thanks, I have tried running with the hand edit included, but unfortunately I am still getting the same error.

Andy

comment:3 Changed 9 years ago by willie

  • Keywords tiles, reconfiguration added

Hi Andy,

This is a different error from the others, so you don't need the RECONA_ hand edit. (The problem is that STASH 490 is only on one tile (= vertical level), whereas the reconfiguration is expecting 9 tiles.) You do need the user STASHmaster ~willie/NAE_PS24/st_rt73, which effectively ignores STASH 490.

comment:4 Changed 9 years ago by a.elvidge

Thanks, but what is STASH 490? It isn't anything that I have set as a diagnostic as far as I can see (since I am not outputting anything on tiles).

Andy

comment:5 Changed 9 years ago by willie

Andy, it is "decoupled screen temp on tiles" and it is present in the raw Met Office dumps. So it will propagate from global to NAE to 4km unless you take steps to eliminate it.

You can always find out what name a particular STASH code corresponds to by typing

fgrep " 490 " $UMDIR/vn7.6/ctldata/STASHmaster/STASHmaster_A

and replacing the 7.6 with the latest installed version if necessary.
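
To go the other way - from a field name to its code - a plain case-insensitive search on a name fragment should work; for example (the search string here is just illustrative):

grep -i "screen temp" $UMDIR/vn7.6/ctldata/STASHmaster/STASHmaster_A

which will pick out the record lines containing that text.
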
Regards,

Willie

comment:6 Changed 9 years ago by a.elvidge

Willie,

The job now fails with:

_pmii_daemon(SIGCHLD): [NID 00990] PE 29 exit signal Segmentation fault
[NID 00990] 2011-06-09 16:56:44 Apid 777051: initiated application termination
diff: /work/n02/n02/aelvidge/tmp/tmp.hector-xe6-13.382/xfxkj.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/aelvidge/xfxkj/xfxkj.thist to backup thist file /work/n02/n02/aelvidge/xfxkj/xfxkj.thist_keep
xfxkj: Run failed

I can't see much more detail in the .leave file about this error. Looking back at a previous ticket concerning this error, it looks as though it's not a particularly straightforward one. This same job has run fine before with a different start dump and different LBCs (previously I was using a start dump and LBCs from a job run by Mark Weeks at the Met Office, whereas this time I sourced the UM start dump from yourself and ran my own global job to produce the LBCs). The only other things I have changed in the job are:

Using your user STASHmaster file ~willie/NAE_PS24/st_rt73
No longer configuring sst and sea ice ancils

Cheers, Andy

comment:7 Changed 9 years ago by grenville

Andy

There's no clue in the output - try switching on Extra diagnostic messages in Input/Output Control and Resources → Output Choices. Change the run time to a few minutes and the required computer time to 1200s - that way you can avoid queueing too much.

Grenville

comment:8 Changed 9 years ago by willie

Hi Andy,

Could you also switch off packing in Post processing > initialisation etc.? The individual profiles are unpacked, but the meaning should be too.

regards,

Willie

comment:9 Changed 9 years ago by a.elvidge

Hi Willie and Grenville,

I have tried both your suggestions. The new error reads:

_pmii_daemon(SIGCHLD): [NID 01539] PE 29 exit signal Segmentation fault
[0] ERROR - nem_gni_error_handler(): a transaction error was detected,error category 0x4 error code 0xb2e
Rank 0 [Mon Jun 13 09:45:44 2011] [c3-0c0s1n0] GNI transaction error detected
[NID 01539] 2011-06-13 09:45:44 Apid 789751: initiated application termination
diff: /work/n02/n02/aelvidge/tmp/tmp.hector-xe6-14.31556/xfxkj.xhist: No such file or directory

Cheers, Andy

comment:10 Changed 9 years ago by a.elvidge

I should say, 'the error now reads'

comment:11 Changed 9 years ago by willie

Hi Andy,

You need to use huge pages. This involves three steps,

  1. Use the huge page branch VN7.6_ncas_hugepage - this replaces the current NCAS branch;
  2. Use the machine override ~willie/hugepage_override;
  3. Input/Output scripts: add the variable HUGETLB_MORECORE and set it to yes.

You will then need to rebuild your executables.
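
For reference, step 3 simply ensures the variable is exported in the job's environment before the model executable is launched, i.e. the script insert amounts to something like

export HUGETLB_MORECORE=yes

(the exact placement in the submitted script is handled for you).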

Regards,

Willie

comment:12 Changed 9 years ago by a.elvidge

Hi Willie,

I have done as you said, and the job is now running. What exactly is a huge page?

Cheers, Andy

comment:13 Changed 9 years ago by a.elvidge

Hang on a sec, the job didn't submit properly…
BASE extract failed

Let me check that I've made the right edits (I'm not sure about the first one):

1) FCM options for atmos and reconf - changed fcm:um_br/pkg/Config/VN7.6_ncas/src to fcm:um_br/pkg/Config/VN7.6_ncas_hugepage/src

2) Script inserts and mods - Variable name: HUGETLB_MORECORE, value: yes

3) UM user override files - Machine overrides: include ~willie/hugepage_override

Cheers, Andy

comment:14 Changed 9 years ago by willie

My fault: it should be fcm:um-br/dev/willie/VN7.6_ncas_hugepage/src
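
If the extract still fails, a quick sanity check (assuming the fcm command and the um-br keyword are set up on PUMA) is

fcm ls fcm:um-br/dev/willie/VN7.6_ncas_hugepage/src

which should list the branch contents if the path is right.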

Willie

comment:15 Changed 9 years ago by willie

Andy,

Huge pages are a way of handling virtual memory in a shared memory system. To quote Cray,

''Cray XT systems support 2 MB huge pages and 4 KB base pages for CNL
applications. Previous versions of CNL supported only base pages. For
applications that use a large amount of virtual memory, 4 KB pages can put
a heavy load on the virtual memory subsystem. Huge pages can provide a
significant performance increase for such applications.''

Further information can be found in Cray XT™ Programming Environment User's Guide S–2396–21, available from the HECToR User Site.

Regards,

Willie

comment:16 Changed 9 years ago by a.elvidge

Hi Willie,

The job now runs and is currently processing (9 hours into the 2-day run). However, upon viewing the data in xconv I've found that the model has gone wrong… the data fields are all NaN except in the margins (boundary conditions from the global job). I am struggling to understand why this would be the case, since practically the same 4km LAM job has been run many times previously (by the Met Office operationally for our field campaign) using a 25km global UM start dump to force it. All I have done is copy the 4km job and run (successfully) my own 25km global job for the boundary conditions. Any suggestions / wisdom very welcome!

Cheers, Andy

comment:17 Changed 9 years ago by willie

Hi Andy,

The model becomes unstable at time step 7 ("RHS zero so GCR(2) not needed"). Normally the sequence is to do a global run (25km), then an NAE run (12km) and then a 4km run, so perhaps the step straight from global to 4km is too big?
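
To put rough numbers on that jump: 25 km to 4 km is about a 6:1 refinement in one step, whereas the usual chain is roughly 2:1 (25 km to 12 km) followed by 3:1 (12 km to 4 km), so the single step is about twice the refinement of either stage of the normal sequence.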

Sometimes this problem can be solved by reducing the time step.

Regards,

Willie

comment:18 Changed 9 years ago by a.elvidge

Hi Willie,

To attempt to get the job running without falling over I have tried the following:
1) Running with a 10 sec timestep rather than 15 sec.
2) Nesting a 12km job between the global and 4km, and using daily rather than climatological SSTs and sea ice (OSTIA). Timestep 15 secs.

But the model still becomes unstable within the first few timesteps. Have you any more suggestions?

The job is for a domain over the Antarctic Peninsula. I copied it directly from one which had run successfully (without once falling over) operationally (twice daily) for 5 weeks during a field campaign. It ran directly from global to 4km. The new case I am running it for is outside the duration of the field campaign and is suspected to involve strong winds, but then we had some very strong winds during the field campaign too without the model complaining. I am hoping that it is something I am doing wrong or neglecting, rather than it being related to the UM setup. Have you any suggestions?

Thanks, Andy

comment:19 Changed 9 years ago by willie

Hi Andy,

You could revert to one of the field campaign start dumps to check it is still working. If it is the wind speed (it reaches 24 m/s, more than a gale, over the mountains), then there are two options. One is to reduce the time step even further, to 5 seconds say. The other is more speculative: you could try adjusting the time weight coefficients in the primary advection scheme (Atmos > Scientific sections). This is mentioned in the User Guide, Chapter 4, p. 90. Perhaps alpha1 and alpha3 could be increased to 0.75, say?
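
As a back-of-envelope guide to what the shorter step buys you (taking horizontal advection at the quoted wind speed as the measure, which may well not be the limiting factor over steep orography):

C = u * dt / dx; with u = 24 m/s and dx = 4 km, dt = 15 s gives C of about 0.09 and dt = 5 s gives about 0.03.

The semi-Lagrangian scheme is not formally restricted to C < 1, but a shorter step moves the departure points less far each step, which generally helps when the flow over the mountains is strong.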

Regards,

Willie

comment:20 Changed 9 years ago by a.elvidge

Hi Willie,

After doing some UMUI investigation into how the operational runs were done, I have discovered the problem. The global start dumps were reconfigured in a separate job before being used in the main run. I tried this with my new start dump and it works - the job no longer becomes unstable. I do not, however, understand this, because my main run has the reconfiguration switched on, so as far as I can see the solution is to reconfigure the start dump twice…? Surely not.

Thanks, Andy

comment:21 Changed 9 years ago by willie

Hi Andy,

Glad you solved the problem. Remember that the reconfiguration inherits its instructions from the model setup, so if you've changed any ancillary files there or made other changes, it will be different.

Regards,

Willie

comment:22 Changed 9 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed