Opened 5 years ago

Closed 5 years ago

#1393 closed help (fixed)

"Not a typewriter" error in reconfiguration for large domain UM run.

Reported by: dgrosv
Owned by: um_support
Component: UM Model
Keywords: LAM, reconfiguration, memory
Cc:
Platform: MONSooN
UM Version: 8.5

Description

Hi,

I'm trying to do a large domain UM run (2000x2000 points at 1 km resolution) on Monsoon. I have got this setup working for 1000x1000 km domains, but at the larger size I get the following error (xkndc000.xkndc.d14287.t160754.rcf.leave):

BUFFIN: Read Failed: Not a typewriter

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: io:buffin
? Error Code: 22
? Error Message: Error in buffin errorCode=0.00 len= 0/ 2048
? Error generated from processor: 0
? This run generated 1 warnings
????????????????????????????????????????????????????????????????????????????????

gc_abort (Processor 0): Job aborted from ereport.

Traceback:

Offset 0x00000010 in procedure xltrbk_
Offset 0x0000011c in procedure gc_abort_, near line 134 in file /home_proj_work/home/nwp/nm/frml/gcom4.5/meto_ibm_pwr7_mpp/preprocess/src/gcom/gc/gc_abort.F90
Offset 0x00000618 in procedure ereport_mod_NMOD_ereport64_, near line 102 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/control/misc/ereport_mod.f90
Offset 0x00000244 in procedure io_NMOD_io_ereport_, near line 488 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/io_services/model_api/io.f90
Offset 0x00000548 in procedure io_NMOD_buffin64_r_, near line 1782 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/io_services/model_api/io.f90
Offset 0x000003b0 in procedure rcf_read_multi_mod_NMOD_rcf_read_multi_, near line 147 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/utility/qxreconf/rcf_read_multi_mod.f90
Offset 0x00000528 in procedure rcf_readflds_, near line 239 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/utility/qxreconf/rcf_readflds.f90
Offset 0x000062e8 in procedure replanca_rcf_replanca_, near line 1084 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/utility/qxreconf/replanca-rcf_replanca.f90
Offset 0x0000316c in procedure rcf_ancil_atmos_mod_NMOD_rcf_ancil_atmos_, near line 906 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/utility/qxreconf/rcf_ancil_atmos_mod.f90
Offset 0x00000090 in procedure rcf_ancil_mod_NMOD_rcf_ancil_, near line 78 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/utility/qxreconf/rcf_ancil_mod.f90
Offset 0x000001b8 in procedure rcf_control_mod_NMOD_rcf_control_, near line 136 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/utility/qxreconf/rcf_control_mod.f90
Offset 0x000002a0 in procedure reconfigure, near line 85 in file /projects/asci/dgrosv/xjxkt/umrecon/ppsrc/UM/utility/qxreconf/reconfigure.f90
--- End of call chain ---

/projects/um1/vn8.5/ibm/scripts/qsrecon: Error in dump reconfiguration - see OUTPUT

This is in the nested suite and we are going straight from global to 1 km. There should not be any large terrain in the domain, but there is possibly a small island, which I'm told can also cause issues (I will double check this). We're pretty sure that a lack of memory is not the issue, since we have applied a handedit that spreads the load over more nodes - currently the max memory usage is only 8%.

Thanks for any help,

Daniel (Leeds).
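
A note on the message itself: "Not a typewriter" is the traditional system text for the errno value ENOTTY, which is normally set when a terminal-only operation is attempted on an ordinary file. Taken together with the "len= 0/ 2048" in the report, one plausible reading (an assumption, not something confirmed in this ticket) is that the read returned no data and the message reflects a stale errno rather than the real cause. The sketch below is Python, not UM code; describe_errno is a hypothetical helper that only shows how to look up the text a platform attaches to an errno value.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME (number): platform message' for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


# The message text is platform dependent: some Unix systems report
# "Not a typewriter", many Linux systems report
# "Inappropriate ioctl for device".
print(describe_errno(errno.ENOTTY))
```

If the text printed matches the one in a leave file, the failing call most likely reported whatever errno happened to hold, so the underlying problem has to be found elsewhere (here, in the field being read).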

Change History (13)

comment:1 Changed 5 years ago by grenville

Hi Daniel

Sorry for the slow reply - is this still an issue?

Grenville

comment:2 Changed 5 years ago by dgrosv

Hi Grenville,

No worries. Yes, still an issue I'm afraid. People at the Met Office are also looking into it, but I've not heard anything yet.

Thanks,

Dan.

comment:3 Changed 5 years ago by grenville

Dan

Please let us have read permission on your output directories

Thanks

Grenville

comment:4 Changed 5 years ago by dgrosv

Hi Grenville,

It looks like you should have permission to see the directory - typing ls -la shows "r" read permissions for all categories of user. I've done a chmod a+r to make sure. The problem may be that I deleted all my leave files, so the one I quoted above no longer exists. However, I re-ran the job to produce a new leave file, xknde000.xknde.d14300.t132743.rcf.leave - can you read this?

Or was it the output fields in /projects/asci/dgrosv/VOCALS4a_121_AH_2000km/ that you were referring to? These also look to be readable - let me know if they are not and whether there is something I can do to make them readable.

Thanks,

Dan.

comment:5 Changed 5 years ago by grenville

Dan

Thanks - my fault for looking on the wrong machine.

Can you try telling the model not to reconfigure the ozone, i.e. click "Not Used" instead of "Configured"?

Grenville

comment:6 Changed 5 years ago by dgrosv

Hi Grenville,

Ok, I've done that and it got rid of the "not a typewriter" error. However, it now seems to run out of time for the RCF job. I tried increasing this to 3600s in the UMUI (User information and submit method > job submission method > loadlev), saving and processing, but it still seems to be set to 1500s according to the leave file (xknde000.xknde.d14301.t191911.rcf.leave). Is there somewhere else that I need to set this?

Thanks,

Dan.

comment:7 Changed 5 years ago by grenville

Dan

Please look in model selection → reconfiguration → general reconfiguration options, and let us know if that works. It's a little troubling that your ozone ancillary is causing problems. If you are doing short runs (i.e. not climate), using the ozone from the global dump is probably OK.

Grenville

comment:8 Changed 5 years ago by dgrosv

Hi Grenville,

A quick update - I have been increasing the requested wall clock time, but I am still running out (last try was 7200s).

For these runs I am applying a handedit written by Stewart Webster - it looks like this increases the number of tasks to 128 (32 nodes) in order to not run out of memory. I will try not applying this and increasing the number of processors instead; before using the handedit it was set at 4 x 8 (east-west x north-south).

I will change this in User Information > Job submission, and also in Reconfiguration > General reconfig. Is that everything that needs to be changed to use more processors?

Thanks,

Dan.

comment:9 Changed 5 years ago by dgrosv

Hi Grenville,

I managed to get the RCF to work by removing Stewart's large domain hand edit and increasing the number of processors to 8x8. This took 8394 seconds to run and reached 85% of the memory limit - so pretty close to the limits there!

It went on to start the FCAST run - this ran out of wall clock time before it got to the dump time of 3 hours (it reached 2.5 hrs), but I know how to change this. Hopefully this run should work once that is done.

Thanks for the help on this. I suppose that it might be good to get to the bottom of why the ozone error occurred. Also, for even bigger runs I would anticipate further problems with the RCF nest. Is it possible to do dumping and resubmitting for RCF to avoid reaching the maximum allowed job time of 3 hours? Increasing the number of processors further may help up to a point, but I imagine that this would not work indefinitely.

Thanks again,

Dan.
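
For a rough sense of what the two decompositions mentioned above mean per task, the sketch below divides the 2000x2000 grid by the 4x8 and 8x8 (east-west x north-south) layouts. It is plain Python arithmetic, not UM code: subdomain_points is a hypothetical helper, and it ignores halos and any uneven splitting the UM may actually use.

```python
import math


def subdomain_points(nx, ny, n_ew, n_ns):
    """Largest per-task subdomain (columns, rows) for an n_ew x n_ns layout."""
    return math.ceil(nx / n_ew), math.ceil(ny / n_ns)


# Compare the two layouts quoted in the ticket for the 2000x2000 grid.
for n_ew, n_ns in [(4, 8), (8, 8)]:
    cols, rows = subdomain_points(2000, 2000, n_ew, n_ns)
    print(f"{n_ew}x{n_ns}: {n_ew * n_ns} tasks, "
          f"{cols} x {rows} = {cols * rows} points per task")
```

Going from 4x8 to 8x8 halves the points each task holds (125000 to 62500), but this says nothing about any fields that are gathered onto a single task, so it is only a first-order guide to memory use.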

comment:10 Changed 5 years ago by grenville

Dan

Glad you got it working - hand edits are frequently the last place we look as the source of errors; good catch.

I don't follow your query about "dumping and resubmitting for RCF" - once you have a start file (resulting from the reconfiguration process), there should be no need to perform a reconfiguration again (until you need another start file).

Increasing the number of processors for RCF will only get you so far before doing so makes it run slower.

Grenville

comment:11 Changed 5 years ago by dgrosv

Hi Grenville,

For the dumping and resubmitting I was referring to the RCF process itself - presumably if I wanted to increase the size of the domain further still then the wall-clock time taken by the RCF process would start to approach the 3 hour hard limit for Monsoon jobs (since it got close for the job here). Increasing the number of processors might allow it to finish in time, but as you say, this will only work up to a limit. I was wondering whether the RCF process itself can be broken down into stages (using dumps and then reloading) to split the task over time (as is done for the FCAST runs)?

Cheers,

Dan.
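
To put a number on how close the job got: the 8x8 reconfiguration reported above took 8394 s against the 3 hour (10800 s) job limit. A small Python sketch of that arithmetic follows; wallclock_headroom is a hypothetical helper, not part of the UM or the UMUI.

```python
def wallclock_headroom(elapsed_s, limit_s=3 * 3600):
    """Return (fraction of the limit used, seconds remaining)."""
    return elapsed_s / limit_s, limit_s - elapsed_s


used, remaining = wallclock_headroom(8394)
print(f"{used:.0%} of the limit used, {remaining} s remaining")
```

That leaves 2406 s of margin, so a domain whose reconfiguration takes only about 30% longer on the same layout would already exceed the limit.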

comment:12 Changed 5 years ago by grenville

Dan

Ahh - no, the reconfiguration can't be done in steps like a CRUN.

Grenville

comment:13 Changed 5 years ago by annette

  • Keywords LAM, reconfiguration, memory added
  • Platform set to MONSooN
  • Resolution set to fixed
  • Status changed from new to closed
  • UM Version changed from <select version> to 8.5