Opened 5 years ago

Closed 4 years ago

#1535 closed help (answered)

Global forecast job template

Reported by: dsergeev Owned by: um_support
Component: UM Model Keywords: global weather forecast, template
Cc: Platform: ARCHER
UM Version: 8.2

Description

Hi,

Could you please recommend an existing job in global weather forecast mode? It needs to be at N512 resolution and for any model version higher than 8.2.

I'm trying to run the global UM and get LBCs for a regional version (over the Norwegian Sea). The regional run settings come from experiments done by Mark Weeks at the Met Office, and I have imported the job (xlce-a).

I have tried to use the xjlej job (xlcec in my experiment), which is an N512 GA6 Endgame job, but I'm struggling with the settings for the ancillary files. I noticed that it is a climate job, so maybe there is another job that is not a "climate" job and requires fewer additional ancillary files as input?

This is a new ticket continuing the discussion here: http://cms.ncas.ac.uk/ticket/1502
I created a new ticket since that one started on a different topic.

Regards,
Denis

Change History (35)

comment:1 Changed 5 years ago by dsergeev

Oh um_support, why hast thou forsaken me?…

comment:2 Changed 5 years ago by grenville

Denis

Sorry for the delay - you need to create user STASH records for the missing items to tell the UM how to reserve space and initialize the missing data.

There are lots of examples of these kinds of STASH records - please look at /home/grenville/USERSTASH/115_ignore - you will need one of these records for each missing item.

The file should be added to the table in Atmosphere → STASH → User-STASHmaster files…

Please give us permission to read your files on /home and /work (when it revives)

It's well worth looking at UMDP C4 (Storage Handling and Diagnostic System) to better understand the STASH record.
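As a minimal sketch (the destination path below is hypothetical - use whatever directory your UMUI jobs read user files from), you can start from the example records and edit the section/item numbers to match the missing fields:

cat /home/grenville/USERSTASH/115_ignore
cp /home/grenville/USERSTASH/115_ignore ~/userstash/missing_items

Then add the copied file to the User-STASHmaster table mentioned above.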

Where did your job come from?

Grenville

comment:3 Changed 5 years ago by dsergeev

Grenville,

Thanks for the reply. The N512 GA6 Endgame job is a copy of xjlej, as you recommended in the ticket 1502.

However, at the moment I'm trying to run another job that seems more relevant to what I need. I'm trying to get the experiment xljk working, which is a copy of xkfr - "Reduced Nesting Suite". Specifically, I'm trying to set job "a" to run (it's an N216L70 RCF, cycle 1). I copied the "um_nesting" directory to my folder, as it is required in several sections, and the job submits OK, but it crashes during the UMATMOS build because it can't find the "eg_idl_set_init_mod" file. I suspect this is because I switched off the top and bottom script inserts (I didn't have the environment variable $TOP_BOT_DIR set and couldn't find out how to define it).

So maybe it would be even easier for you to help me get this job running?

As for my directories on PUMA/ARCHER, they should be available to you to read.

Regards,
Denis

comment:4 Changed 5 years ago by dsergeev

Hi again,

Have you had a chance to look at the Nesting Suite job on my account? ARCHER seems to be available again, and I'm really keen to run the UM as soon as possible.

Thanks,
Denis

comment:5 Changed 5 years ago by grenville

Denis

We still can't read your files

Please do this:

chmod -R g+rX /home/n02/n02/<your-username>
chmod -R g+rX /work/n02/n02/<your-username>

I wouldn't recommend the nesting suite - the problem you had earlier with missing fields in the dump is generic - it arises because the start dump you are reconfiguring isn't quite what the model needs. That issue may arise again.

Please try reconfiguring the dump with the amended STASH.

Grenville

comment:6 Changed 5 years ago by dsergeev

Grenville,

I changed the permissions, so everything should be accessible now. Sorry for being so slow, but can you please explain to me how to find out which field is missing in a start dump? I tried to run Willie McGuinty's utilities (http://cms.ncas.ac.uk/ticket/1538) for checking the start dump, but they do not give me any complaints.

Thank you,
Denis

comment:7 Changed 4 years ago by grenville

Denis

Since Willie's utilities didn't work, I can only suggest trying to run the reconfiguration - it will fail and tell you what's missing. Then add the appropriate user STASH file.

What was the job id of your N512 job?

Grenville
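(As a practical note, assuming the leave files land in ~/output as they do later in this ticket, the relevant message can be pulled out of a long .rcf.leave file with a simple grep:

grep "Required field is not in input dump" ~/output/*.rcf.leave
)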

comment:8 Changed 4 years ago by dsergeev

Grenville,

The job id is xlcec.

Now the job fails because field 151 is missing:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: Rcf_Set_Data_Source
? Error Code:    30   
? Error Message: Section   0 Item   151 : Required field is not in input dump!
? Error generated from processor:     0    
? This run generated   1 warnings
????????????????????????????????????????????????????????????????????????????????

How do I find the corresponding variable?

Denis
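(For reference, one way to map a section/item pair to a field name is to search the STASHmaster file - the path below assumes the standard vn8.2 install location under $UMDIR on ARCHER and may differ:

grep "^1|" $UMDIR/vn8.2/ctldata/STASHmaster/STASHmaster_A | grep -w 151

Each "1|" line holds the model, section, item and field name, so this is a rough filter for item 151.)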

comment:9 Changed 4 years ago by dsergeev

Update:

I found which field corresponds to number 151 - it's related to river routing. So I switched that off (since it's not important for my runs).

But now reconfiguration fails due to another field missing in the start dump:
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: Rcf_Set_Data_Source
? Error Code: 30
? Error Message: Section 0 Item 505 : Required field is not in input dump!
? Error generated from processor: 0
? This run generated 1 warnings
????????????????????????????????????????????????????????????????????????????????

which is "Land fraction in grid box". I tried to find a switch for that in Scientific Parameters and ancillaries, but couldn't. Where do I need to look?

You can also look at my .rcf.leave file, as it has a lot of messages related to the error above, including the land-sea mask. Path on ARCHER:
/home/n02/n02/dsergeev/output/xlcec000.xlcec.d15136.t135638.rcf.leave

Regards,
Denis

Last edited 4 years ago by dsergeev

comment:10 Changed 4 years ago by dsergeev

Could you help me on this, please?

comment:11 Changed 4 years ago by grenville

Denis

We'll take the job and try to get it working - I see that you have switched off reconfiguration of the ancillary land fraction file - was that intentional?

Grenville

comment:12 Changed 4 years ago by dsergeev

Grenville

Thank you.
Switching off reconfiguration of the ancillary land fraction file was semi-intentional, I would say. I have been fiddling with these options, trying to avoid the error with missing fields. At this stage, any configuration that you can make work will be fine for me.

Denis

comment:13 Changed 4 years ago by grenville

Denis

I have a model which no longer complains about missing fields, but the model has convergence problems - do you have other initial data that we can try, i.e. any other start files?

Grenville

comment:14 Changed 4 years ago by dsergeev

Grenville,

The only file I have is the start dump 20130326_qwqg00.T+0.

Thank you,
Denis

comment:15 Changed 4 years ago by grenville

Denis

Please look at xjjzb - it is an N512 UM 8.6 job which runs with your start file. I haven't run it for very long (just 20 mins in the short queue), but it appears to work OK.

Grenville

comment:16 Changed 4 years ago by dsergeev

Grenville,

Thanks for setting it up. I copied the job as xlced in my experiment and looked through it in the UMUI.

Firstly, should I switch ON the compilation of the model and reconfiguration executables?

Secondly, I noticed that the section order in version 8.6 has changed slightly, and I can't find where to set LBC output for the limited-area experiments: there is no section in Control→Output data files (LBC's etc)→LBC's out. I previously had output there for the Norwegian Sea area, switched off.

Regards,
Denis

comment:17 Changed 4 years ago by dsergeev

Hi Grenville,

Could you please answer the previous question? I turned on compilation and also OpenMP (because an OMP-related error appeared).

Now the reconfiguration exits without errors, but I got an error during run:

apsched: claim exceeds reservation's CPUs

Could you tell me what is causing this error?

Denis

comment:18 Changed 4 years ago by grenville

Denis

I regret the delay in responding to your messages, but we are dealing with many user queries.

Yes, you need to rebuild the model.

UM version 8.6 uses a different method for creating LBCs - you will need to run makebc on STASH output produced via a STASH macro. Please see the UM documentation on our wiki.

Do you have a limited area model set up for the Norway runs?

I feel that this is not progressing as smoothly as it could and unless you have some idea of the overall scheme of the computation, we will potentially keep hitting problems.

When you changed the OMP settings you didn't change the number of cores per node. This needs to be 12 when running with 2 OMP threads - what was the problem with OMP?

Grenville
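(To illustrate the arithmetic - an ARCHER node has 24 cores, so MPI tasks per node multiplied by OMP threads per task must not exceed 24. A sketch of the launch line, with hypothetical task counts and executable name; the UMUI normally generates this for you:

export OMP_NUM_THREADS=2
aprun -n 288 -N 12 -d 2 ./um_main.exe

Here -n is the total number of MPI tasks, -N the tasks per node and -d the number of threads per task, giving 12 x 2 = 24 cores per node.)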

comment:19 Changed 4 years ago by dsergeev

Grenville,

Actually, I just switched off OpenMP to try to reproduce the error, but now it has compiled and run successfully. So I guess we can forget about the OpenMP issue for now.

Concerning the limited-area run: I have imported the job with the settings for an operational run (xlce-a), which was used by M. Weeks at the Met Office. Could you have a look at it and tell me whether I can use it with the current global model output?

What do you mean by the overall scheme of computation?

Thank you for your time,
Denis

comment:20 Changed 4 years ago by dsergeev

Hi Grenville,

Sorry to remind you again, but could you answer my question above? Have you looked at my job (xlce-a)?

Regards,
Denis

comment:21 Changed 4 years ago by grenville

Denis

xlcea - that's the limited-area model at UM 8.2 - I'd try to get LBCs from an 8.2 global model. We have N216 and N512 models at UM 8.2 which may help - I shall have a look.

Where does M. Weeks get LBCs from?

Grenville

comment:22 Changed 4 years ago by dsergeev

Grenville,

Does that mean 8.6 and 8.2 are incompatible because of the different dynamical core?

I guess he used some operational global dumps, but they were not generated for the date I need (26/3/13 00Z). That's why I wanted to have my own global runs first.
A similar LAM job at 8.6, for example, would also be fine for me if it is easier to get running.

Thank you,
Denis

comment:23 Changed 4 years ago by dsergeev

Also, I tried to run the global xlced experiment, and got a write error (in the .leave file):

DUMPCTL: Opening new file /work/n02/n02/dsergeev/xlced/xlceda.da20130326_12 on unit  22

OPEN:  File /work/n02/n02/dsergeev/xlced/xlceda.da20130326_12 to be Opened on Unit 22 does not Exist
OPEN:  File /work/n02/n02/dsergeev/xlced/xlceda.da20130326_12 Created on Unit 22
IO: Open: /work/n02/n02/dsergeev/xlced/xlceda.da20130326_12 on unit  22

 WRITING UNIFIED MODEL DUMP ON UNIT 22
 #####################################

BUFFOUT: C I/O Error - Return code = 1
 WRITEDUMP: Error in call to BUFFOUT
 Field :  2041 
 LEN_IO :  318976
 IOSTAT :  1.

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR        ???!!!???!!!???!!!???!!!???!!!???!!!
?  Error   Code:   400  
?  Error   Message: Failure writing out field
?  Error   from processor:     0    
?  Error   number:     7    
????????????????????????????????????????????????????????????????????????????????

Is it because I exceeded the disk quota?
My $WORKDIR contains the dumps xlceda.da20130326_06 (18G) and xlceda.da20130326_12 (11G). EDIT: files moved to /nerc/n02/n02/dsergeev/xlced to free some space…

Regards,
Denis

Last edited 4 years ago by dsergeev

comment:24 Changed 4 years ago by grenville

Denis

I have tested your start file with an 8.2 N512 job. The reconfiguration job is xjjzy - it uses an executable which includes support for grib2 - that's not necessary for your start file - you probably won't need to reconfigure again.

The UM job is xjjzw. It's set up to run in the short queue - if you use it, you might need to experiment with the number of processors it uses. It's running at about 40 mins/model day.

You should be able to get lbcs from this job to drive your LAM.

You appear to have ~60GB of free space on /work (you can check this in SAFE).

Grenville
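(If you prefer to check from the command line, a rough sketch - exact quota arrangements on ARCHER may differ, and SAFE remains the authoritative source:

du -sh /work/n02/n02/dsergeev
lfs quota -g n02 /work

The first shows the space used by your own directory; the second queries the Lustre group quota, if group quotas are enabled.)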

comment:25 Changed 4 years ago by dsergeev

Grenville,

Thanks a lot for configuring it.

I ran the reconfiguration job; it seemed to finish successfully and generated an .astart file. Then I launched the actual model run, first with compilation of the model executable and then without, since the executable was already in my $WORKDIR.

I set up the necessary LBC output (channel 4, Norwegian Sea, 4 km grid step) and tried to run the model in the standard queue. It ran successfully, producing the LBC files every hour. The problem is that it stops at around time step 200 (33-34 hours of model time), but I need it to run for at least 2 days.

I can't understand why the job ends (the time limit is not exceeded), and I didn't find anything similar in old tickets. Could you please have a look at the .leave file in my folder on ARCHER? The name is

xlcew000.xlcew.d15161.t003206.leave

Regards,
Denis

comment:26 Changed 4 years ago by grenville

Denis

The run has failed because of an out-of-memory (OOM) error (the leave file does indicate this).

We have seen this error in some long-running 8.2 jobs (and others) built with the cce8.2.1 compiler. We are currently testing the cce8.3.7 compiler, which seems not to have this problem.

You could break the run into a series of shorter sections - or you could try using an executable just created with cce8.3.7 - configure your job to use /work/n02/n02/grenvill/um/xjjzv/bin/xjjzv.exec.

Grenville

comment:27 Changed 4 years ago by dsergeev

Grenville,

Thanks.
I changed the run path to your executable and launched the model, but got this error on the first time step:

 An error occured inside the MPI library during an operation
 on the communicator: model
 MPI_COMMUNICATOR= -1006632957 MPI_ERROR_CODE= 537511169  aborting...

The output log is in xlcew000.xlcew.d15162.t214454.leave

What did I do wrong?

Denis

comment:28 Changed 4 years ago by grenville

Denis

Please include

MPICH_NO_BUFFER_ALIAS_CHECK=1

in Input/Output Control.. → Script Inserts and Modifications and run again.

Grenville
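(As a sketch of the script insert itself - the variable is exactly as given above, only the export form is an assumption about how your inserts are written:

export MPICH_NO_BUFFER_ALIAS_CHECK=1

This disables the MPICH buffer-aliasing check, which appears to be what is aborting the run.)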

comment:29 Changed 4 years ago by dsergeev

Grenville,

Thanks, now the global model runs for 48 hours.

As I said above, the next step is to run a limited-area model at 4km resolution (xlcea). I have the LBCs now (/work/n02/n02/dsergeev/xlcew/xlcewa_cb*) and I put the path to them in the "Lateral Boundary Conditions" section. However, I have some questions:

1) I'm not sure which file I have to use as a start dump. Does it have to be the same global start dump I used for the global run? Should I run a reconfiguration first?
2) In the sections with environment variables there are several paths defined, e.g. to ancillary files that, I assume, are used in operational runs at the Met Office. Are there similar directories on ARCHER? If not, what should I do? The same problem applies to a few user hand-edit files.

Could you please have a look at this xlcea job? For now it's not critical for me to use exactly that job, so I'll accept any modifications that make it run.

Best regards,
Denis

comment:30 Changed 4 years ago by grenville

Denis

You will need to configure the job to build on ARCHER - you can probably use the UM Training job as a template since it is an 8.2 model - have a look at the FCM settings in that job (don't copy science settings since the UM training job was set up for a region over N Africa with dust).

You will need to reconfigure the global start dump for the LAM run.

Best if you can get the ancillary files from your colleague at the MO - it looks like you only need soil and vegetation files, otherwise you will need to generate your own.

Grenville

comment:31 Changed 4 years ago by dsergeev

Grenville,

I've contacted Mark Weeks at the MO; he sent me the ancillary files and helped me set the model up. As you recommended, I copied the FCM settings from the UM Training job, and it builds successfully.

Now reconfiguration fails due to a file-not-found error:

OPEN:  File /home/n02/n02/dsergeev/LAM_ancils/qrparm.mask to be Opened on Unit 12 does not Exist
OPEN:  **WARNING: FILE NOT FOUND
OPEN:  Ignored Request to Open File /home/n02/n02/dsergeev/LAM_ancils/qrparm.mask for Reading
 ****************** IO Error Report ***********************************
Unit Generating error=   12  
*** File states can be reported by setting diagnostic output levels **

????????????????????????????????????????????????????????????????????????????????
??????????????????????????????????? WARNING ????????????????????????????????????
? Warning in routine: mppio:file_open
? Warning Code:   -12 
? Warning Message: An error occured opening a file
? Warning generated from processor:     0   
????????????????????????????????????????????????????????????????????????????????

But the file definitely exists in that directory.

Could you help me please?

Regards,
Denis

comment:32 Changed 4 years ago by grenville

Denis

These files need to be on the /work file system. Parallel jobs don't have access to /home.

Grenville
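(For example - the destination path is just an illustration, any directory under your /work space will do:

cp -r /home/n02/n02/dsergeev/LAM_ancils /work/n02/n02/dsergeev/LAM_ancils

and then point the ancillary paths in the UMUI at the /work copy.)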

comment:33 Changed 4 years ago by dsergeev

Grenville,

Thank you. The next error, related to the LBCs, appears when the run starts:

OPEN:  File /work/n02/n02/dsergeev/xlcew//xlcewa_cb0000 to be Opened on Unit 125 Exists
MPPIO: Open: /work/n02/n02/dsergeev/xlcew//xlcewa_cb0000 on unit 125     
MPPIO: from environment variable ALABCIN1
  INBOUNDA; not enough space for LBC lookup headers.
            try increasing value specified in umui 
            window atmos_Infile_Options_Headers

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: inbounda
? Error Code:     2    
? Error Message:  INBOUNDA: Insufficient space for Lookup Table
? Error generated from processor:     0    
? This run generated  11 warnings
????????????????????????????????????????????????????????????????????????????????

I increased the number of LBC headers from 100 to 300, but got another error:

 LBC Integer Header Mismatch:
 ROW_LENGTH from INTHD:  1024 
 Model ROW_LENGTH:  318  

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: inbounda
? Error Code:     3    
? Error Message:  Integer header (ROW_LENGTH) mismatch
? Error generated from processor:     0    
? This run generated  11 warnings
????????????????????????????????????????????????????????????????????????????????

The hourly LBCs are specified in the files
/work/n02/n02/dsergeev/xlcew/xlcewa_cb0000,
/work/n02/n02/dsergeev/xlcew/xlcewa_cb0100,
…,
/work/n02/n02/dsergeev/xlcew/xlcewa_cb4800.

What am I doing wrong?

Denis

comment:34 Changed 4 years ago by grenville

Denis

Files like xlcewa_cb0000 don't contain LBCs. The LBCs are in xlcew.alabcou4.

You may need to increase the amount of space the model reserves for LBC headers in atmosphere→ancillary and input..→In file related..→Header record sizes.

Grenville
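(Before re-running, it may be worth checking which boundary files the global job actually produced - a sketch, assuming the output directory used earlier in this ticket:

ls -lh /work/n02/n02/dsergeev/xlcew/*alabcou*

and then point the LBC input (ALABCIN1) in the LAM job at that file rather than at the xlcewa_cb* files.)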

comment:35 Changed 4 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

Denis

I am closing this ticket - it's getting too long and mixes too many issues. Please open another if you have further problems.

Grenville
