Opened 10 years ago
Closed 10 years ago
#661 closed help (fixed)
reconfiguration from ecmwf grib dump fails
| Reported by: | eartmt | Owned by: | willie |
|---|---|---|---|
| Component: | UM Model | Keywords: | |
| Cc: | | Platform: | |
| UM Version: | 6.6.3 | | |
Description
I am having trouble starting from an ECMWF dump. My job xgbh.e fails at reconfiguration. The leave file from my latest attempt is in:
~eartmt/um/umui_out/xgbhe000.xgbhe.d11209.t120033.leave
Here is the relevant error message; I don't know what it means:
C I/O Error: failed in BUFFIN8
Return code = 1
Thanks in advance for any help,
Tomek
Change History (13)
comment:1 follow-up: ↓ 2 Changed 10 years ago by jeff
comment:2 in reply to: ↑ 1 Changed 10 years ago by eartmt
Sorry about that, should be readable now.
comment:3 Changed 10 years ago by willie
Hi Tomek,
I normally run a standard reconfiguration job on GRIB files before doing anything else with them. If you take a copy of the job xdkeb (user 'umui') and run this on your GRIB file you will get a new reconfigured start dump in UM format. You can then use this with xgbhe.
Regards,
Willie
comment:4 Changed 10 years ago by eartmt
Hi Willie,
I reconfigured the startdump with xdkeb (copied to xgbhb) as you suggested, but xgbhe is still failing at reconfiguration, this time with a lot of messages like:
REPLANCA: UPDATE REQUIRED FOR FIELD 1 : Land-Sea Mask
I also reconfigured the start dump with another standard job (xdkea), but got exactly the same problem using it. The leave file from my last attempt today is:
~eartmt/um/umui_out/xgbhe000.xgbhe.d11217.t163641.leave
Regards,
Tomek
comment:5 Changed 10 years ago by willie
- Owner changed from um_support to willie
- Status changed from new to assigned
Hi Tomek,
If you look for errors earlier in the file, you will see:
[0] ERROR - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID_DREQ:MDD_INV)
Rank 0 [Fri Aug 5 15:40:13 2011] [c1-1c1s0n1] Fatal error in MPI_Testall: Other MPI error, error stack:
MPI_Testall(251)...............: MPI_Testall(count=95, req_array=0x7ffffff15700, flag=0x7ffffff14f74, status_array=0x7ffffff14f90) failed
MPIDI_CH3I_Progress(150).......:
MPID_nem_mpich2_test_recv(790).:
MPID_nem_gni_poll(1276)........:
MPID_nem_gni_check_localCQ(560): unrecoverable network error
This is the real problem. Your job did not have this on 28/July at 12:30, but all subsequent runs do have this. Could you look in the edit history and let me know what changed?
Regards,
Willie
comment:6 Changed 10 years ago by eartmt
Hi Willie,
I'm away this week and can't access umui, but AFAIR the successful job was with the original start dump (as in xgbh.d) with no other changes. You can also diff xgbh.e with xgbh.d (this one works, as far as I can tell, and was the base for xgbh.e) to see exactly what differs.
Tomek
comment:7 Changed 10 years ago by willie
Hi Tomek,
If you look at the bottom of the output file, the reconfiguration has failed:
ERROR!!! in reconfiguration in routine Rcf_Set_Data_Source
Error Code:- 30
Error Message:- Section 0 Item 101 : Required field is not in input dump!
Naturally, this is because the start dump is based on a GRIB file, which contains only the minimum needed to start the model.
To get round this you need to create a user STASH entry for this item (= SO2 mass mixing ratio). Then you need to make an entry in the STASH > Initialisation of User Prognostics table. Option 3 (set to zero) or 7 (initialise from ancillary) are the likely choices.
Regards,
Willie
comment:8 Changed 10 years ago by eartmt
Hi Willie,
Where do you see that error? I grepped through all the *.leave files in my home directory and the only one that mentions Error Code 30 is for the completely unrelated vn7.8 job xgbht, and even there it is Item 9 rather than 101 that is missing, so it is still not the same error message.
The start dump /work/n02/n02/eartmt/xgbhb/xgbhb.astart for xgbhe job was produced with the standard vn6.1 reconfiguration job, as you suggested, and except for the BUFFIN8 thingy I don't see any error in the output of that reconfiguration job:
~eartmt/um/umui_out/xgbhb000.xgbhb.d11216.t165743.leave
Apologies if I'm blind, but I just can't find that error.
Regards,
Tomek
comment:9 Changed 10 years ago by eartmt
I get the error about 'Section 0 Item 101…' when I use this xgbhb.astart start dump with the vn7.8 job xgbht, so I will try your suggestion for that one. However, the job xgbhe (vn6.6.3), which this ticket is about, does not seem to have the same issue.
Tomek
comment:10 Changed 10 years ago by willie
Hi Tomek,
Sorry I was quite clear. It is in the output directory /work/n02/n02/eartmt/xgbhe/xgbhe.fort6.rcfa.pe1, at the bottom of the file.
Regards,
Willie
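Since the message turned up in a per-PE output file rather than in the .leave file, it can help to search every file in the job's output directory for it. The following is only a minimal Python sketch: the regular expression is modelled on the single message quoted in comment:7, and the function names `find_missing_fields` and `scan_directory` are made up for illustration. Other UM versions may word the message differently.

```python
import re
from pathlib import Path

# Pattern modelled on the message quoted in comment:7 (an assumption about
# the general format). \s also matches newlines, so this works whether the
# message sits on one line or is spread over three.
MISSING_FIELD = re.compile(
    r"Error Code:-\s*(\d+)\s+Error Message:-\s*Section\s+(\d+)\s+Item\s+(\d+)"
)


def find_missing_fields(text):
    """Return (error_code, section, item) tuples found in one file's text."""
    return [tuple(int(g) for g in m.groups()) for m in MISSING_FIELD.finditer(text)]


def scan_directory(directory, pattern="*"):
    """Map each matching file name to the missing-field errors it contains."""
    hits = {}
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        found = find_missing_fields(path.read_text(errors="replace"))
        if found:
            hits[path.name] = found
    return hits


# Sample text copied from the error quoted in this ticket.
sample = (
    "ERROR!!! in reconfiguration in routine Rcf_Set_Data_Source\n"
    "Error Code:- 30\n"
    "Error Message:- Section 0 Item 101 : Required field is not in input dump!\n"
)
print(find_missing_fields(sample))
```

Calling `scan_directory` on the job's output directory with a pattern such as `"*fort6*"` would then list which files report a missing field, and which section and item each one names.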
comment:11 Changed 10 years ago by willie
Oops, that should read "wasn't quite clear".
Willie
comment:12 Changed 10 years ago by eartmt
Hi Willie,
So I created a user STASH file with the missing item 101 and the job went a bit further, but only a little, failing with missing item 102. So I added that and it failed with item 103, and so on. After a while I fired up xconv and started to compare the content of the original start dump with the ECMWF-generated one. After adding 61 items in section 0 and 49 in section 33, all initialized to zero, I finally got past the missing data issue. I wonder, however, whether there is a better way to figure out which fields are missing?
Tomek
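One way to avoid finding the missing items one failure at a time is to treat the comparison as a set difference: list the (section, item) pairs in a start dump that works, list those in the ECMWF-derived dump, and subtract. The sketch below is a minimal illustration in Python and not a UM tool. It assumes the two lists have already been obtained somehow, for example transcribed from the field listings compared in xconv. The function name `missing_fields` and the sample values are hypothetical.

```python
def missing_fields(reference_fields, candidate_fields):
    """Group (section, item) pairs that are in the reference dump but not in
    the candidate dump, keyed by section with items in ascending order."""
    missing = set(reference_fields) - set(candidate_fields)
    by_section = {}
    for section, item in sorted(missing):
        by_section.setdefault(section, []).append(item)
    return by_section


# Made-up field lists for illustration only; real lists would come from
# the two start dumps being compared.
working_dump = [(0, 2), (0, 3), (0, 101), (0, 102), (0, 103), (33, 1), (33, 2)]
ecmwf_dump = [(0, 2), (0, 3), (33, 1)]

for section, items in missing_fields(working_dump, ecmwf_dump).items():
    print(f"section {section}: items {items}")
```

The result gives the full list of candidate user STASH entries in one pass. Whether every one of those fields is actually required by a given job still has to be checked against the job's configuration.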
comment:13 Changed 10 years ago by willie
- Resolution set to fixed
- Status changed from assigned to closed
Hi Tomek,
Your file permissions on Hector don't allow anyone else to view your files. Running these commands on Hector will allow people in the n02 group to read your files.
chmod -R g+rX /home/n02/n02/eartmt
chmod -R g+rX /work/n02/n02/eartmt
Jeff.
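For anyone unsure what the capital X in g+rX does: it adds group execute (search) permission only to directories and to files that are already executable by someone, while r adds group read everywhere. The Python sketch below mimics that rule on a throwaway temporary directory so the effect can be seen without touching real files. The function names are made up for illustration, and the sketch skips symbolic links rather than reproducing every detail of chmod -R.

```python
import os
import stat
import tempfile


def add_group_read(path):
    """Apply the equivalent of `chmod g+rX` to one path (symlinks skipped)."""
    if os.path.islink(path):
        return None
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode | stat.S_IRGRP
    already_executable = mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    # Capital X: add group execute only for directories or executable files.
    if os.path.isdir(path) or already_executable:
        new_mode |= stat.S_IXGRP
    os.chmod(path, new_mode)
    return new_mode


def add_group_read_recursive(top):
    """Apply add_group_read to `top` and everything beneath it."""
    add_group_read(top)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            add_group_read(os.path.join(dirpath, name))


# Demonstration on a temporary directory with a placeholder file.
with tempfile.TemporaryDirectory() as top:
    data = os.path.join(top, "dump.astart")
    with open(data, "w") as handle:
        handle.write("placeholder")
    os.chmod(data, 0o600)
    os.chmod(top, 0o700)
    add_group_read_recursive(top)
    print(oct(stat.S_IMODE(os.stat(top).st_mode)))   # directory: group gains r and x
    print(oct(stat.S_IMODE(os.stat(data).st_mode)))  # plain file: group gains r only
```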