Opened 3 years ago

Closed 2 years ago

#2193 closed help (fixed)

Problem with Euro4km Reconfiguration Job

Reported by: sam89
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 8.2

Description

Hi

I am trying to run a reconfiguration job for a Euro 4km run (xnltc). I keep getting an error related to MPI_Init, though, which can be seen in:

/home/d02/saclarxnltc000.xnltc.d17155.t133621.rcf.leave

I had this error for my Global job and the fix was to enable modifications from a branch. I made sure this was switched on in this job, but it did not fix the issue this time.

Do you have any idea how to fix this?

Thanks

Sam

Change History (16)

comment:1 Changed 3 years ago by willie

Hi Sam,

You'll need to include the NCAS branch fcm:um_br/pkg/Config/vn8.2_ncas/src in the table and switch it on, just as in #2151.

Regards
Willie

comment:2 Changed 3 years ago by sam89

Hi Willie

Thanks for that. It seems to have worked now, but I just checked the .rcf file and it says this towards the end:
TIMER has detected non-fatal error

What would cause this and does it mean the run has still worked?

xnltc000.xnltc.d17156.t200631.rcf.leave

Many thanks

Sam

comment:3 Changed 3 years ago by willie

Hi Sam,

The reconfiguration has reached the end of the program - this is a good sign. It says the error is non-fatal and then switches off the timer for the rest of the program, so it is unlikely to do any harm. It's a good idea to look at the reconfigured start dump in xconv just to check a few fields.

Regards
Willie

comment:4 Changed 3 years ago by sam89

Hi Willie,

The start dump seems fine. I am now doing the run job but I have come across an error which I don't recognise:

Error in routine: inbounda
? Error Code: 6
? Error Message: INBOUNDA : Mis-match in height generator method.
? Error generated from processor: 0
? This run generated 0 warnings

Have I made a mistake in the STASH or is it related to something else?

The job is xnltb. Output file is: xnltb000.xnltb.d17156.t203732.leave

Thanks,

Sam

comment:5 Changed 3 years ago by willie

Hi Sam,

If you look at the bottom of the pe_output files, just before the error above, it says

 LBC Integer Header Mismatch:
 LBC  : First rho level with constant height:  50
 Model: First rho level with constant height:  62

So it is complaining about the LBC input file. You have used

/projects/diamet/saclar/lbc/Euro4km_LBC_Original

I'm not sure how you obtained this, but you need to rerun the global model and ensure that it creates LBCs on the Euro4 domain and using the right vertical levels set. It looks like you need vertlevs_UK4_L70 for this - if you look inside this file you will see it has a first constant rho level of 62, which is compatible with the Euro4 model.
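
For anyone checking this kind of mismatch by hand, the comparison can be sketched in a few lines of Python. This is only an illustration: it assumes the pe_output lines have the "LBC :" / "Model:" layout quoted above, and the function name header_mismatches is made up for this example.

```python
import re

# Sample text copied from the pe_output excerpt quoted in this comment.
SAMPLE = """\
 LBC Integer Header Mismatch:
 LBC  : First rho level with constant height:  50
 Model: First rho level with constant height:  62
"""

# Matches lines such as " LBC  : <description>:  <integer>".
PATTERN = re.compile(r"^\s*(LBC|Model)\s*:\s*(.+?):\s*(\d+)\s*$")

def header_mismatches(text):
    """Return {description: (lbc_value, model_value)} for values that differ."""
    seen = {}
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            source, description, value = match.groups()
            seen.setdefault(description, {})[source] = int(value)
    return {
        description: (values["LBC"], values["Model"])
        for description, values in seen.items()
        if "LBC" in values and "Model" in values
        and values["LBC"] != values["Model"]
    }

print(header_mismatches(SAMPLE))
```

Any entry in the returned dictionary means the LBC file and the model disagree on that header value, as they do here for the first constant rho level.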

Regards,
Willie

comment:6 Changed 3 years ago by sam89

Hi Willie

I used makebc to create the LBCs and, as far as I am aware, I used that vertical levels file when I created them.
I will contact Grenville to ask him about this.

Thanks

comment:7 Changed 2 years ago by sam89

Hi Willie

I recreated the LBC file, but when I run the job it says there is the wrong number of fields:
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: inbounda
? Error Code: 2
? Error Message: CHK_LOOK_BOUNDA : Wrong number of LBC fields
? Error generated from processor: 0
? This run generated 1 warnings
????????????????????????????????????????????????????????????????????????????????

I compared the LBC file to an old one I created and it appears to be missing the u, v and w advected winds. I don't know why they would be missing, as I created the LBCs in the same way using the same script, and the dumps I am using contain these three variables, so they should be included when the LBC file is created.

See /projects/diamet/saclar/lbc/submissionscript_Euro4km
The job is xnltb.

Thanks

Sam

comment:8 Changed 2 years ago by willie

Hi Sam,

This was a problem with makebc. I have built the LBC file for you - see /projects/umadmin/frmy/makebc_issue2193. The problem was that the variable L_ADV_WINDS_ON was not set in the DUMP2BOUND namelist. This is not clear from the documentation but it does appear in the output listing.
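
A quick way to check for this before running makebc is to scan the namelist text for the variable. The Python sketch below is only an illustration: the sample namelist contents and the value .TRUE. are assumptions made for the example (the ticket only says the variable was not set), and the helper name namelist_value is hypothetical.

```python
import re

# HYPOTHETICAL namelist text for illustration; a real makebc namelist
# will contain other variables as well.
SAMPLE = """\
 &DUMP2BOUND
  L_ADV_WINDS_ON=.TRUE.,
 /
"""

def namelist_value(text, group, name):
    """Return the raw value assigned to `name` in namelist `group`, or None."""
    block = re.search(
        r"&" + re.escape(group) + r"\b(.*?)(?:^\s*/|&END)",
        text,
        re.IGNORECASE | re.DOTALL | re.MULTILINE,
    )
    if block is None:
        return None
    item = re.search(
        r"\b" + re.escape(name) + r"\s*=\s*([^,\n/]+)",
        block.group(1),
        re.IGNORECASE,
    )
    return item.group(1).strip() if item else None

value = namelist_value(SAMPLE, "DUMP2BOUND", "L_ADV_WINDS_ON")
if value is None:
    print("L_ADV_WINDS_ON is not set in DUMP2BOUND")
else:
    print("L_ADV_WINDS_ON =", value)
```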

Regards
Willie

comment:9 Changed 2 years ago by sam89

Hi Willie

Thanks for doing that. I tried the run again. The .leave file reports that it reached the end of the LBC file, but it doesn't say what the actual error is, so I am not sure whether reaching the end is O.K. or a real error. Could you check for me?

The .leave file is /home/d02/saclar/output/xnltb000.xnltb.d17164.t184304.leave

Thanks!

Sam

comment:10 Changed 2 years ago by willie

Hi Sam,

The job has been killed because it has run out of time:

_pmiu_daemon(SIGCHLD): [NID 01732] [c9-0c0s1n0] [Tue Jun 13 19:34:18 2017] PE RANK 261 exit signal Killed
[NID 01732] 2017-06-13 19:35:29 Apid 20557469: initiated application termination
=>> PBS: job killed: walltime 10045 exceeded limit 10000

It's only done 756 steps, so you need to scale up your existing wall time by 864/756 and add a bit to account for load. I would try 4 hours, which is the maximum for the 'normal' queue.
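
As a rough illustration of that scaling, here is a minimal Python sketch. It assumes the cost per step is roughly constant; the 10% margin is a hypothetical allowance for load, not a value taken from this ticket, and the function name is made up for the example.

```python
import math

def scaled_walltime(limit_s, steps_done, steps_total, margin=0.10):
    """Estimate the wall time in seconds needed to complete all steps.

    Assumes a roughly constant cost per step. `margin` is a HYPOTHETICAL
    safety factor for machine load.
    """
    base = limit_s * steps_total / steps_done
    return math.ceil(base * (1.0 + margin))

# Figures from the PBS message above: limit 10000 s, 756 of 864 steps done.
needed = scaled_walltime(10000, 756, 864)
print(needed, "s, about", round(needed / 3600, 2), "hours")
```

With these figures the estimate stays below the 4 hour limit of the 'normal' queue.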

Regards
Willie

comment:11 Changed 2 years ago by sam89

Hi Willie

I tried this again. The job kept failing when I requested a longer wall time, so I ended up using 3 hrs 57 mins, and it failed again because it ran out of wall time. I don't remember having this issue on the IBM when I ran the same job, which is strange. I guess I should resubmit the job after 12 hours or so instead of running it for 24 hours in a row?

xnltb000.xnltb.d17166.t193324.leave

Thanks

Sam

comment:12 Changed 2 years ago by willie

Hi Sam,

It is still only doing 756 steps, which is 21 hours. This is because you've set the start date to be 2012-07-05 and three hours. Set it to 2012-07-05 and it should work. The start dump, start date and the LBC need to align.
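
The arithmetic behind this can be sketched as follows. It assumes the LBC file covers 24 hours starting at 2012-07-05 00Z, which is inferred from the 864-step target mentioned earlier rather than stated in the ticket; the names used are illustrative only.

```python
# Figures quoted in this ticket: 756 steps corresponds to 21 hours.
STEPS_DONE = 756
HOURS_DONE = 21
STEPS_PER_HOUR = STEPS_DONE // HOURS_DONE  # 36 steps per hour (100 s timestep)

def steps_before_lbc_ends(lbc_hours, start_offset_hours):
    """Number of model steps that fit before the LBC data runs out."""
    return (lbc_hours - start_offset_hours) * STEPS_PER_HOUR

# ASSUMPTION: the LBC file spans 24 hours from 2012-07-05 00Z.
print(steps_before_lbc_ends(24, 3))  # start date set three hours late
print(steps_before_lbc_ends(24, 0))  # start date aligned with the LBC file
```

Under that assumption, starting three hours late leaves room for only 756 steps, which matches what the job actually completed.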

Regards
Willie

comment:13 Changed 2 years ago by sam89

Hi Willie

Sorry, that is such a silly mistake! I actually have another issue now, though.
I am trying to reconfigure a start dump from that run at 18 UTC, as I need it on the Global vertical levels. When I try to reconfigure onto the Global vertical levels, I get this error in the reconfig output:

? Error Message: Mismatch in LEVDEPC between model and Ancillary File.

I asked Sue and she suggested switching the orography ancillaries off, which I thought I had done completely, but it is still giving the same error.

The job is xnlte…

Thanks

Sam

comment:14 Changed 2 years ago by sam89

Hi Willie

I solved the issue!

Sam

comment:15 Changed 2 years ago by willie

Excellent. Well done.
Willie

comment:16 Changed 2 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed