Opened 9 years ago

Closed 9 years ago

#381 closed error (fixed)

Problem with global job for LAM

Reported by: a.elvidge
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: 7.1

Description

Hi,

I am attempting to run a global job (xenoa) to generate LBC's for a LAM.

It is failing at the compilation stage with the following error:

gmake: *** No rule to make target `boundval.done', needed by `atm_step.done'.  Stop.
gmake: *** Waiting for unfinished jobs....
gmake -s -j 8 all failed (2) at /work/n02/n02/hum/fcm/lib/Fcm/Build.pm line 625
Build failed on Fri Jan 29 13:56:21 2010.

I'm not sure what I've done wrong.

Thanks, Andy

Change History (23)

comment:1 Changed 9 years ago by ros

Hi Andy,

The most common cause of this type of error is an incorrectly set DEF (one of the flags that indicate which code to include or exclude), which means that code that is needed has been excluded, or vice versa.

I suggest you follow the advice in the FAQ on the UM Trac Wiki page (see No Rule To Make Target Error), which I hope will help you solve your problem. I'm unable to check your job right now, but I would guess you have a LAM switch set that conflicts with the Global settings.

Let us know if you still have problems.

comment:2 Changed 9 years ago by a.elvidge

Hi Ros,

My error is
gmake: *** No rule to make target `boundval.done', needed by `atm_step.done'

On that FAQ page it says

"gmake: *** No rule to make target <file1>.done, needed by <file2>.done.

An error message like this in a .comp.leave file is likely to be caused by a dodgy preprocessor key somewhere. Check the <file1> f90 file in <expt_dir>/ummodel/ppsrc/UM/…../<file1>.f90 on HECToR. If this file is empty then that is the problem. "

However, I have no file in any of the subdirectories of <expt_dir>/umbase/ppsrc/UM/ named 'boundval.F90'.

The FAQ goes on to say
"Go to <expt_dir>/umbase/src/UM/…./<file1>.F90."

However, my <expt_dir>/ummodel/src directory is completely empty.

Thanks, Andy

comment:3 Changed 9 years ago by ros

Hi Andy,

The files you needed to look at are xenoa/ummodel/ppsrc/UM/atmosphere/lbc_input/boundval-boundva0.f90 and boundval-boundva1.f90 which are both empty. There are multiple versions of the boundval routine so the file names are slightly different to what the FAQ stated (I've tried to make that clearer now).
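The check described above (looking for preprocessed files that came out empty) can be sketched in Python. This is only an illustration: the function name `find_empty_sources` is made up for this example, and it assumes the preprocessed files end in `.f90` and start with the routine name, as in the boundval case.

```python
from pathlib import Path


def find_empty_sources(ppsrc_dir, stem):
    """Return preprocessed .f90 files starting with `stem` that contain no code."""
    empty = []
    for path in sorted(Path(ppsrc_dir).rglob(f"{stem}*.f90")):
        # A file that is zero bytes or whitespace only means the
        # preprocessor excluded every line of the routine.
        if not path.read_text(errors="replace").strip():
            empty.append(path)
    return empty
```

Calling `find_empty_sources("xenoa/ummodel/ppsrc/UM", "boundval")` would list any empty boundval variants, whatever suffix the file names carry.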

If you then go and look at xenoa/umbase/src/UM/atmosphere/lbc_input/boundval-boundva1.F90 (note it's umbase not ummodel directory) you will see that in order for there to be any code present the DEFS A31_1A, ATMOS need to be set and GLOBAL must NOT be set.
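As a rough illustration of that condition, the rule can be written as a small Python predicate. It encodes only what is stated above (A31_1A and ATMOS set, GLOBAL not set); it is not derived from the actual preprocessor directives in the file, and the function name is hypothetical.

```python
def boundval_has_code(defs):
    """True if the given DEFs would leave code in boundval-boundva1.F90,
    according to the condition stated in this ticket (an assumption,
    not a reading of the real directives)."""
    defs = set(defs)
    return {"A31_1A", "ATMOS"} <= defs and "GLOBAL" not in defs
```

A job with GLOBAL set fails this test regardless of the other two flags, which is why the routine is preprocessed to an empty file here.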

This job has a mixture of LAM and GLOBAL settings which is why you are seeing this error. It's not recommended to try and convert a LAM job to a GLOBAL job. Grenville will advise you further.

comment:4 Changed 9 years ago by a.elvidge

Thanks Ros
I'll abandon the job and start again with a global template.

comment:5 Changed 9 years ago by grenville

Andy

I have an N320 global job that makes boundary conditions. It is a 6.1 job, but the lbcs should be OK for a 7.1 model. The job is xdnka.

Grenville

comment:6 Changed 9 years ago by a.elvidge

Hi Grenville,

I tried using xdnka, but found the following error:

"ERROR!!! in reconfiguration in routine Rcf_Grib_Read_Data
Error Code:- 50
Error Message:- Error 3 returned by DECODE"

I also tried using Willie's 7.1 job xeqfa, which resulted in this error:

"Fault address is 4194168 bytes below the first valid mapping boundary, which is at 0x400000.

This may have been caused by a struct access through a null pointer."

I cannot find anything in previous tickets to help me.

Help is much appreciated.

Thanks, Andy

comment:7 Changed 9 years ago by willie

Hi Andy,

It looks like you're trying to use ECMWF GRIB files. These need to be reconfigured before doing any global runs. Try UMUI job xdkeb. Just run this on your GRIB file and you'll get a start dump with all the correct fields.

I hope that helps.
Regards,

Willie

comment:8 Changed 9 years ago by a.elvidge

Thanks Willie,

I have used xdkeb to reconfigure my dump. I get the following error.

C I/O Error: failed in BUFFIN8
Return code = 1

Despite this, an astart file is still generated. Using this to initialise my global job I get the following error:

lib-4001 : UNRECOVERABLE library error
Unable to find error message (check NLSPATH, file lib.cat)

Encountered during a namelist READ from unit 90
Fortran unit 90 is connected to a sequential formatted text file: "fort.90"

Any ideas as to why I am getting these errors?

Thanks, Andy

comment:9 Changed 9 years ago by willie

Hi Andy,

I have always ignored the BUFFIN8 errors, and the start dumps have proved OK in the past.

I am not sure which job you are running, but both xenod and xenoe yield "check setup" errors in the UMUI which need to be corrected.

Regards,

Willie

comment:10 Changed 9 years ago by ros

Andy,

If you still get the NLSPATH error once you've fixed the problems identified by "check setup", you'll need to set environment variable NLSPATH in your job, as detailed below, to obtain a more informative message about the namelist problem.


Copy the file /opt/pathscale/lib/3.1/lib.cat into your /work file space and, in the UMUI window "Input/Output Control and Resources" → "Script Inserts and Modifications", set the environment variable NLSPATH to point to it.

comment:11 Changed 9 years ago by a.elvidge

Having rectified the 'check setup' problem I am now getting this:

=>> PBS: job killed: walltime 1210 exceeded limit 1200

Does this mean my job took too long to run? I have just resubmitted using more processors, but I'm not sure whether this is the solution.

Thanks, Andy

comment:12 Changed 9 years ago by a.elvidge

With more processors I am still getting the same error.

Andy

comment:13 Changed 9 years ago by willie

Hi Andy,

In the UMUI page Submodel Independent > Job submission you have set a time limit of 1200 seconds. You could increase this until it works. The standard jobs use 3600 seconds, so if you have made no changes this would be a good starting point.
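For reference, the two numbers in the PBS message can be pulled out with a short Python sketch. The function name is hypothetical, and the pattern assumes the message wording quoted earlier in this ticket.

```python
import re


def parse_walltime_kill(line):
    """Return (elapsed_seconds, limit_seconds) from a PBS walltime kill
    message, or None if the line is not such a message."""
    match = re.search(r"walltime (\d+) exceeded limit (\d+)", line)
    if match is None:
        return None
    elapsed, limit = map(int, match.groups())
    return elapsed, limit
```

Note that the first number is only how long the job had run when it was killed, not how long it needs, so it gives no guidance on the new limit beyond "more than this".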

Regards,

Willie

comment:14 Changed 9 years ago by a.elvidge

Thanks, the global job is running fine now. The 12km LAM also works; however, I have to remove the ozone ancil file for it to run. With ozone configured I get the following error:

Ancillary File does not exist.
File : /work/n02/n02/aelvidge/UMAncil/AP12km/
qrclim.ozone_L38_O29
Got ancillary file name from Env Var OZONE
*
ERROR!!! in reconfiguration in routine Calc_nlookups
Error Code:- 10
Error Message:- Ancillary files have not been found - Check output for details
Error generated from processor 0
*

However, the file /work/n02/n02/aelvidge/UMAncil/AP12km/qrclim.ozone_L38_O29 does exist! Any ideas?

This warning message also appears:

WARNING in reconfiguration in routine rcf_h_int_init_bl
Warning Code:- -10
Warning Message:- Interpolating ozone from zonal to full field
Warning generated from processor 0

Would you recommend setting the ozone to zonal averaged rather than full field?

For the 4km LAM, I am getting this error:

*
ERROR!!! in reconfiguration in routine Rcf_Ancil_Atmos
Error Code:- 207
Error Message:- REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
Error generated from processor 0
*

Any help much appreciated.

Thanks, Andy

comment:15 Changed 9 years ago by grenville

Andy

Please check again. I think the O29 at the end of the file name (/work/n02/n02/aelvidge/UMAncil/AP12km/qrclim.ozone_L38_O29) is Oh 29, not zero 29.
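This kind of mix-up is hard to spot by eye, so a small Python sketch like the following can help. It is an illustration only: the function name and the set of look-alike characters are assumptions, and it simply lists files in the same directory whose names match the configured one once those characters are treated as equal.

```python
from pathlib import Path

# Characters that are easy to mistake for one another in many fonts.
_CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def lookalike_files(expected_path):
    """If `expected_path` does not exist, return names in its directory
    that differ from it only by look-alike characters."""
    expected = Path(expected_path)
    if expected.exists() or not expected.parent.is_dir():
        return []
    key = expected.name.translate(_CONFUSABLE)
    return sorted(
        entry.name
        for entry in expected.parent.iterdir()
        if entry.name.translate(_CONFUSABLE) == key
    )
```

Passing the path exactly as it is set in the job (for example the value of the OZONE variable) returns the name of the file actually on disk if the two differ only by O versus 0.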

Grenville

comment:16 Changed 9 years ago by a.elvidge

Thanks Grenville, stupid mistake. That should have sorted the 12km one, but I did not make the same mistake with the 4km LAM, so that error (see above) remains..?

Thanks, Andy

comment:17 Changed 9 years ago by a.elvidge

Actually, now that I am configuring the ozone ancil file, I am getting the same error for the 12km LAM as for the 4km LAM, suggesting that both have a problem with their ozone ancil files:

*
ERROR!!! in reconfiguration in routine Rcf_Ancil_Atmos
Error Code:- 207
Error Message:- REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
Error generated from processor 0
*

Switching off the ozone ancil allows the 12km job to work.

Cheers, Andy

comment:18 Changed 9 years ago by a.elvidge

Hi,

I am now able to get the 4km LAM to run, though again only with the ozone ancillary switched off. However, looking at the data in xconv, there is clearly something wrong, with most of the output being NaNs (although 'temperature on theta levels' seems to come out fine). I have tried running the job with a different domain, but get the same data instability.

Any suggestions?

Cheers, Andy

comment:19 Changed 9 years ago by grenville

Andy

There is still something wrong with the reconfiguration of orography gradients and standard deviations. Try switching off these options in the orography screen and reconfigure the orography only.

Grenville

comment:20 Changed 9 years ago by a.elvidge

Hi Grenville,

I tried what you suggested with xenoc and xenoo (both 4km LAMs but with different domains) but am still getting the same problem. Can you think of anything else to try?

Thanks, Andy

comment:21 Changed 9 years ago by grenville

Andy

The surface fields look better. I notice that you have a 100sec timestep. Try reducing that to 30 sec. Your job fails after the first timestep - search the leave file for the word "converge", and you'll see that "RHS zero.." happens at timestep 2. It's worth reducing the run time to say a few minutes and setting cpu time to 1200 secs to get better turn around.
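The search of the leave file suggested above can be done with grep, or with a few lines of Python as sketched here. The function name is hypothetical; it just reports the line number and text of every line containing the keyword, ignoring case.

```python
def find_keyword(lines, keyword="converge"):
    """Return (line_number, text) for each line containing `keyword`,
    matched case-insensitively. `lines` is any iterable of strings,
    such as an open file."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if keyword.lower() in line.lower():
            hits.append((number, line.rstrip("\n")))
    return hits
```

With the leave file open as `fh`, `find_keyword(fh)` gives the places to look; the surrounding lines should then show at which timestep the solver stopped converging.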

Grenville

comment:22 Changed 9 years ago by a.elvidge

Hi,

Having changed the template jobs and reduced the time step to 30 secs, the jobs now all run successfully. Thanks to all who helped (I guess this ticket can now be closed).

Thanks, Andy

comment:23 Changed 9 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed