Opened 3 years ago

Closed 3 years ago

#1976 closed help (answered)

Fields in ancillary files need updating

Reported by: dilshadshawki
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: 8.4

Description

Dear Helpdesk,

I am trying to run a UKCA job at UM version 8.4. The jobid is xlqqg and the .leave file can be found here:

/home/dshawk/output/xlqqg000.xlqqg.d16260.t124054.rcf.leave

Towards the end of the .leave file it states that several fields in various ancillary files need updating (the files are specified in the UMUI under Atmosphere → Ancillary and input data files → Climatologies & potential climatologies).

Can you please explain why these files need updating and how this can be done?

Many thanks,
Dill

Change History (29)

comment:1 Changed 3 years ago by grenville

Dill

The program is doing the updating, though it's having a problem with an input file, possibly /home/dshawk/ancil/for_gaurav/Surf_sin_lev_ancil_09_11. Is that a file you created?

Grenville

comment:2 Changed 3 years ago by dilshadshawki

Yes, I thought that might be the file causing the problem. I didn't create it myself; another student I was working with created it and sent it to me today, because the original file had gone missing from my directory /home/dshawk/ancil/for_gaurav.

Do you know how this can be fixed?

Dill

comment:3 Changed 3 years ago by grenville

Please check with MONSooN. They may be able to retrieve your missing file, since /home is backed up regularly.

comment:4 Changed 3 years ago by dilshadshawki

Ok thanks, do you know who I should speak to specifically? Is it Mohit Dalvi?

Dill

comment:5 Changed 3 years ago by ros

Hi Dill,

Please contact the MONSooN team on monsoon@….

Regards,
Ros

comment:6 Changed 3 years ago by dilshadshawki

Hello,

The MONSooN team were not able to recover the file.

I tried to make the file again using Xancil to convert this NetCDF file:

/home/dshawk/ancil/for_gaurav/surf_level_emsns_file_nmv.nc

To this ancil file:

/home/dshawk/ancil/for_gaurav/surf_level_emsns_nmv.anc

I then get an error similar to the one above, this time in the following .leave file:
/home/dshawk/output/xlqqg000.xlqqg.d16265.t164540.rcf.leave

IO: Open: /home/dshawk/ancil/for_gaurav/surf_level_emsns_nmv.anc on unit  12
IO: from environment variable USRANCIL
replanca_rcf_replanca: UPDATE REQUIRED FOR FIELD    48 : User Ancillary Field 1
  Error in replanca_rcf_replanca
  CMESSAGE replanca_rcf_replanca: Non-standard period for periodic data
  ErrorStatus  648

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: Rcf_Ancil_Atmos
? Error Code:   648
? Error Message: replanca_rcf_replanca: Non-standard period for periodic data
? Error generated from processor:     0
? This run generated   1 warnings
????????????????????????????????????????????????????????????????????????????????

I'm not exactly sure what it means by "Non-standard period for periodic data". I kept the settings in Xancil as they were and only updated the UM version number (8.4 in this case). I created the ancils by typing in the STASH numbers and updating; the corresponding variable just pops up and it's done.

Any ideas on why it's having a problem with the periodicity?

Dill

comment:7 Changed 3 years ago by dilshadshawki

Hi Helpdesk,

Just chasing you up to see if anyone can help me with the above issue?

Many thanks,
Dill

comment:8 Changed 3 years ago by jeff

Hi Dill

Your ancillary file has a header value which tells the model it's a periodic time series, but this is not correct. In Xancil there is an option "Is Single-level User ancillary file periodic in time?"; this should be set to "no".

Jeff.
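
The header value mentioned above can also be inspected directly. The following is a minimal sketch, not an official UM tool. It assumes the ancillary file starts with a 256-word fixed-length header of 64-bit big-endian integers, and that word 10 is the time-type indicator (0 = single time, 1 = time series, 2 = periodic time series). The layout, the byte order and the function names are all assumptions to check against the UM file-format documentation.

```python
import struct

HEADER_WORDS = 256   # assumed length of the fixed-length header, in 64-bit words
TIME_TYPE_WORD = 10  # assumed 1-based position of the time-type indicator
TIME_TYPES = {0: "single time", 1: "time series", 2: "periodic time series"}


def time_type(header_bytes, byte_order=">"):
    """Return the time-type indicator from raw fixed-length header bytes."""
    size = HEADER_WORDS * 8
    if len(header_bytes) < size:
        raise ValueError("header is shorter than 256 64-bit words")
    words = struct.unpack(f"{byte_order}{HEADER_WORDS}q", header_bytes[:size])
    return words[TIME_TYPE_WORD - 1]


def describe(path):
    """Read the header of the file at `path` and describe its time type."""
    with open(path, "rb") as handle:
        code = time_type(handle.read(HEADER_WORDS * 8))
    return TIME_TYPES.get(code, f"unknown ({code})")
```

If these assumptions hold, `describe("/home/dshawk/ancil/for_gaurav/surf_level_emsns_nmv.anc")` would be expected to report "periodic time series" for the file that triggered error 648, and a different value once the Xancil option is set to "no".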

comment:9 Changed 3 years ago by dilshadshawki

Many thanks Jeff, I will change this and hopefully it should work. I will let you know how it goes.

Best,
Dill

comment:10 Changed 3 years ago by dilshadshawki

Hi Jeff,

I get another error but the message doesn't explain very much:

/home/dshawk/output/xlqqg000.xlqqg.d16272.t130209.rcf.leave
[NID 00014] 2016-09-28 13:03:45 Exec /projects/ukca-imp/dshawk/xlqqg/bin/qxreconf failed: chdir /work/scratch/jtmp/pbs.817557.xcm00.x8z No such file or directory

Any ideas as to what the problem could be?

Dill

comment:11 Changed 3 years ago by jeff

Hi Dill

The reconfiguration program hasn't run at all; it looks like something went wrong with PBS. I suggest trying again.

Jeff.

comment:12 Changed 3 years ago by dilshadshawki

Hi Jeff,

The job did get through the reconfiguration stage but while it was running it couldn't produce any output and eventually exceeded the walltime limit. The .leave file gives this error:

/home/dshawk/output/xlqqg000.xlqqg.d16272.t155641.leave
lib-4324 : UNRECOVERABLE library error
  The variable name '0.00000E+00,' is unrecognized in namelist input.

Encountered during a namelist READ from unit 166
Fortran unit 166 is connected to a sequential formatted text file:
  "/projects/ukca-admin/inputs/spectral/radv2.1/nml_ac_sw"

This error repeats throughout the .leave file. I looked at that specific file on MONSooN, but I'm not sure how to change it, as it is in ukca-admin.

Please help!

Dill

comment:13 Changed 3 years ago by luke

Hi Dill,

Have you checked through the information here:

http://www.ukca.ac.uk/wiki/index.php/MONSooN_IBM_to_Cray_Transition#vn8.4_HadGEM3_GA4.0_UKCA_CheST.2BGLOMAP-mode_RJ4.0

The new Cray compiler required the namelist format to be changed, so there are new versions of these files. Try appending _new to the end of each of these RADAER files.

Also, make sure you're using the hand-edit ~ukca/hand_edits/VN8.4/raderv2.1_vn84_MONSooN.ed. You should be, as you are using RADAER v2.1.

There may be other changes that you need to make as well.

Thanks,
Luke
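
To see which input files the namelist error refers to, the quoted paths can be pulled out of the .leave file text. This is only a sketch: the function names are illustrative, the sample text is copied from the error reported in this ticket, and it does not check that a `_new` version of each file actually exists.

```python
import re

# Matches a quoted absolute path, as printed after
# "connected to a sequential formatted text file:" in the error output.
QUOTED_PATH = re.compile(r'"(/[^"\s]+)"')


def failing_namelists(leave_text):
    """Return the unique quoted paths in the text, in order of first appearance."""
    seen = []
    for path in QUOTED_PATH.findall(leave_text):
        if path not in seen:
            seen.append(path)
    return seen


def new_variant(path, suffix="_new"):
    """Return the path with the suffix appended, as suggested for the RADAER files."""
    return path + suffix


SAMPLE = (
    "Fortran unit 166 is connected to a sequential formatted text file:\n"
    '  "/projects/ukca-admin/inputs/spectral/radv2.1/nml_ac_sw"\n'
    "connected to a sequential formatted text file:\n"
    '  "/projects/ukca-admin/inputs/spectral/radv2.1/nml_ac_sw"\n'
)

for path in failing_namelists(SAMPLE):
    print(path, "->", new_variant(path))
```

Duplicates are dropped because the same message is printed by several processes, so the list stays short even when the error repeats throughout the file.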

comment:14 Changed 3 years ago by dilshadshawki

Hi Luke,

Thank you for sending me that link, it was very helpful. You are correct that this job was originally run on the IBM machines, and it seems that not all of the changes had been made correctly. I have now made the changes to the RADAER files and will get back to you on the progress; hopefully it should all work fine.

Thanks again,

Dill

comment:15 Changed 3 years ago by dilshadshawki

Hi Luke,

The job now manages to get past the reconfiguration stage but gives little information as to why it's crashing, only that the walltime has been exceeded.

See here:

/home/dshawk/output/xlqqg000.xlqqg.d16274.t150510.leave

Any ideas on what's happening? This job has worked before when it was run on the Cray, so I'm guessing the only issue lies in the fact that I've changed those settings to the correct ones and changed the diagnostics to be output via STASH.

Dill

comment:16 Changed 3 years ago by dilshadshawki

Apologies, that was meant to say: the job has worked before when it was run on the IBM machine.

Dill

comment:17 Changed 3 years ago by dilshadshawki

Hi Luke/Helpdesk,

Please could you help me with the issue above?

Many thanks,
Dill

comment:18 Changed 3 years ago by luke

Hi Dill,

I'm very sorry for not replying. The error you're getting is

PBS: job killed: walltime 10833 exceeded limit 10800

i.e. the job ran out of time. You would need to reduce the number of timesteps run in the 3 hours available, or possibly increase the number of nodes to see if that improves matters.

Are you still having this issue?

Thanks,
Luke
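
The PBS message can be read programmatically, as in the sketch below. The regular expression is written against the single line quoted in this ticket and the function name is illustrative. The overrun only shows when PBS killed the job; it does not say how much more time the run would have needed.

```python
import re

WALLTIME = re.compile(r"walltime (\d+) exceeded limit (\d+)")


def walltime_overrun(line):
    """Return (used, limit, overrun) in seconds, or None if the line does not match."""
    match = WALLTIME.search(line)
    if match is None:
        return None
    used, limit = (int(value) for value in match.groups())
    return used, limit, used - limit


line = "PBS: job killed: walltime 10833 exceeded limit 10800"
used, limit, overrun = walltime_overrun(line)
print(limit / 3600)  # queue limit in hours: 3.0
print(overrun)       # seconds past the limit when the job was killed: 33
```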

comment:19 Changed 3 years ago by dilshadshawki

Hi Luke,

I decreased the number of timesteps per day from 72 to 48 (a 20-minute to a 30-minute timestep).
I also increased the number of processors (under the job submission method, from 12 to 16 east-west and from 16 to 20 north-south), which increased the number of nodes from 6 to 10.

This time the job manages to output 24 days before crashing on the 25th day and the error in the .leave file is the same.

Is there anything else I can do?

Best,
Dill
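
The figures in this comment are self-consistent, as the short check below shows. It is only a sketch: the value of 32 cores per node is inferred from 192 processors on 6 nodes rather than stated in the ticket, and the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def steps_per_day(timestep_minutes):
    """Number of model timesteps in one model day."""
    return MINUTES_PER_DAY // timestep_minutes


def total_processors(east_west, north_south):
    """Size of the processor decomposition."""
    return east_west * north_south


print(steps_per_day(20), steps_per_day(30))  # 72 48

original = total_processors(12, 16)   # 192
increased = total_processors(16, 20)  # 320

cores_per_node = original // 6        # inferred from 6 nodes: 32
print(increased // cores_per_node)    # 10 nodes
```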

comment:20 Changed 3 years ago by luke

Hi Dill,

I'm sorry I wasn't clear above. I didn't mean change the model timestep; I meant perform fewer timesteps in the time allowed. You shouldn't change the timestep (as you have done, by increasing its length) because, for L85 jobs, a 30-minute timestep is too long and can lead to instabilities. Also, changes to the number of processors need to be tested incrementally to see what works best: if you increase it too much the run will be inefficient, or occasionally even take longer! See the UKCA website for examples done on the old MONSooN IBM and ARCHER.

What I meant for you to try was initially to run for only 20 days rather than 30 in a job-step. Take your original job and go to

Model Selection -> Input/Output Control & Resources -> Resubmission Pattern

and set the target run length to be 0 months and 20 days. This should mean that the job will fit in the 3-hour queue.

Thanks,
Luke

comment:21 Changed 3 years ago by dilshadshawki

Hi Luke,

Thank you for clarifying. I changed the initial run to 20 days as you advised, returned the timestep to 20 minutes and set the number of processors back to the original values. However, the run still crashes at 16 days. Would you advise increasing the processors this time in the hope that it will get to 20 days? Otherwise, any other ideas?

Thanks,
Dill

comment:22 Changed 3 years ago by luke

Hi Dill,

What is the error you get - is it the walltime limit exceeded, or something else?

Thanks,
Luke

comment:23 Changed 3 years ago by dilshadshawki

Yes exactly, walltime limit exceeded.

Dill

comment:24 Changed 3 years ago by luke

OK - try 10 days and see what happens.

L
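
A rough estimate from the numbers reported so far suggests why 10 days is a reasonable next step. The sketch below assumes the cost per model day is constant and ignores start-up and dump-writing overheads, which the ticket does not quantify. The 16 days completed within the 10800-second limit is taken from the previous attempt, and the function names are illustrative.

```python
WALLTIME_LIMIT_S = 10800  # 3-hour queue limit from the PBS message


def seconds_per_model_day(days_completed, walltime_s=WALLTIME_LIMIT_S):
    """Upper estimate of the cost of one model day from a run that hit the limit."""
    return walltime_s / days_completed


def estimated_runtime(days, days_completed, walltime_s=WALLTIME_LIMIT_S):
    """Estimated wallclock seconds for a job-step of `days` model days."""
    return days * seconds_per_model_day(days_completed, walltime_s)


for days in (10, 20, 30):
    runtime = estimated_runtime(days, days_completed=16)
    print(days, runtime, runtime <= WALLTIME_LIMIT_S)
```

On these assumptions a 10-day job-step needs about 6750 seconds, while 20 and 30 days both exceed the limit.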

comment:25 Changed 3 years ago by dilshadshawki

Hi Luke,

It successfully ran for 10 days and I checked that the output is all there. I am now going to do a continuation run, since the job is meant to run for two years. A continuation run with a 10-day resubmission means that it will automatically resubmit every 10 days, so it would run more slowly but at least it won't crash, right?

Thanks,
Dill

comment:26 Changed 3 years ago by luke

Hi Dill,

Yes, that's right. It should run fine, just slowly.

Can I check exactly what changes you've made to the release job to make it run this slowly? It should fit 30 days in the 3 hour queue if you haven't changed much.

Thanks,
Luke
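
The cost of the shorter resubmission length can be counted in job-steps. The sketch below assumes a 360-day model calendar, which is not stated in this ticket, so `DAYS_PER_YEAR` should be adjusted if the job uses a different calendar. The function name is illustrative.

```python
import math

DAYS_PER_YEAR = 360  # assumption: 360-day model calendar


def job_steps(total_days, days_per_step):
    """Number of job-steps (submissions) needed to cover the whole run."""
    return math.ceil(total_days / days_per_step)


total_days = 2 * DAYS_PER_YEAR
print(job_steps(total_days, 10))  # 72 job-steps at 10 days each
print(job_steps(total_days, 30))  # 24 job-steps at 30 days each
```

Each extra job-step also adds queueing time, which is part of why the 10-day pattern runs more slowly overall.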

comment:27 Changed 3 years ago by dilshadshawki

Hi Luke,

The main changes were in the STASH, but it's only outputting around 10 diagnostics. This isn't actually from the Cray release job; it was originally from an old IBM release job with modifications made to the emissions ancillary files.

I hope that helps.

Many thanks,
Dill

comment:28 Changed 3 years ago by dilshadshawki

Hi Luke,

The run only managed another 10 days, so it managed to reach 20 days of output. I checked the variables on the 20th day and they are all there. I also checked the job was set as a continuation run. Same error as before, walltime exceeded.

Could there be anything else that can be done?

Best,
Dill

comment:29 Changed 3 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed
  • UM Version changed from <select version> to 8.4

Closing now as issue continued in #2129
