Opened 7 years ago

Closed 7 years ago

#1139 closed help (fixed)

Errors when trying to run HadGEM2 AMIP job

Reported by: charlie
Owned by: grenville
Component: UM Model
Keywords:
Cc:
Platform: HECToR
UM Version: 6.6.3

Description

Hi,

I'm trying to run a HadGEM2 AMIP run (xiogc), which I copied from your standard job xgadd. I thought I had changed all the relevant things, including a new start dump (and corresponding start date), but something has clearly gone wrong as it fell over more or less straightaway.

I can see a number of errors in the .leave file, but as usual I'm not sure which one is causing the problem. I think it might be at the top, where I get an Error code 4 and Error message "INANCILA: integer header error", but I don't know what this means, if indeed that's the relevant problem.

Please can you help? I have attached the .leave file, for your information.

Many thanks,

Charlie

Attachments (12)

xiogc000.xiogc.d13260.t113419.leave (409.9 KB) - added by charlie 7 years ago.
.leave file for job xiogc
xiogc000.xiogc.d13263.t142604.leave (410.0 KB) - added by charlie 7 years ago.
xiogc000.xiogc.d13270.t111657.leave (410.9 KB) - added by charlie 7 years ago.
OK.png (3.5 KB) - added by grenville 7 years ago.
BAD.png (3.6 KB) - added by grenville 7 years ago.
xiogc000.xiogc.d13270.t165655.leave (400.1 KB) - added by charlie 7 years ago.
xiogc000.xiogc.d13277.t174451.leave (3.4 KB) - added by charlie 7 years ago.
xiogc000.xiogc.d13280.t110542.leave (3.3 KB) - added by charlie 7 years ago.
xiogc000.xiogc.d13287.t152607.leave (3.5 KB) - added by charlie 7 years ago.
xiogd000.xiogd.d13295.t163447.leave (27.6 KB) - added by charlie 7 years ago.
xioge000.xioge.d13295.t163533.leave (27.6 KB) - added by charlie 7 years ago.
xiogf000.xiogf.d13295.t163613.leave (27.9 KB) - added by charlie 7 years ago.


Change History (67)

Changed 7 years ago by charlie

.leave file for job xiogc

comment:1 Changed 7 years ago by grenville

Charlie

The error has arisen because the sst ancillary file has data on a different grid than the model. The ancillary data has 360x216 grid points whereas the model is expecting 192x145.

Grenville

comment:2 Changed 7 years ago by charlie

Grenville,

Very many thanks, I knew they were different but thought the model would recalculate from the start dump. Obviously not! Is it best to change the start dump to match the model, or is there somewhere in the UMUI that I can make the model match the start dump?

Charlie

comment:3 Changed 7 years ago by grenville

Charlie

The job you have is configured for N96 L38 resolution. I think the best thing is to regrid the ancillary fields so that they are on the same grid as the model.
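For example, a sketch using CDO (assuming it is available on your system; the input file name is the one from your /work area): the N96 grid is 192x145 points at 1.875 x 1.25 degree spacing, starting at (0E, 90S), so you can regrid the netCDF data before it goes through xancil:

cat > n96.grid <<EOF
gridtype = lonlat
xsize    = 192
ysize    = 145
xfirst   = 0.0
xinc     = 1.875
yfirst   = -90.0
yinc     = 1.25
EOF

# bilinear interpolation onto the grid described above
cdo remapbil,n96.grid \
    tos_day_HadGEM2-ES_historical_r1i1p1_19491201-20051130.nc \
    tos_day_HadGEM2-ES_historical_r1i1p1_19491201-20051130_regridded.nc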

Grenville

comment:4 Changed 7 years ago by charlie

Thanks very much Grenville. I've now regridded the SST and sea ice ancillary files so that they match the model, however it has fallen over again, this time generating an Error code 8 and Error message "INANCILA: real header error". Again, I'm afraid I don't know what this means, if indeed that's the relevant problem?

Please can you help? I have attached the latest .leave file, for your information.

Many thanks,

Charlie

Changed 7 years ago by charlie

comment:5 Changed 7 years ago by grenville

Charlie

The error comes about because the grid in the ancillary file isn't quite the same as the model's. The first and second groups of 6 numbers in the 12 below should match (each group is, in order: E-W spacing, N-S spacing, first latitude, first longitude, pole latitude, pole longitude):

1.875, 1.25, 90., 0.9375, 90., 0., 1.875, 1.25, -90., 0., 90., 0.

6977 INANCCTL: Error return from INANCILA 8

I can see from xconv that your ancillary file has values for first longitude and first latitude which are not consistent with the model grid. It appears that you have created an ancillary file for old dynamics (I have just noticed that xconv explicitly reports this fact in its bottom-right panel) - just change the version in xancil and get the grid parameters for the N96 grid from xconv.
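A quick way to see all the header differences at once - a sketch, assuming the UM cumf (compare UM files) utility is built on your system; the standard file name below is a placeholder:

cumf /work/n02/n02/hum/hg6.6.3/HG2AMIP_ancils/<standard_amip_sst_ancil> \
     /work/n02/n02/cjrw09/ancil/hydro.d/sst_h2_hist_day_1950-2005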

Grenville

comment:6 Changed 7 years ago by charlie

Dear Grenville,

Thanks very much, and apologies for that error. I've now corrected the ancillary file latitudes and longitudes, as well as changing the version to 6.6 in xancil.

I've now resubmitted my job (this morning) and now it appears to have run. At least, something has appeared! However, something is clearly wrong - in my directory /work/n02/n02/cjrw09/result which is where (I think) my output is going, I have the files below - they all have size, but when viewed in xconv they are all apparently empty.

-rw-r--r-- 1 cjrw09 n02 3.5M 2013-09-23 16:05 xiogca.paf0jan
-rw-r--r-- 1 cjrw09 n02 2.1M 2013-09-23 16:05 xiogca.pcf0jan
-rw-r--r-- 1 cjrw09 n02 2.5M 2013-09-23 16:05 xiogca.pdf0jan
-rw-r--r-- 1 cjrw09 n02 2.1M 2013-09-23 16:05 xiogca.pef0jan
-rw-r--r-- 1 cjrw09 n02 3.0M 2013-09-23 16:05 xiogca.pff01b0
-rw-r--r-- 1 cjrw09 n02 3.5M 2013-09-23 16:05 xiogca.pgf01b0
-rw-r--r-- 1 cjrw09 n02 3.5M 2013-09-23 16:05 xiogca.pif0jan
-rw-r--r-- 1 cjrw09 n02 2.1M 2013-09-23 16:05 xiogca.pjf0jan

I've clearly done something silly, perhaps in my stash, so apologies but could you advise?

Many thanks,

Charlie

comment:7 Changed 7 years ago by charlie

Dear Grenville,

Further to my message above… Having looked more closely at these files, I see that they all correspond to January 1950 (the first month of my run), apart from 2 corresponding to 11 January 1950. Do the .pa, .pc, .pd suffixes correspond to different data streams? But if so, why are they all apparently empty when viewed?

Hope to hear from you soon,

Many thanks,

Charlie

comment:8 Changed 7 years ago by grenville

Charlie

The problem is still with the ancillary data. It appears that your ancillary files have the flag set to say that the data is periodic in time - the files which work OK for the umui AMIP job say that the data is a time series, which it is. Could you try rebuilding the ancillary files, but where it asks if the SST ancillary data is periodic in time, choose the NO option, and do the same for the sea ice file. I am not sure this will solve the problem, but the route through replanca (the code which does the ancillary updating) takes different paths depending on whether the data is periodic or a time series.
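One way to check which flag a file carries - a sketch, assuming the UM pumf print utility is built on your system and writes its header listing to standard output, and going from UMDP F3 as I remember it (time indicator 1 = time series, 2 = periodic - worth double-checking):

pumf sst_h2_hist_day_1950-2005 | head -40    # look for the time indicator word in the fixed-length header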

Grenville

comment:9 Changed 7 years ago by charlie

Dear Grenville,

Thanks ever so much. I've now done as you suggested, rebuilding the ancillary file but saying not periodic in time. I submitted my job this morning, and it's already fallen over (hasn't even generated any model output this time). In my .leave (attached), I'm getting an error code 8 this time with message "INANCILA: REAL header Error" so presumably there is still something wrong with the ancillary files?

Sorry about this,

Charlie

Changed 7 years ago by charlie

comment:10 Changed 7 years ago by grenville

Charlie

Yes - the numbers below don't match:

Ancillary data file 6 , unit no 35 , SEA SURFACE TEMPERATURES

1.875, 1.25, 90., 0., 90., 0., 1.875, 1.25, -90., 0., 90., 0.

The attached file OK.png has the grid data for the AMIP sst file for the standard job; the file BAD.png has the data in your ancillary file. The values you have for first latitude and row spacing have sign errors (these are reflected in the mismatches shown above, taken from the leave file).

Grenville

Changed 7 years ago by grenville

Changed 7 years ago by grenville

comment:11 Changed 7 years ago by grenville

Ooops forgot to include the files - attached now.

Grenville

comment:12 Changed 7 years ago by charlie

Grenville,

Okay, understood, and apologies for the error - I'm particularly frustrated with myself about this, as every time I have built this ancillary file I thought I had checked it very very carefully. Clearly not.

So if I understand your message correctly, the problem is with the sign of the first latitude and row spacing. It should be -90.0 and 1.25 respectively, whereas mine is 90.0 and -1.25 respectively. I understand that.

What I don't understand is that my original regridded .nc file (at /work/n02/n02/cjrw09/ancil/hydro.d/tos_day_HadGEM2-ES_historical_r1i1p1_19491201-20051130_regridded.nc) is correct. My ancillary file (sst_h2_hist_day_1950-2005, at the same location) is not. So the sign is being reversed at some point during the xancil process, not before that. I've looked at xancil and there doesn't seem to be anywhere to specify these values, so how do I stop this from happening?

Charlie

comment:13 Changed 7 years ago by grenville

Charlie

Have you saved the xancil job - if so please point me to it.

Grenville

comment:14 Changed 7 years ago by charlie

Dear Grenville,

I think I'm going mad. I have just rerun xancil, just in case I did something stupid this morning, and now it's fine. My new ancillary file matches my .nc file, both of which have -90 and 1.25 respectively for the first latitude and row spacing.

Just in case I've still done something stupid - which seems quite likely at present - my jobs can be seen at /work/n02/n02/cjrw09/ancil/hydro.d/xancil_sst270913.job and /work/n02/n02/cjrw09/ancil/hydro.d/xancil_ice270913.job for SST and sea ice respectively.

I'll try resubmitting my job now…

Thanks a lot,

Charlie

comment:15 Changed 7 years ago by charlie

Dear Grenville,

Okay, my job has now run after rebuilding the ancillary file yesterday as you suggested, and that seems to have resolved that particular problem. It has fallen over again, however, this time generating an error code 439 with message "REPLANCA: Current time precedes start time of data" (.leave attached).

Presumably this is because my start date either doesn't match one of the ancillary files, or doesn't match my start dump? I'm not sure why, however, as my start date is 1 January 1950. My start dump matches this, and the SST/sea ice ancillary files I created begin in 1949. So sorry again, but I don't know what this problem means…

Charlie

Changed 7 years ago by charlie

comment:16 Changed 7 years ago by grenville

Charlie

The problem isn't with your new ancillary files. Because you changed the start date of the run to 1950, the sulphur dioxide emissions forcing files are now causing the error: that data starts in 1970. I haven't checked all the forcing data files, so there may be other data with dates which aren't appropriate for a 1950 run.

Grenville

comment:17 Changed 7 years ago by charlie

Dear Grenville,

Thanks very much. I've now gone through all the other ancillary files and have identified the ones that begin after my start date - for many of them it was obvious from the title, but I checked in xconv anyway just to be sure. All of the others appear to be either a single timeslice independent of year, or a 12 month seasonal cycle. The 5 problem files are:

BC_hi_HCA_1970_2010.N96
Bio_HCA_1970_2010.N96
dist_1700_2005
OCFF_HCA_1970_2010.N96
sulp_HCA_1970_2010.N96

Could you advise on how I might go about finding alternatives for these 5 that go back to 1950?

Many thanks,

Charlie

comment:18 Changed 7 years ago by charlie

Dear Grenville,

I have now resubmitted my job, starting earlier on 1 Jan 1970. It immediately fell over with the same error as before, but I realised that this is because the 4 ancillary files beginning in 1970 don't actually start until 16 Jan 1970 - so, same problem.

So I resubmitted it again starting on 1 Jan 1971, and have this time got the error below. I have attached the file just in case this is not the actual problem. Please can you advise again?

Many thanks,

Charlie

----

aprun: Apid 5924943: Caught signal Terminated, sending to application
/var/spool/PBS/mom_priv/jobs/1718812.sdb.SC[304]: .: line 277: 14572: Terminated
-ksh: line 1: 14247: Terminated
_pmiu_daemon(SIGCHLD): [NID 02831] [c1-0c2s7n1] [Sat Oct 5 12:03:56 2013] PE RANK 93 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 02866] [c1-0c2s6n2] [Sat Oct 5 12:03:56 2013] PE RANK 55 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 02502] [c3-0c2s3n2] [Sat Oct 5 12:03:56 2013] PE RANK 1 exit signal Terminated

Changed 7 years ago by charlie

comment:19 Changed 7 years ago by grenville

Charlie

The job appears to have run out of time - I don't see any output files, which is odd. I'd suggest running the job for a few days only and switching on some more output: navigate to sub-model independent→output choices and select Extra diagnostic messages, to check that the set-up is OK before trying to run for a long period. Reduce the time requested to run the job (say 3600 seconds) to get better turnaround.

Grenville

comment:20 Changed 7 years ago by charlie

Dear Grenville,

Many thanks. I did as you suggested this morning - I selected Extra diagnostic messages, and changed the job time limit (in sub-model independent > job resources resubmission…) to 3600 seconds. In the window after this, I also changed the job time limit for QSUB to 3600 seconds as well.

However, a couple of hours later, it has already fallen over. There are output files in work/result but they are again all empty. The error in the .leave file (attached) is identical, or at least very similar, to the previous one.

Sorry about this,

Charlie

Changed 7 years ago by charlie

comment:21 Changed 7 years ago by grenville

Charlie

The problem only happens when you include updating of the SSTs or ice. I am not sure what's happening - the size of the ancillary files is causing some problems - the model keeps running out of space for LOOKUP headers. I think it best to back up a little and approach the problem in small steps. Could you create ancillary files for a much shorter period and try running the model for a short time?

Grenville

comment:22 Changed 7 years ago by charlie

Dear Grenville,

Apologies for the delay.

I've now tried shortening my ancillary files so that they begin in 1971, which is my start date anyway. I have resubmitted my job, but it has again fallen over straightaway with the same error.

I can't really make my ancillary files any smaller, because we need to run for at least 30 years. I'm surprised it doesn't like this, because presumably AMIP runs use daily data starting in 1979, so my files aren't much larger.

Is there not some sort of parameter in the umui where I can increase the space for lookup headers? Or is this not possible?

Many thanks,

Charlie

comment:23 Changed 7 years ago by charlie

Dear Grenville,

Further to my last email… I have found the relevant (I think) window in the UMUI to specify the header dimensions, in Atmosphere > Ancillary and input data files > In file related options > Header record sizes. Currently, the maximum total number of lookup headers is set to 4000. Should I increase this, and if so to what? Are there any other implications of changing this?

Many thanks,

Charlie

comment:24 Changed 7 years ago by grenville

Charlie

The number to go there can be calculated as the number of fields x the number of levels x the number of update times, summed over all the ancillary files.
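As a rough worked example (illustrative numbers only): a single-level daily SST ancillary covering December 1949 to November 2005 on the 360-day calendar needs, on its own,

# 1 field x 1 level x (56 years x 360 days) update times
echo $(( 1 * 1 * 56 * 360 ))    # = 20160 lookup headers

which is already five times the 4000 default before any other ancillary file is counted.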

I am not 100% sure setting this will solve the problem. I was suggesting a short run to eliminate the possibility that the header record size was having any effect at all.

Grenville

comment:25 Changed 7 years ago by charlie

Dear Grenville,

Okay, I understand. I have now reduced my 2 ancillary files to one year only, i.e. 360 time slices, and have tried again running for 10 days only. It has fallen over again, generating exactly the same error.

Based on what you said in your last message, I'm a little confused as to how 4000 would ever be enough. Imagining for a second that I was running it using just the standard ancillary files, I have gone through all of the ancillary files in /work/n02/n02/hum/hg6.6.3/HG2AMIP_ancils and (assuming they are all used) have worked out that the total comes to 10430. So 4000 isn't even enough for this, let alone including my own (very large) SST and sea ice files. I can only assume that not all of these ancillary files are used in this job?

What should I do next?

Thanks,

Charlie

comment:26 Changed 7 years ago by grenville

Charlie

I came to the same conclusion yesterday. The model ran fine without your sst/ice updating - I changed the number of headers from 4000 to 50000 and it failed saying there wasn't enough space for the headers!

It turns out that there is a ludicrous hand edit (~umui/hadgem2/handedits/ancil_head.ed) which says if the number of headers is set to 4000 in the umui, then make it 500000 in the run.

This has come about because there is a limitation on the number of headers in the umui and instead of fixing the umui, the hand edit hack has been introduced.

To get your job running quickly, please take a copy of the hand edit and change the value to 2000000 (I ran your job with the 1970-2005 ancils with this number OK - I guessed the number of headers).
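Concretely, something like this (a sketch - the sed edit assumes 500000 appears only in that one place, so check the result by eye):

cp ~umui/hadgem2/handedits/ancil_head.ed ~/um/hand_edits/ancil_head.ed
chmod +x ~/um/hand_edits/ancil_head.ed      # hand edits must be executable
sed -i 's/500000/2000000/' ~/um/hand_edits/ancil_head.ed

Then point the job at the copy under Sub model independent > User hand edit files.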

Grenville

comment:27 Changed 7 years ago by charlie

Dear Grenville,

Very many thanks for finding this out.

I have done as you suggested, copying the file and putting it into ~charlie/um/hand_edits/ancil_head.ed and then changing the value from 500000 to 2000000. I then changed the relevant line in Sub model independent > User hand edit files, to point to this new file.

However, when I try to process the job, I get an error box saying "~charlie/um/hand_edits/ancil_head.ed is not an executable. For more information look at EXT_SCRIPT_LOG file"

What does this mean?

Charlie

comment:28 Changed 7 years ago by charlie

Dear Grenville,

Further to my last email… I think I have fixed this problem myself (amazingly!). Having looked at the EXT_SCRIPT_LOG, it said permission denied - so I opened up permissions on that file, processed again, and the error no longer appears. I have now resubmitted my job, to see if it runs…

Charlie

comment:29 Changed 7 years ago by charlie

Dear Grenville,

I don't understand this. My job has now run again, and has fallen over at the same point generating the same error, despite doing as you suggested with the hand edit. If you managed to get my job running, what have I done wrong?

Charlie

comment:30 Changed 7 years ago by grenville

Charlie

My job has the AMIP-II method of updating switched on - I simply inherited the setting from the standard job.

Grenville

comment:31 Changed 7 years ago by charlie

Dear Grenville,

Okay, thanks. I have now resubmitted my job, but this time with that method of updating switched on. I'll let you know whether or not it runs…

I am slightly uncertain, however, about what this actually does. It was my understanding, possibly incorrect, that I needed to have it switched off because I'm not using the standard SST ancillary file. Is that not right?

Charlie

comment:32 Changed 7 years ago by charlie

Dear Grenville,

Further to my last email… Success! It has now run successfully for a month, and stopped as expected.

I did have just one question about my output - I seem to have lots of different output files, which presumably correspond to the usage streams. I have checked some of them in my stash in the UMUI but am struggling to see which one corresponds to which output letter. All I'm really interested in is the daily output (which I think is .pa, such as xiogca.pah1jan) and the monthly means (which I think is .pm, such as xiogca.pmh1jan) - is that right? It's not a problem having the others and I can just delete them later; I just don't want to fill up HECToR with files I don't need (given that I ultimately want to do a 30 year run).

However, a much more urgent problem is resubmitting. I have gone into the SUBMIT file and changed TYPE=NRUN to TYPE=CRUN, as well as STEP=2 to STEP=4, but when I resubmitted I got the following error:

"You have selected a compilation step and a continuation run CRUN.
This is not allowed. Please modify your UMUI settings.
For quick fix set RCF_NEW_EXEC to false in SUBMIT file"

So, hoping to resolve this without bothering you, I did as it suggested and changed the relevant line (line 108) to false.

This generated a different error: FCM_MAIN: Submit failed.

So I've clearly done something very wrong here, but have no idea what. Please can you help?

Thanks a lot,

Charlie

comment:33 Changed 7 years ago by charlie

Dear Grenville,

And it was all going so well!

I resubmitted my job straight after talking to you, changing it as suggested so that it ran from the existing executable for both compilation and reconfiguration, and then manually changing NRUN to CRUN.
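For reference, the manual switch is just a one-line edit of the processed SUBMIT file (roughly this - the exact form of the line may differ):

sed -i 's/TYPE=NRUN/TYPE=CRUN/' SUBMIT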

It ran successfully for another month, and then stopped. My job was set to run for 3 months, but I only have valid output data for January and February. The output files for March have been written out, but they are empty.

I have found what I guess is the relevant error in the .leave file (attached):

Dumpfile Size for File xiogca.dah12b0 on Unit 22 to be set to 230348800 Words…

So is this saying my restart dump is too big? How can I resolve this?

Apologies once again about this,

Charlie

Changed 7 years ago by charlie

comment:34 Changed 7 years ago by grenville

Charlie

Your job is set to run for 3 months with automatic resubmission on, but with the target run length for each job in the sequence set to 1 year, so it might be getting confused.

I copied your job, set the run length to 1 year with each job in the sequence running for 1 month (these fit in the 1hr queue) and it ran OK (please see /work/n02/n02/grenvill/result for job xiqdc).

Grenville

comment:35 Changed 7 years ago by charlie

Grenville,

Very many thanks, I understand.

As I may have said to you before, ultimately I want to run for 30 years. Should I still keep it as one month for each job in the sequence, or is this not appropriate for a run of this length?

Charlie

comment:36 Changed 7 years ago by grenville

Charlie

It's a bit difficult to say which queues will give the fastest turnaround - I'd experiment a little.

Grenville

comment:37 Changed 7 years ago by charlie

Many thanks, will do…

Charlie

comment:38 Changed 7 years ago by charlie

Dear Grenville,

I'm really sorry about this, but I have one more urgent question about my run - a test I did yesterday has worked, but I really don't understand why it has stopped where it has.

I submitted my job yesterday with a run length of 1 year, using a job time limit (and job time limit for QSUB) of 21600 seconds, i.e. 6 hours, and specifying the target run length for each job in sequence as 7 months. I did this because, looking at a previous year's worth (in your directory), I saw that it was writing out each monthly file every 48 minutes, so it would be able to do 7 months in 6 hours. I was advised that using the 6 hour queue would give better turnaround.

I expected it to run for one month then stop; I would then need to change NRUN to CRUN in SUBMIT before submitting it again, and it would then run for the rest of the year.

Instead, it appears to have run for 7 months straightaway, without changing NRUN to CRUN, then it has stopped.

I have clearly misunderstood something here, so what have I done wrong?

Charlie

comment:39 Changed 7 years ago by grenville

Charlie

The job runs in chunks of the run length of the automatic resubmission sequence. So your NRUN will run for 7 months, and subsequent CRUNs will run for 7 months each.

Grenville

comment:40 Changed 7 years ago by charlie

Dear Grenville,

Very many apologies, another problem.

My 30 year run is now running, no problem. I now want to run 3 more jobs, identical to the first but with different start dates (to give myself an eventual total of 4 ensemble members, each 30 years long).

My original run, xiogc, begins in January 1971, using a start dump at /work/n02/n02/cjrw09/dumps/xhgzha.dah1110. So I have copied this job 3 times, getting 3 new job IDs xiogd, xioge and xiogf. In each, I have changed the start date to January 1972, 1973 and 1974 respectively, using start dumps xhgzha.dah2110, xhgzha.dah3110 and xhgzha.dah4110 all at the above location.

However, upon trying to submit these, none have worked. The 2nd and 3rd runs didn't even make it as far as the queueing stage. The 4th run began queueing, but has now disappeared.

I guess I've done something stupid, again, but I'm afraid I don't know what, sorry. Nothing else has been changed between the jobs, and all of the runs begin and end within the time limits of all the ancillary files. What have I done wrong? I have attached the 3 new .leave files.

Charlie

Changed 7 years ago by charlie

Changed 7 years ago by charlie

Changed 7 years ago by charlie

comment:41 Changed 7 years ago by grenville

Charlie

jobs xiogd,e,f say use an existing executable but then don't point to one. Point the jobs to the xiogc executable - you may need to create /work/n02/n02/…/xiogd,e,f directories - I think that was needed before.
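Something like this should do it (a sketch - the username is taken from paths quoted earlier in this ticket, and the executable path is inferred from the usual job layout, so adjust as needed):

mkdir -p /work/n02/n02/cjrw09/xiogd /work/n02/n02/cjrw09/xioge /work/n02/n02/cjrw09/xiogf
# then point each job's 'use existing executable' entry at something like
#   /work/n02/n02/cjrw09/xiogc/bin/xiogc.exe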

Grenville

comment:42 Changed 7 years ago by charlie

Yes of course, sorry Grenville, I should have realised that. I completely understand the error.

Upon checking my run this morning (the one that has been running successfully for the last few days) I find it has stopped. The relevant (I think) error in the .leave file (attached) is a little worrying:

BUFFOUT: Write Failed: Disk quota exceeded

*
UM ERROR (Model aborting) :
Routine generating error: UM_WRITDUMP
Error code: 400
Error message:

Failure writing out field
*

Does this mean what I think it does? In which case, how do I resolve this problem - it has only got as far as 5 years into a 30 year run?

Charlie

comment:43 Changed 7 years ago by charlie

… Further to my last message, I can't attach the .leave file as it's too big. It can be found at /home/n02/n02/cjrw09/um/umui_out/xiogc007.xiogc.d13296.t203058.leave if it helps?

comment:44 Changed 7 years ago by grenville

Charlie

I have increased your disc quota but space on HECToR is very limited. Please move data to the RDF as soon as possible.
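Something along these lines (a sketch - the /nerc destination directory here is hypothetical, and this assumes the RDF mount is visible from where you run it):

rsync -av /work/n02/n02/cjrw09/result/ /nerc/n02/n02/cjrw09/xiogc_results/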

Grenville

comment:45 Changed 7 years ago by charlie

Dear Grenville,

Sorry about this, but I have another problem with one of my runs - and it was all going so well!

I am running 4 jobs at once, and they are all approximately halfway through a 30 year run. It has fallen over a couple of times because of a lack of space, but I have been transferring the data every day and have just restarted it each time once there's more room.

However, this time, upon resubmitting my 4 jobs, 3 have begun to run but the 4th hasn't. It has given me the following error in the .leave file (attached), but I don't know what it means.

Can you help again?

Charlie

BUFFIN: Read Failed: No such file or directory

*
UM ERROR (Model aborting) :
Routine generating error: Atm_Step
Error code: 2
Error message:

Failed to gather field

comment:46 Changed 7 years ago by charlie

Further to my last message… I can't attach the .leave file because it's too large, but you can see it at /home/n02/n02/cjrw09/um/umui_out/xiogf000.xiogf.d13308.t141659.leave if needed.

Charlie

comment:47 Changed 7 years ago by grenville

Charlie

I have run a test job with automatic archiving - it seemed to go OK. Please look at job xiqdc. You'll need to include the branch fcm:um_br/dev/jeff/HG6.6.3_hector_monsoon_archiving/src and the hand edit ~jeff/umui_jobs/hand_edits/archiving_6.6.3. Change the archiving destination to your preferred /nerc directory and set the archiving flags for the streams you want. You'll also need to get an lms account (through SAFE) and set up ssh keys (please see http://cms.ncas.ac.uk/wiki/Hector/NercArchiving - this refers to UM 7.3, but applies equally to 6.6.3 with the appropriate substitutions for version).

Grenville

comment:48 Changed 7 years ago by grenville

Charlie

Your model went wrong at time step 22409 when it failed to converge. Subsequent errors (the COEX error) stem from this.

It's not clear why the model blew up. You could try reconfiguring its most recent start file and running on from that - this might be enough to smooth out some numerical problems.

Grenville

comment:49 Changed 7 years ago by charlie

Dear Grenville,

Thanks very much for both your messages.

About the automatic archiving: I'm assuming this is something I can't do mid-run? So that will have to wait until my next run (once the current ones are finished). But as soon as they are, I will do as you suggested for this.

About the more immediate problem, with my run failing: sorry to ask the simple question, but how do I reconfigure the most recent start file? Is it just a case of finding the last available restart dump, putting that as the start date in the UMUI, reprocessing, changing NRUN to CRUN, and then submitting? Or is there something else involved?

Charlie

comment:50 Changed 7 years ago by annette

  • Owner changed from um_support to grenville
  • Status changed from new to assigned

comment:51 Changed 7 years ago by charlie

Dear Grenville,

Sorry to bother you again, but my latest job has again failed almost at the end - in fact, one month away from the end of its run. I can see the error below in the .leave file, but I'm not sure if this is the right one or indeed what it means. The rest of my .leave file is too large to attach here, but can be seen at /home/n02/n02/cjrw09/um/umui_out/xiogf029.xiogf.d13322.t210617.leave

Please can you help?

Many thanks,

Charlie

*
UM Executable : /work/n02/n02/cjrw09/xiogf/bin/xiogf.exe
*

apsched: the confirmed user ID is different from this claim's user ID
xiogf: Run failed
*

Ending script : qsmaster
Completion code : 1
Completion time : Mon Nov 18 21:09:12 GMT 2013

*

/work/n02/n02/cjrw09/xiogf/bin/qsmaster: Failed in qsmaster in model xiogf
*

Starting script : qsfinal
Starting time : Mon Nov 18 21:09:12 GMT 2013

*

/work/n02/n02/cjrw09/xiogf/bin/qsfinal: Error in exit processing after model run
Failed in model executable

comment:52 Changed 7 years ago by grenville

Charlie

This is the error

apsched: the confirmed user ID is different from this claim's user ID

This is a known HECToR problem with the scheduler - the solution is simply to resubmit.

Grenville

comment:53 Changed 7 years ago by charlie

Dear Grenville,

Sorry to bother you again, but I'm having trouble with my next run (all the others, i.e. all 4 ensemble members, worked absolutely fine and finished properly).

The only difference with this new run is that I have replaced the standard soil moisture/snow depth ancillary file with the output soil moisture/snow depth from my last run. So I have created a new ancillary file, containing daily SM, and am using this to force the model. I'm fairly sure the ancillary file is correct (having checked for all the errors I made last time), and I have made sure it is updated every day.

The model has run successfully for the 1st month, has written out the first start dump of the 2nd month, and has then fallen over. I have looked at the .leave file but can't see any obvious error - the file (which is too large to attach here) can be seen at /home/n02/n02/cjrw09/um/umui_out/xjfja000.xjfja.d13330.t135505.leave

Can you help?

Thanks a lot,

Charlie

comment:54 Changed 7 years ago by grenville

Charlie

The job is set up to run for 3 months in 1 month 'chunks'. The first month ran OK - you need to resubmit as a CRUN to get the next 2 months (switch on the hand edit).

Grenville

comment:55 Changed 7 years ago by grenville

  • Resolution set to fixed
  • Status changed from assigned to closed