Opened 5 years ago

Closed 4 years ago

#1414 closed help (fixed)

HadGEM2 failure

Reported by: charlie
Owned by: annette
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 6.6.3

Description

My job that we discussed a few weeks ago has been running, but it has got to a certain point (February 1978, to be exact) and won't go any further. I have tried restarting from several of my start dumps, e.g. 1 February 1978 and 1 January 1978, but it always gets to the same point and then stops.

I don't understand what's gone wrong. I have looked at the .leave file (attached) and although I can see errors, I don't know which one is the important one.

One of the reasons I don't understand this problem is that I am currently running 4 jobs at once: xkmna-d. These correspond to 4 ensemble members of the same job, so they are absolutely identical apart from the initial start file used - they all start in 1971, but xkmna reconfigures the 1971 start file, xkmnb the 1972 start file, etc. The problem is only occurring with xkmna - all of the other jobs have got past February 1978 and are running fine. So given that they are all identical, why are the others working but not xkmna?

Further to my last message, I've just discovered my attachment is too large. You can find it at /home/n02/n02/cjrw09/um/umui_out/xkmna000.xkmna.d14335.t170822.leave

Thanks a lot,

Charlie

Change History (22)

comment:1 Changed 5 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to accepted

Hi Charlie,

There's no obvious error message from the UM. Since your other jobs are OK, it looks like the problem might be numerical. There are some things you can do to help diagnose the issue:

  • In Sub-model independent → Script inserts and modifications add the environment variable ATP_ENABLED with value 1. This should provide a trace-back when the model crashes.
  • In Sub-model independent → Output choices switch on "Extra diagnostic messages".
  • In Atmosphere → Control → Post-processing, Dumping and Meaning switch on "Irregular dumps", select "Next", then edit the table to create dumps for a few of the time-steps right before the model crashes.
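Once ATP and the extra diagnostics are switched on, the .leave file can be scanned for likely failure lines rather than read end to end. A minimal sketch - the sample file below is made up for illustration; point `leave` at your real umui_out file instead:

```shell
# Create a tiny stand-in .leave file (made-up contents, for illustration).
leave=$(mktemp)
cat > "$leave" <<'EOF'
ATMOSPHERE STEP 123
ERROR detected in routine STWORK: stop model
normal output line
EOF

# Scan for likely failure indicators: UM error strings, ATP tracebacks, NaNs.
grep -n -i -E "error|traceback|nan" "$leave"

rm -f "$leave"
```

The line numbers printed by `grep -n` make it easy to jump to the first real error, which is usually the one that matters.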

Regards,

Annette

comment:2 Changed 5 years ago by charlie

Hello Annette,

Sorry for the delay. I have now resubmitted my job (from January 1978 again), and it has again fallen over at exactly the same place (end of February 1978). I followed your instructions with this submission - the only one I couldn't do was setting up irregular dumps (your 3rd point), because this created loads of conflicts with my STASH. I did the others, however.

My latest output, with extra diagnostic messages, is at /home/n02/n02/cjrw09/um/umui_out/xkmna000.xkmna.d14343.t112932.leave

Does this help?

Charlie

comment:3 Changed 5 years ago by annette

Hi Charlie,

I took a copy of your job and started from the last dump created - 21 Feb 1978. It still fails in the same place but this speeds up the run. I then switched off all STASH items to get the irregular dumps to work.

I couldn't see anything obviously wrong with the dumps or the diagnostic prints. Often with this kind of numerical instability, NaNs start appearing in the data.

I was however able to get past the crash by changing the time stepping from 48/day to 72/day (see my job xhnkk). You may be able to change the time stepping back once the model has settled down.

Other things that can work in this situation are using a different start file or perturbing your start file (reconfiguring didn't perturb things enough to avoid the crash).

Annette

comment:4 Changed 5 years ago by charlie

Many thanks, and sorry for the delay.

I've now changed the time stepping to 72/day as you suggested, and have submitted the job, so let's hope that gets it past the crash.

Just one question: what impact, if any, will this have on the output?

Charlie

comment:5 Changed 5 years ago by charlie

Dear Annette,

Hope you had a great Christmas and good start to the New Year.

Sorry it's taken me so long to get back to you about this - however, I have now returned to trying to get this job to work. I did as you suggested, changing the time-stepping from 48 to 72 timesteps per day, and resubmitted my job - but this time it fell over more or less straightaway, without even reaching February 1978. Would you mind taking a look at my latest .leave file, at /home/n02/n02/cjrw09/um/umui_out/xkmna000.xkmna.d15010.t113651.leave

Thanks,

Charlie

comment:6 Changed 5 years ago by annette

Charlie,

There is an error very near the start of the .leave file:

 Error code:  4
 Error message: 
STWORK   : NO. OF FIELDS EXCEEDS RESERVED HEADERS                               

Take a look at your STASH and check whether anything is being output every timestep unnecessarily, or, for example, whether any fields are requested in units of time-steps rather than model days.

Annette

comment:7 Changed 5 years ago by charlie

Annette,

Sorry for the delay in getting back to you - I ran out of AUs last week, so it's taken me a while to get some more. Anyway, I now have some.

I have checked my STASH, but I'm not really sure what I'm looking for. Moreover, I have 624 fields in my STASH, so is it a case of having to go through each and every one, or does the error tell you which one is causing the problem? The vast majority of my STASH fields are output per day, not per timestep.

Charlie

comment:8 Changed 5 years ago by annette

Charlie,

Can you verify that you still get the error, because when I run a copy of your job it seems to work fine.

Annette

comment:9 Changed 5 years ago by charlie

Yes, I'm afraid so. I have just tried running my job again, starting in January 1978 with 72 timesteps/day, and it ran for about 10 minutes before failing and generating the following output file: /home/n02/n02/cjrw09/um/umui_out/xkmna000.xkmna.d15027.t111322.leave

So I don't understand why it runs for you.

Given that my run starts in 1971, and it seems unable to get past this point in 1978 (i.e. it hasn't done much of my 30-year run), would it be worth me just ditching what's been done already and starting the run from scratch? As I said before, I have several other ensemble members of this run - they are identical apart from using a different start date - and they have all completed successfully.

Charlie

comment:10 Changed 5 years ago by annette

Charlie,

Sorry for the delay in looking at this.

It looks like you have had a tidy up of your home space, as I can't find the leave file you mentioned. If you see the error again, I can have a look at it.

I'm not sure whether you'd be better off restarting or not (without knowing what your recent problem was). So it's really up to you what you want to do.

Annette

comment:11 Changed 5 years ago by charlie

Oh dear, this is getting worse! I don't know what I've done wrong this time.

I thought I would start the run again, as I mentioned. So I deleted all the relevant directories, executables, etc and began from scratch. It has built the executable okay, but now doesn't reconfigure and isn't producing the xkmna.astart file. It did this fine the first time round, and nothing has changed. I'm pretty sure I have followed my own instructions (based on yours) correctly, but it's not working. I have the following error in my .leave file:

ERROR!!! in reconfiguration in routine Rcf_Files_Init
Error Code:- 10
Error Message:- Failed to Open Start Dump
Error generated from processor 0

Is this relevant, and if so what does it mean? The rest of my .leave file is at /home/n02/n02/cjrw09/um/umui_out/xkmna000.xkmna.d15035.t142846.leave

Charlie

comment:12 Changed 5 years ago by annette

Hi Charlie,

This error message means that the start file you are trying to use doesn't exist:

/work/n02/n02/cjrw09/dumps/xhgzha.dah110

So you need to change the job to point to the correct file and location.
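A quick existence check before resubmitting catches this kind of slip. A minimal sketch, using the path from this job (substitute your own start dump on any other setup):

```shell
# Path taken from the failing job; substitute your own start dump.
dump=/work/n02/n02/cjrw09/dumps/xhgzha.dah110

if [ -f "$dump" ]; then
    echo "start dump found: $dump"
else
    echo "start dump missing: $dump - fix the path in the job before resubmitting"
fi
```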

I don't think you needed to delete the start file and executable in order to redo this run. I think we managed to get the start file set up OK previously.

Also in future, I'd suggest you create a new job in this situation, so you can keep track of what you've done.

Annette

comment:13 Changed 4 years ago by charlie

Annette,

Very many apologies for the delay in getting back to you.

Yes, sorry, schoolboy error - I had missed a character when referring to the start dump.

Anyway… I have now rerun the job, recreating everything from scratch. Unbelievably, it has again fallen over at exactly the same point - February 1978.

This is ridiculous! Why will xkmna not get past that point? As I said to you before, all of my other jobs in this family (i.e. xkmnb, xkmnc and xkmnd) are identical copies of xkmna, with the only difference being the original start dump. All of the other 3 get past this point and have completed successfully. I'm fairly sure the original start dump is okay, because I have used it many times before and the run has got past this point.

Please can you help?

Charlie

comment:14 Changed 4 years ago by annette

Charlie,

This is the same error that you had originally, and that we solved by changing the time step from 48/day to 72/day - see comment:3.

Annette

comment:15 Changed 4 years ago by charlie

Annette,

No, I never managed to resolve the problem by changing the timestep to 72/day, as this created conflicts with my STASH (see comments 6-8). As I said, I have 624 fields in my STASH, so do I really need to go through each and every one?!

You said later (comment:8) that you managed to run my job this way. Could you tell me exactly what steps you took to get my job running with 72 timesteps/day, so that it gets past February 1978?

Charlie

comment:16 Changed 4 years ago by annette

Charlie,

The job I ran was xhnkl. Taking a difference with your job xkmna, the only substantive change is to the timestep. The other differences - number of processors, run time, and a hand-edit - were to get the job into the debug queue.

I didn't see any STASH errors, and it ran out to 820 timesteps before running out of time (as it was in the 20 mins queue). I wonder if the STASH errors were to do with the debugging options we had in before (perhaps the irregular dumping?).

Can you try running with 72 timesteps/day again and see what happens? You seem to have had another error in comment:9, but I didn't get a look at that leave file.

Best wishes,
Annette

comment:17 Changed 4 years ago by charlie

Annette,

Sorry for the delay, my job has been stuck in the queue for the last week!

Anyway, it finally ran last night - starting in January 1978, with 72 timesteps/day instead of 48, as you suggested. It fell over after just 20 minutes.

My latest output is at /home/n02/n02/cjrw09/um/umui_out/xkmna000.xkmna.d15063.t111922.leave

Would you mind taking a look?

Charlie

comment:18 Changed 4 years ago by annette

Charlie,

Ah, this is the STASH error you were having!

ERROR detected in routine STWORK: stop model
: No. of output fields (= 5001 ) exceeds no. of reserved PP headers for unit  63
STASH    : Error processing diagnostic section  30 , item  404 , code  4
  STWORK   : NO. OF FIELDS EXCEEDS RESERVED HEADERS                               

The error message indicates that the problem is too many fields being written to unit 63, which corresponds to usage profile UPD3h1Y in your STASH. The reason this fails when the number of timesteps per day increases is that 2 of these fields are being output every timestep: (30,403) and (30,404).

There are ways to increase the STASH headers to output more fields, but I would first verify that you actually want to output at this frequency.
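For intuition, a back-of-envelope sketch of why more timesteps per day can blow past the header allocation. All numbers below are illustrative assumptions, not taken from the actual job (360-day calendar, ~30-day reinitialisation period, a hypothetical limit of 4096 headers):

```python
# Rough count of fields a per-timestep diagnostic writes to one output
# unit between file reinitialisations. All numbers are illustrative
# assumptions, not values from the real job.
def fields_written(steps_per_day, per_step_fields, days):
    return steps_per_day * per_step_fields * days

RESERVED_HEADERS = 4096  # hypothetical limit for the unit

for steps_per_day in (48, 72):
    n = fields_written(steps_per_day, 2, 30)  # 2 fields: (30,403), (30,404)
    print(f"{steps_per_day}/day -> {n} fields, exceeds limit: {n > RESERVED_HEADERS}")
```

The same two per-timestep fields that fit under the allocation at 48 steps/day can exceed it at 72 steps/day, which is why the error only appeared after the timestep change.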

Regards,
Annette

comment:19 Changed 4 years ago by charlie

Okay, understood.

Looking at my stash, 30,403 and 30,404 correspond to TOTAL COLUMN DRY MASS RHO GRID and TOTAL COLUMN WET MASS RHO GRID, respectively.

I can confirm that I don't need these, at this frequency - in fact, to be honest, I have no idea what they are.

Is it just a case of removing these 2, then resubmitting?

Charlie

comment:20 Changed 4 years ago by annette

Charlie,

Yes I think that should do the trick.

For the future, though: if you have concerns about the volume of data you are producing, you might want to look through the other STASH items to see if there are fields you don't need. Wading through hundreds of STASH items is a pain, but you can group them by Time Profiles.

Hope this works now.

Annette

comment:21 Changed 4 years ago by charlie

Okay, I've now removed those 2 items and have resubmitted my job. No doubt it will be at least 48 hours in the queue (judging by the last week), but I'll let you know as soon as anything happens.

Yes, I fully realise my stash is much larger than I need - I simply copied it from one of the standard jobs. I have 622 items, so going through them all will take forever! Plus, I'm reluctant to delete lots, in case I do need them after all!

Thanks a lot,

Charlie

comment:22 Changed 4 years ago by annette

  • Resolution set to fixed
  • Status changed from accepted to closed

I'm closing this ticket as there hasn't been any activity for 2 months. Please re-open or create a new ticket if you have any further issues.

Annette
