Opened 4 months ago

Closed 3 months ago

#2252 closed help (completed)

Euro4km run running out of walltime xnotn

Reported by: sam89 Owned by: willie
Priority: highest Component: UM Model
Keywords: MPI Cc:
Platform: Monsoon2 UM Version: 8.2

Description

Hi Willie

My Euro 4km job seems to be running fine except that it runs out of walltime. It currently runs for 3 hrs 58 min, and I don't know how to stop it running out of walltime, because if I increase the time limit any further it says it cannot run at all.
I checked that the start dump, the start time of the run and the LBC file all start at the same time, and they do. I tried changing the number of timesteps, but it complained that the number of timesteps was not an integral number for the length of the run (12 hours). I chose a timestep that was still divisible by the period (1 day), though, so I am unsure why it does not like it. It seems to output a startdump at 18 UTC, which is what I require, and it seems OK when I open it, but I don't trust it not to be corrupt in some way, since the run says it fails because it runs out of walltime.
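
The integrality complaint can be illustrated with simple arithmetic. This is only a sketch, not the UM's own check (the ticket does not show it): it assumes the timestep is given as a number of steps per period, and the function names are hypothetical. The value 864 is chosen to match the 432 steps for 12 hours quoted later in this thread; 865 is a made-up counter-example.

```python
from fractions import Fraction


def steps_in_run(steps_per_period: int, period_hours: int, run_hours: int) -> Fraction:
    """Exact number of timesteps that fall inside the run."""
    return Fraction(steps_per_period * run_hours, period_hours)


def is_integral(steps_per_period: int, period_hours: int, run_hours: int) -> bool:
    """True when the run length contains a whole number of timesteps."""
    return steps_in_run(steps_per_period, period_hours, run_hours).denominator == 1


# 1-day period, 12-hour run
print(steps_in_run(864, 24, 12))  # 432 whole steps
print(is_integral(864, 24, 12))   # True
print(is_integral(865, 24, 12))   # False: 432.5 steps in 12 hours
```

So a step count that divides the 1-day period evenly can still give a fractional number of steps over a 12-hour run.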

Is there any way to get it to run OK? I had this issue with my Euro 4km run previously, and setting the job time limit to 14250 fixed it, but this time it doesn't, even though I am trying to run it for 12 hours this time instead of 24.
xnotn000.xnotn.d17232.t072636.leave
xnotn000.xnotn.d17231.t201746.leave

This run, however, also uses the startdump created in xnojc, which I talk about in http://cms.ncas.ac.uk/ticket/2246#comment:7, as I am not sure if there is an issue with that dump… As I said, though, I think this Euro 4km run is working fine; it's just that it is running out of walltime (I could be wrong though).

Thanks

Sam

Change History (8)

comment:1 Changed 4 months ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Sam,

If there are no errors, you don't need to change the time steps. You just need to chop up the job into shorter runs that can complete in the queue time limit. Then do an NRUN followed by a CRUN. See http://cms.ncas.ac.uk/wiki/Docs/AutomaticResubmission.

However, you should focus on the issues in #2246 before worrying about this job.

Regards
Willie

comment:2 Changed 4 months ago by sam89

Hi Willie

Yes I figured I should fix the others first before worrying about this.

Thanks

Sam

comment:3 Changed 3 months ago by sam89

Hi Willie

Now that I have the chain working (without the N216 job running yet), is there a reason why the Euro 4km job is running out of walltime even though it is only running for 12 hours? I previously had this job running fine for 24 hours (using a different .astart file), so I figured it must be an issue with the .astart file created from this chain, but it is running; it just runs out of time. It does create the dump that I need at 18 hrs, so it must create it before running out of time, but I don't trust that I can use this dump for what I need since the job doesn't run properly.
Getting this Euro 4km to run is the last stage in the chain as I need to calculate the difference between the dump at 18 hrs from this Euro 4km run and the Global run and then create an IAU file from this difference and add it back to the Global run and run it for 5 days…so if the dump from the Euro4km job is corrupt it will not be good…hope this makes sense.

Sam

comment:4 Changed 3 months ago by sam89

I forgot to say, I know you said to do a CRUN, but I don't see why I should need to if I am only running it for 12 hours and I previously had the job running for 24 hours (it did take the maximum walltime when running for 24 hours, but that should mean it can run for 12 hours, I would assume).

comment:5 Changed 3 months ago by willie

Hi Sam,

I've had a look at xnotn. I think the problem here is lack of processing power. The Euro4 domain is huge, 1100x1000 points, and the original IBM job (xitqb) had 1000 processors and SMT and OpenMP switched on. You should not use fewer than this. I think you should change to 36EWx30NS and switch SMT and OpenMP back on. You may have to rebuild the executable. Compare with the original settings, and remember that Monsoon has 36 processors per node.
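
As a quick sanity check of the suggested decomposition, the sketch below multiplies out the east-west and north-south processor counts and converts the total to whole nodes. It uses only the numbers quoted above (36EWx30NS, 36 processors per node, 1000 processors in the original job); the function name is hypothetical, SMT and OpenMP threads are ignored, and nothing here queries the machine.

```python
import math


def decomposition_summary(ew: int, ns: int, procs_per_node: int):
    """Return (total processors, nodes needed, processors left idle)."""
    total = ew * ns
    nodes = math.ceil(total / procs_per_node)
    idle = nodes * procs_per_node - total
    return total, nodes, idle


total, nodes, idle = decomposition_summary(36, 30, 36)
print(total, nodes, idle)  # 1080 30 0
print(total >= 1000)       # True: at least as many as the original job
```

36x30 gives 1080 processors, which fills 30 nodes exactly and is not fewer than the 1000 used by the original job.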

The time limit of the 'normal' queue is 4 hours, but you should not go close to this - set it to 3 hours. To get started, just reduce the run length to one hour and see how long it takes. This should easily complete within the 3 hours. Then you can scale up to the full run length later and maybe do CRUNs if it proves necessary.
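
Once the one-hour test has been timed, the scaling-up step can be estimated roughly as below. This is a sketch under stated assumptions: wall time is taken to grow linearly with model hours, start-up and dump-writing overheads are ignored, the function name is hypothetical, and the 20 minutes per model hour is an invented figure, not a measurement from this job.

```python
import math


def plan_chunks(wall_min_per_model_hour: float, run_hours: int, wall_limit_min: float):
    """Return (model hours per job, number of jobs) that fit the wall-time limit."""
    hours_per_job = int(wall_limit_min // wall_min_per_model_hour)
    if hours_per_job < 1:
        raise ValueError("one model hour does not fit in the wall-time limit")
    hours_per_job = min(hours_per_job, run_hours)
    jobs = math.ceil(run_hours / hours_per_job)
    return hours_per_job, jobs


# Invented timing: 20 wall-clock minutes per model hour, 12-hour run, 3-hour limit
print(plan_chunks(20.0, 12, 180.0))  # (9, 2)
```

A result of more than one job means the run would be one NRUN followed by CRUNs; a result of one job means no CRUN is needed.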

The standard jobs like Euro4, when configured for a given computer, should just run. You shouldn't need to change the number of timesteps or any other parameter to get them to run.

Regards
Willie

comment:6 Changed 3 months ago by sam89

Hi Willie

I tried several times to get this to run over the weekend, for 12, 6 and 1 hours, with a 3 hr 20 min time limit, 36 x 30 processors, and SMT and OpenMP turned on, and it still does not even run for an hour without running out of walltime.
This is the most recent file, where I tried to run it for an hour:
xnotn000.xnotn.d17246.t143300.leave

Is there any other reason it may not even run for an hour?

Thanks

Sam

comment:7 Changed 3 months ago by willie

Hi Sam,

I took a copy of your xnota/xnotn and compiled for "high" optimisation instead of debug. I was able to run 432 time steps (12hr) in about 8 minutes. However, despite all the processes exiting, the job does run on until the wall time is exhausted. I have not been able to find out why - it is not due to STASH and it is not due to IO servers, so I suspect it is some sort of MPI problem (see my UMUI experiment xnpt).

But the good news is that your job is running properly apart from this. Compile the executable for "high" optimisation. When you do the actual run, you just need to check that the pe0 file has completed the desired number of time steps, and that all the PEs have exited - it's the last line in the pe file. If that's the case then you're good to go. You should then delete the still-running job from the queue using the qdel command, something like

qdel 5855614.xcs00

(so the walltime should be set quite short)
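
The "all the PEs have exited" check can be scripted along these lines before running qdel. This is only a sketch: the ticket does not show the pe file names or the wording of the exit line, so the directory, the glob pattern and the marker text are all placeholders to be supplied by the user, and both function names are hypothetical.

```python
from pathlib import Path


def last_nonempty_line(path: Path) -> str:
    """Return the last non-blank line of a text file, or '' if there is none."""
    lines = path.read_text(errors="replace").splitlines()
    for line in reversed(lines):
        if line.strip():
            return line.strip()
    return ""


def unfinished_pes(pe_dir: str, glob_pattern: str, exit_marker: str) -> list:
    """Names of pe output files whose last line does not contain exit_marker."""
    return sorted(
        p.name
        for p in Path(pe_dir).glob(glob_pattern)
        if exit_marker not in last_nonempty_line(p)
    )


# Placeholder usage: substitute the real directory, pattern and exit message.
# unfinished_pes("/path/to/pe/output", "*pe*", "<exit message>")
```

An empty list means every pe file ends with the exit message; the time step count in the pe0 file still needs to be checked separately.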

Regards
Willie

comment:8 Changed 3 months ago by willie

  • Keywords MPI added
  • Resolution set to completed
  • Status changed from accepted to closed