Opened 8 years ago

Closed 8 years ago

#825 closed help (fixed)

Job time limit

Reported by: a.elvidge Owned by: ros
Component: UM Model Keywords: CRUN Automatic Resubmission
Cc: Platform:
UM Version: 7.6

Description

Hi,

Is there a maximum value allowed for the job time limit (in the Job Submission UMUI window)? I submitted a job with a time limit of 20 hours. That is very long, I know; it may only take 12 hours, but I didn't want to wait 12 hours and then find it hadn't quite finished. The reconfiguration seemed to complete (with no errors in the output file), but the model run simply didn't start afterwards. Since the job ran fine before I changed the run length (48 hours to 60 hours) and the job time limit (12 hours to 20 hours), I can only assume that the latter is causing the problem?

Thanks, Andy

Change History (15)

comment:1 Changed 8 years ago by a.elvidge

I notice that I got the following email in relation to the job failure:

PBS Job Id: 595120.sdb
Job Name:   xfxkb_run
Aborted by PBS Server
Job rejected by all possible destinations

Thanks, Andy

comment:2 Changed 8 years ago by ros

Hi Andy,

The maximum time limit for the queues on HECToR is 12 hours.

To run simulations that need longer than 12 hours you will need to use automatic resubmission. For instructions see the CMS website: http://cms.ncas.ac.uk/index.php/um-documentation/ncas-user-guides/37

Cheers,
Ros.

comment:3 Changed 8 years ago by a.elvidge

Hi Ros,

Thanks. My job still disappears, despite reducing the time limit to 12 hrs, but now without an accompanying email. Could it be that I've made too many job submissions in too short a time period? (I tried running it numerous times with different settings over the past couple of hours)

Thanks, Andy

comment:4 Changed 8 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Andy,

There's no limit on the number of times you can submit jobs. The last 4 times you've submitted the job to compile, it has failed because a lock file exists.

As you've been frantically submitting jobs it's hard for me to tell what is going on.
Please remove the lock file /home/n02/n02/aelvidge/compile/xfxkb/ummodel/fcm.bld.lock, as indicated in the .comp.leave file, and resubmit the job to compile and run.
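
The lock cleanup can be sketched in shell as follows. This is a minimal illustration, not part of the UM or FCM tooling: the function name remove_stale_lock is hypothetical, and the demonstration runs on a throwaway directory rather than the real build directory. It should only be used when no other build of the same job is still running.

```shell
#!/bin/sh
# Hypothetical helper: remove an fcm build lock if one exists.
# Argument: the build directory that contains fcm.bld.lock.
remove_stale_lock() {
    lock="$1/fcm.bld.lock"
    if [ -e "$lock" ]; then
        # -r so this works whether the lock is a file or a directory
        rm -rf -- "$lock"
        echo "removed $lock"
    else
        echo "no lock at $lock"
    fi
}

# Demonstration on a temporary directory standing in for the build directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/fcm.bld.lock"
remove_stale_lock "$demo_dir"   # lock present: it is removed
remove_stale_lock "$demo_dir"   # lock already gone: nothing to do
rm -rf -- "$demo_dir"
```

For the job in this ticket the argument would be the ummodel directory named above.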

We'll see what happens then and take it from there.

Cheers,
Ros.

comment:5 Changed 8 years ago by ros

P.S. Just looking at your job, you still have a job time limit set to 13 hours (46800 seconds). Change this to 43200 or less.
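
The values are in seconds: 13 hours is 13 x 3600 = 46800, and the 12-hour queue maximum is 12 x 3600 = 43200. A minimal shell sketch of that check, using the hypothetical names MAX_SECONDS and check_time_limit (neither comes from the UMUI or PBS):

```shell
#!/bin/sh
# Queue maximum quoted in this ticket: 12 hours, expressed in seconds.
MAX_SECONDS=$((12 * 3600))

# Hypothetical helper: succeed if the requested limit (in seconds) fits the queue.
check_time_limit() {
    if [ "$1" -le "$MAX_SECONDS" ]; then
        echo "$1 s is within the limit"
    else
        echo "$1 s exceeds the limit of $MAX_SECONDS s"
        return 1
    fi
}

check_time_limit 43200           # 12 hours: accepted
check_time_limit 46800 || true   # 13 hours: rejected
```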

comment:6 Changed 8 years ago by a.elvidge

Thanks

comment:7 Changed 8 years ago by a.elvidge

Hi Ros,

I'm having trouble with automatic resubmission. I thought I'd followed the instructions, but it hasn't worked… my second run has simply overwritten my first rather than continuing on from it.

One of the instructions is:

If your initial run was a compile and run job in one job then you need also to change to
STEP=4

There was no instance in the submit script of a line containing "STEP=", so I added a new line below TYPE=CRUN. Is this correct? My job (xfxkb) includes build and reconfiguration steps.

Thanks, Andy

comment:8 Changed 8 years ago by ros

Hi Andy.

No that's not correct, unfortunately. You need to change the existing line TYPE=NRUN near the top of the SUBMIT script to be TYPE=CRUN. Adding the line where you did will have no effect, as you have discovered.

Regards,
Ros.

comment:9 Changed 8 years ago by a.elvidge

Hi Ros,

Yes, I did change TYPE=NRUN to TYPE=CRUN. Then, underneath that line, I added 'STEP=4', in accordance with the instruction "If your initial run was a compile and run job in one job then you need also to change to STEP=4" (an instruction I am perhaps misunderstanding?).

Thanks, Andy

comment:10 Changed 8 years ago by ros

Hi Andy,

Sorry, I couldn't tell which way to interpret your comment because, when I looked in your SUBMIT script, no changes had been made.

You need to change the existing STEP= line, NOT add one, to be STEP=4. Do a search for STEP= and you'll find it a lot further down the file. It's the occurrence that is in the section "run_header". Looking at your xfxkb SUBMIT script, STEP is already set to 4, so you now just need to change TYPE=NRUN to TYPE=CRUN and press the UMUI SUBMIT button.
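
One way to locate every STEP= line is grep with line numbers. The sketch below runs on a small stand-in file, because the real SUBMIT script is generated by the UMUI; the stand-in's contents are illustrative only and much shorter than a real script.

```shell
#!/bin/sh
# Stand-in for a SUBMIT script (illustrative contents, not a real UMUI file).
submit=$(mktemp)
cat > "$submit" <<'EOF'
TYPE=NRUN
# ... many lines omitted ...
export STEP=99
# ... run_header section ...
export STEP=4
EOF

# List each STEP= occurrence, prefixed with its line number.
grep -n 'STEP=' "$submit"
```

On a real script, the occurrence to check is the one that follows the run_header lines; the other occurrences should be left alone.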

Regards,
Ros.

comment:11 Changed 8 years ago by a.elvidge

Ah, my error was that I made the edit prior to processing the job. Thanks, Andy

comment:12 Changed 8 years ago by a.elvidge

Hi Ros,

I've tried running the job continuation, but am getting the following error in the UMUI submission box:
"You have selected a complilation step and a continuation run CRUN. This in not allowed. Please modify your UMUI settings"

A segment of my SUBMIT script looks like this:

# Compilation ===================

cat $comp_header >>$comp_script
cat $JOBDIR/COMP_SWITCHES >>$comp_script
cat $JOBDIR/FCM_BLD_COMMAND >>$comp_script

# Reconfiguration =============

cat $rcf_header >>$rcf_script
cat >>$rcf_script<<EOF

export PART=RUN
export RCF_NEW_EXEC=false
export STEP=99
EOF

# Run =========================

cat $run_header >>$run_script
cat >>$run_script<<EOF

export PART=RUN
export RCF_NEW_EXEC=false
export STEP=4

Should the 'STEP=99' be 'STEP=4'? (note that I have a 'STEP=4' under the # Run section).

Thanks, Andy

comment:13 Changed 8 years ago by a.elvidge

I just tried changing the 'STEP=99' to 'STEP=4' but am getting the same error.

comment:14 Changed 8 years ago by ros

Hi Andy,

You need to go into the UMUI panel Compilations and Modifications -> Compile options for the UM Model and turn off the compilation step.

Select "Run from existing executable as named below".

It is probably best to do the same in the reconfiguration window. I'm not sure that's necessary, but it would be safest.

Process the job and then re-edit the SUBMIT script to change TYPE=NRUN to TYPE=CRUN.
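
The hand edit of the processed script can also be made with sed. This is a minimal sketch run against a stand-in file, not a real SUBMIT script; the helper name set_crun is hypothetical, and the pattern assumes the line reads exactly TYPE=NRUN, as quoted in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: switch a processed SUBMIT script from NRUN to CRUN.
# Goes through a temporary file so it works with any POSIX sed (no -i option),
# and copies the result back so the original file keeps its permissions.
set_crun() {
    tmp=$(mktemp)
    sed 's/^TYPE=NRUN$/TYPE=CRUN/' "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Stand-in file for demonstration (illustrative contents only).
submit=$(mktemp)
printf 'TYPE=NRUN\nexport STEP=4\n' > "$submit"

set_crun "$submit"
grep '^TYPE=' "$submit"   # show the edited line
```

As comment 11 found, an edit made before processing does not survive, so a step like this belongs after the job has been processed and before it is submitted.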

Cheers,
Ros.

comment:15 Changed 8 years ago by ros

  • Keywords CRUN Automatic Resubmission added
  • Resolution set to fixed
  • Status changed from accepted to closed