Opened 5 months ago

Closed 4 weeks ago

#3235 closed help (fixed)

TERM_RUNLIMIT JULES

Reported by: NoelClancy Owned by: pmcguire
Component: JULES Keywords:
Cc: Platform:
UM Version:

Description

job.err
User defined signal 2
MPI Application rank 5 killed before MPI_Finalize() with signal 12
mpirun: propagating signal 12
2020-03-24T21:45:47Z CRITICAL - failed/SIGUSR2

job.out
TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Exited with exit code 1.

Resource usage summary:

CPU time : 171298.00 sec.
Max Memory : 426 MB
Average Memory : 423.91 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : 8483 MB
Max Processes : 24
Max Threads : 26
Run time : 10802 sec.
Turnaround time : 12257 sec.

Attachments (2)

unnamed.png (65.5 KB) - added by pmcguire 4 months ago.
unnamed-2.png (45.4 KB) - added by pmcguire 4 months ago.


Change History (27)

comment:1 Changed 5 months ago by NoelClancy

  • Component changed from UM Model to JULES
  • Owner changed from um_support to pmcguire

Hi Patrick,

I think I need to allocate more time to the jobs.

comment:2 Changed 5 months ago by NoelClancy

I just checked in the suite u-bs482 with:

fcm ci

comment:3 Changed 5 months ago by pmcguire

Hi Noel:
Yes, you're right, you need to allocate more time to the jobs. That indeed is what the error messages say.

You'll see these variables in your ~nmc/roses/u-bs482/rose-suite.conf file:

WALLTIME_RUN='PT3H'
WALLTIME_SPINUP='PT3H'

That means for the main run, the limit is 3 hours for the running wall-clock time. And likewise for the spinup.

Based on prior experience with the u-bb316 GL7 suite, I would expect that this suite could work with PT48H instead of PT3H. It might work with less than 48 hours of wall-clock time, but I am not sure right now. 48 hours is the maximum possible for the par-single queue on JASMIN. I think that the suite will restart itself periodically.
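The change suggested above can be sketched in rose-suite.conf (a minimal sketch; the values follow the PT48H suggestion, and the times are ISO 8601 durations):

```ini
# rose-suite.conf (sketch): wall-clock limits as ISO 8601 durations.
# PT48H is the par-single queue's maximum on JASMIN.
WALLTIME_RUN='PT48H'
WALLTIME_SPINUP='PT48H'
```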

For the par-single queue, it might take a while for the job to start after it is submitted into the queue. It's trying to reserve 8 processors on a single node, and the scheduler for JASMIN figures out when to start the job running. If this becomes prohibitive, then maybe the suite can be changed to use the par-multi queue but with the -R = "span[hosts=1]" option.

Do you really want to have 50 spinup cycles of 20 years each? That might take much longer than 48 hours. I don't know whether the suite will restart itself properly during spinup if the total spinup, or one cycle of it, does not finish within the wall-clock time.

Patrick

comment:4 Changed 5 months ago by NoelClancy

Hi Patrick,

I followed your recommendations last night, and the suite successfully passed the fcm_make and RECON stages fairly quickly.

It has completed the first three spin-up cycles and is running spin-up 4.

Thanks, and also for the explanation, that's good to have. I'm making a log of all my past tickets so I can go back and read them if the same problems come up again.

Really great help Patrick !!!

comment:5 Changed 5 months ago by NoelClancy

Ticket closed for now, Thanks

comment:6 Changed 4 months ago by NoelClancy

Patrick,

Suite, u-bs482 completed S0, S2, S3 and S4 so I backed up the output files from those runs.
Then I corrected the error in S1 and when I try to run the suite, it doesn't get past the RECON.

What did you mean by the following?

"For the par-single queue, it might take a while for the job to start after it is submitted into the queue. It's trying to reserve 8 processors on a single node, and the scheduler for JASMIN figures out when to start the job running. If this becomes prohibitive, then maybe the suite can be changed to use the par-multi queue but with the -R = "span[hosts=1]" option."

I see in suite.rc

[[[directives]]]

-q = par-single

But I don't know what you mean by:

-R = "span[hosts=1]" option

comment:7 Changed 4 months ago by NoelClancy

job.err:

Environment variables set for netCDF Fortran bindings in

/apps/libs/netCDF/intel14/fortran/4.2/

You will also need to link your code to a compatible netCDF C library in

/apps/libs/netCDF/intel14/4.3.2/

/bin/sh: rose-jules-run: command not found
[FAIL] rose-jules-run <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=127
2020-04-13T11:40:38+01:00 CRITICAL - failed/EXIT

comment:8 Changed 4 months ago by NoelClancy

Exited with exit code 1.

Resource usage summary:

CPU time : 3.33 sec.
Max Memory : -
Average Memory : -
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : -
Max Threads : -
Run time : 8 sec.
Turnaround time : 1148 sec.

comment:9 Changed 4 months ago by NoelClancy

job.out

comment:10 Changed 4 months ago by NoelClancy

When you say "maybe the suite can be changed to use the par-multi queue but with the -R = "span[hosts=1]" option", do you mean changing the 'suite.rc' from

[[[directives]]]

-q = par-single
-n = {{ MPI_NUM_TASKS }} # Set as 16 in 'rose-suite.conf'

to

[[[directives]]]

-q = par-multi
-R = "span[hosts=1]"
-n = {{ MPI_NUM_TASKS }} # Set as 16 in 'rose-suite.conf'

comment:11 Changed 4 months ago by pmcguire

yes, that's what I meant, Noel!
I'm glad you figured out what I meant.

You can use the bqueues command and the bjobs -l 7722673 command to get some info about how busy the queues are and about any issues with your particular job.

Patrick

comment:12 Changed 4 months ago by NoelClancy

Patrick,

First of all, I checked out a copy as a clean suite, u-bt273 with all the original settings.
This clean suite (u-bt273) now has all the previous errors corrected and is running through the spin-ups at the moment. I'm hoping that once it finishes the 50 spin-ups it will go on to the MAIN RUN and finish successfully. I will find out in about a week's time.

This is my insurance policy. Is it possible that u-bt273 will fail at the MAIN RUN stage now?

In relation to u-bs482, I tried what you recommended (above), but it didn't work.

Would I have messed things up by doing a suite-clean on u-bs482?


comment:13 Changed 4 months ago by NoelClancy

bjobs -l 7722673

What does the 7722673 mean?
Unique to my job?

comment:14 Changed 4 months ago by NoelClancy

How did you find 7722673?

comment:15 Changed 4 months ago by pmcguire

Hi Noel
To address two of your questions:
1)
Can you tell me what happened when you "tried what [I] recommended (above) with suite u-bs482 but it didn't work"?

2)
The 7722673 is your job ID on JASMIN

I looked at your log files, and it was in there. Do you see it in there?

There are other ways to find your job ID for a particular job, as well.
Patrick

comment:16 Changed 4 months ago by pmcguire

Hi Noel
Another way to get the job ID number on JASMIN is to use either bjobs or bjobs -u nmc. It will show your job ID number then.
Patrick

comment:17 Changed 4 months ago by NoelClancy

Hi Patrick,

1)
I tried:

-q = par-multi
-R = "span[hosts=1]"
-n = {{ MPI_NUM_TASKS }} # Set as 16 in 'rose-suite.conf'

but nothing happened

Then I tried to set MPI_NUM_TASKS = 8 in 'rose-suite.conf' but that failed.

Then I reverted to all the original settings and ran u-bs482 from scratch.

LSPINUP=false and BUILD=false in 'rose-suite.conf'

And in 'suite.rc'

-q = par-single
-n = {{ MPI_NUM_TASKS }} # Set as 16 in 'rose-suite.conf'
INITFILE = "${OUTPUT_FOLDER}/${RUN_ID_STEM}.${SPINDUMP}.dump.${DUMPTIME}.0.nc"

It took just 8 minutes for the suite to successfully complete

FCM_MAKE
RECON

Then as it submitted the first SPINUP job, I stopped the suite and am trying to restart it from

INITFILE = "${OUTPUT_FOLDER}/${RUN_ID_STEM}.spinup_50.dump.17200101.0.nc"

LBUILD=false
LSPINUP=false

But it has been sitting in the submitted state at the RECON stage for 2 hours, and it has still neither run nor failed.
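For reference, the INITFILE path above is plain string substitution of the suite variables, as in this minimal sketch (the OUTPUT_FOLDER value here is an illustrative placeholder, not the suite's real path):

```shell
# Sketch of how INITFILE is assembled from the suite variables.
# OUTPUT_FOLDER below is a hypothetical example path.
OUTPUT_FOLDER='/work/scratch/nmc/u-bs482/output'
RUN_ID_STEM='u-bs482'
INITFILE="${OUTPUT_FOLDER}/${RUN_ID_STEM}.spinup_50.dump.17200101.0.nc"
echo "$INITFILE"
```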

2)
I found the job ID '7728073'
cd /home/users/nmc/cylc-run/u-bs482/log/job/17000101T0000+01/RECON/01
vi job-activity.log
[jobs-submit out] 2020-04-14T15:45:44+01:00|17000101T0000+01/RECON/01|0|7728073
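The job ID is the last |-separated field of that job-activity.log line, so it can be pulled out with shell parameter expansion (using the log line quoted above):

```shell
# Extract the LSF job ID (the last '|'-separated field) from a
# job-activity.log submission line.
log_line='[jobs-submit out] 2020-04-14T15:45:44+01:00|17000101T0000+01/RECON/01|0|7728073'
job_id="${log_line##*|}"
echo "$job_id"   # prints 7728073
```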

comment:18 Changed 4 months ago by NoelClancy

OK, I've made an error, because it's still waiting in the par-single queue.

I need to stop the suite and run on the par-multi queue

comment:19 Changed 4 months ago by NoelClancy

I'm somewhat confused as to why I was able to restart S0, S2, S3 and S4 (FOUR simulations) from the last spinup dump file

spinup_50.dump.17200101.0.nc
LBUILD=false
LSPINUP=false

with the following settings

-q = par-single
-n = {{ MPI_NUM_TASKS }} # Set as 16 in 'rose-suite.conf'

In any case, I need to learn how to restart suites from the last spin-up dump, otherwise every future experiment I do will require going through the whole spin-up phase.

But when I tried to restart S0, S1, S2, S3 and S4 (FIVE simulations) it seems to be too much for

-q = par-single

-n = {{ MPI_NUM_TASKS }} # Set as 16 in 'rose-suite.conf'

Is that possible: that 4 parallel simulations can be run on the par-single queue, but 5 simulations is too much?

Or is it simply a matter of luck and good timing as to when a job is submitted to the queue, which determines whether it is successful or not?


comment:20 Changed 4 months ago by NoelClancy

RECON successful

suite running now

comment:21 Changed 4 months ago by NoelClancy

Patrick,

S0, S2 and S3 have run through 1700-1900 and are still running, so hopefully they will run until 2020.

However, S1 and S4 failed at the 10-year re-submission, at 1770 and 1760 respectively. But the colour is pink (not red) in the cylc GUI. What does this mean? Will those simulations continue running at some point?

comment:22 Changed 4 months ago by pmcguire

Hi Noel:
In the cylc GUI, next to the pink-coloured icons for S1 for 1770 and for S4 for 1760 (see the attachments), it says that the 'state' is 'submit-failed'. This means that the submission of the jobs to the JASMIN LOTUS batch-job submission system failed for some reason.
The jobs never started to run.

What do the job-activity.log log files say for the cause of the error?

You can try to resubmit each of the 2 failed jobs by right clicking with the mouse in the cylc GUI, and press the 'trigger now' option for the S1 1770 and S4 1760 jules jobs. Does that work?
Patrick McGuire

Last edited 4 months ago by pmcguire (previous) (diff)

Changed 4 months ago by pmcguire

Changed 4 months ago by pmcguire

comment:23 Changed 4 months ago by NoelClancy

Yes, that works. Thanks.
Ticket closed for now.

comment:24 Changed 4 weeks ago by pmcguire

  • Status changed from new to accepted

comment:25 Changed 4 weeks ago by pmcguire

  • Resolution set to fixed
  • Status changed from accepted to closed