Opened 6 weeks ago

Last modified 6 weeks ago

#3246 new help

Job - not moving past "submitted"

Reported by: NoelClancy Owned by: pmcguire
Component: JULES Keywords:
Cc: Platform:
UM Version:

Description

Patrick,

How can I re-submit a job in Global JULES if it seems to be stuck in the "submitted" phase?

Attachments (1)

Capture_u-bt273.PNG (46.6 KB) - added by NoelClancy 6 weeks ago.


Change History (12)

Changed 6 weeks ago by NoelClancy

comment:1 Changed 6 weeks ago by pmcguire

Hi Noel
What does it say in the job-activity.log file?

Sometimes it can take 1 or 2 or 3 days for a submitted job to start running.

If you want to resubmit the job and wait in the queue again, you can try to kill the spinup_05 task by right-clicking with the mouse on the spinup_05 icon and killing that sub-job. Then you can retrigger it by right-clicking on the red icon and selecting the trigger-now option. Does that work?
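
If the GUI is being awkward, the same thing should also work from the command line. This is only a sketch, assuming cylc 7 syntax; the cycle point matches the one in your logs:

cylc kill u-bt273 'spinup_05.17000101T0000+01'     # kill the submitted task
cylc trigger u-bt273 'spinup_05.17000101T0000+01'  # re-queue it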

Or maybe it's worth waiting longer, depending on the log messages in the various log files.

You can also use the bjobs command to get more information about a job, especially by using bjobs -l JOBID, where JOBID is the job ID number of the job.
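
For example (bjobs -p is standard LSF and lists only your pending jobs together with the reasons they are pending; JOBID is a placeholder):

bjobs -p          # pending jobs only, with pending reasons
bjobs -l JOBID    # full details for one job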

Patrick

comment:2 Changed 6 weeks ago by NoelClancy

cd /home/users/nmc/cylc-run/u-bt273/log/job/17000101T0000+01/spinup_05/01
vi job-activity.log

[jobs-submit ret_code] 0
[jobs-submit out] 2020-04-15T15:50:58+01:00|17000101T0000+01/spinup_05/01|0|9961985
2020-04-15T15:50:58+01:00 [STDOUT] Job <9961985> is submitted to queue <par-single>.
[jobs-poll ret_code] 0

comment:3 Changed 6 weeks ago by NoelClancy

bjobs

JOBID   USER STAT QUEUE      FROM_HOST   EXEC_HOST JOB_NAME   SUBMIT_TIME
9961985 nmc  PEND par-single jasmin-cylc           *1T0000+01 Apr 15 15:50

comment:4 Changed 6 weeks ago by NoelClancy

bjobs -l 9961985

Job <9961985>, Job Name <u-bt273.spinup_05.17000101T0000+01>, User <nmc>,
Project <default>, Status <PEND>, Queue <par-single>, Command:

#!/bin/bash -l
#
# ++++ THIS IS A CYLC TASK JOB SCRIPT ++++
# Suite: u-bt273
# Task: spinup_05.17000101T0000+01
# Job log directory: 17000101T0000+01/spinup_05/01
# Job submit method: lsf
# Execution time limit: 172800.0
# DIRECTIVES:
#BSUB -J u-bt273.spinup_05.17000101T0000+01
#BSUB -o /home/users/nmc/cylc-run/u-bt273/log/job/17000101T0000+01/spinup_05/01/job.out
#BSUB -e /home/users/nmc/cylc-run/u-bt273/log/job/17000101T0000+01/spinup_05/01/job.err
#BSUB -W 2880
#BSUB -q par-single
#BSUB -n 16
export CYLC_DIR='/apps/contrib/metomi/cylc-7.8.1'
export CYLC_VERSION='7.8.1'
export ROSE_VERSION='2019.01.0'
CYLC_FAIL_SIGNALS='EXIT ERR XCPU TERM INT SIGUSR2'
cylc__job__inst__cylc_env() {
    # CYLC SUITE ENVIRONMENT:
    export CYLC_CYCLING_MODE="gregorian"
    export CYLC_SUITE_FINAL_CYCLE_POINT="20181231T2359+01"
    export CYLC_SUITE_INITIAL_CYCLE_POINT="17000101T0000+01"
    export CYLC_SUITE_NAME="u-bt273"
    export CYLC_UTC="False"
    export CYLC_VERBOSE="false"
    export CYLC_SUITE_RUN_DIR="/home/users/nmc/cylc-run/u-bt273"
    export CYLC_SUITE_DEF_PATH="${HOME}/cylc-run/u-bt273"
    export CYLC_SUITE_DEF_PATH_ON_SUITE_HOST="/home/users/nmc/cylc-run/u-bt273"
    export CYLC_SUITE_UUID="89ef6e22-f267-47ee-87a4-758addc17f93"
    # CYLC TASK ENVIRONMENT:
    export CYLC_TASK_JOB="17000101T0000+01/spinup_05/01"
    export CYLC_TASK_NAMESPACE_HIERARCHY="root JASMIN JULES_JASMIN JULES SPINUP spinup_05"
    export CYLC_TASK_DEPENDENCIES="spinup_04.17000101T0000+01"
    export CYLC_TASK_TRY_NUMBER=1
}
cylc__job__inst__user_env() {
    # TASK RUNTIME ENVIRONMENT:
    export JULES_REVISION JULES_FCM ROSE_SUITE_NAME OMP_NUM_THREADS ROSE_LAUNCHER \
        ROSE_TASK_APP OUTPUT_FOLDER DATA_DIREC ANCIL_DIREC ANCIL_TIME_DIREC \
        RUN_ID_STEM SPIN_END ROSE_APP_OPT_CONF_KEYS MAIN_TASK_START MAIN_TASK_END \
        ID_STEM ID_STEM2 INITFILE USE_FILE DUMPFILE
    JULES_REVISION="vn5.4"
    JULES_FCM="/home/users/nmc/jules/vn5.4"
    ROSE_SUITE_NAME="$CYLC_SUITE_NAME"
    OMP_NUM_THREADS="2"
    ROSE_LAUNCHER="mpirun.lotus"
    ROSE_TASK_APP="jules"
    OUTPUT_FOLDER="/work/scratch/$USER/$ROSE_SUITE_NAME"
    DATA_DIREC="/gws/nopw/j04/ncas_generic/OzDamDriveData/"
    ANCIL_DIREC="/gws/nopw/j04/ncas_generic/OzDamDriveData/HadGEM3-GA6/Ancils/GA7_Ancil/"
    ANCIL_TIME_DIREC="$ANCIL_DIREC"
    RUN_ID_STEM="JULES-ES.1p0.vn5.4.50.CRUJRA2.TRENDYv8.365"
    SPIN_END="$(rose date --calendar gregorian $CYLC_TASK_CYCLE_POINT -s P20Y -f '%Y%m%d')"
    ROSE_APP_OPT_CONF_KEYS="spinup"
    MAIN_TASK_START="$(rose date --calendar gregorian $CYLC_SUITE_INITIAL_CYCLE_POINT -f "'%Y-%m-%d %H:%M:%S'")"
    MAIN_TASK_END="$(rose date --calendar gregorian $CYLC_SUITE_INITIAL_CYCLE_POINT -s P20Y -f "'%Y-%m-%d %H:%M:%S'")"
    ID_STEM="$RUN_ID_STEM.$(basename $CYLC_TASK_NAME ${CYLC_TASK_NAME})"
    ID_STEM2="$RUN_ID_STEM.$(basename $CYLC_TASK_NAME ${CYLC_TASK_NAME##*_})"
    INITFILE="${OUTPUT_FOLDER}/${ID_STEM2}04.dump.${SPIN_END}.0.nc"
    USE_FILE=".true."
    DUMPFILE=".true."
}
cylc__job__inst__env_script() {
    # ENV-SCRIPT:
    eval $(rose task-env)
    module add parallel-netcdf/intel
    module list 2>&1
    env | grep LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_LIBDIR
    env | grep LD_LIBRARY_PATH
}
cylc__job__inst__script() {
    # SCRIPT:
    mkdir -p $OUTPUT_FOLDER
    rose task-run --quiet --path=share/fcm_make/build/bin
}
. "${CYLC_DIR}/lib/cylc/job.sh"
cylc__job__main
#EOF: 17000101T0000+01/spinup_05/01

Wed Apr 15 15:50:58: Submitted from host <jasmin-cylc.ceda.ac.uk>, CWD </>,
Output File </home/users/nmc/cylc-run/u-bt273/log/job/17000101T0000+01/spinup_05/01/job.out>,
Error File </home/users/nmc/cylc-run/u-bt273/log/job/17000101T0000+01/spinup_05/01/job.err>,
16 Processors Requested;

RUNLIMIT
2880.0 min of jasmin-cylc.ceda.ac.uk

PENDING REASONS:
Just started a job recently: 94 hosts;
Job slot limit reached: 54 hosts;
The 1 min effective CPU queue length (r1m) is beyond threshold: 26 hosts;
Not specified in job submission: 154 hosts;
Unable to reach slave batch server: 10 hosts;
Load information unavailable: 30 hosts;
Closed by LSF administrator: 9 hosts;

SCHEDULING PARAMETERS:
            r15s   r1m  r15m   ut    pg    io    ls    it   tmp   swp   mem
loadSched      -   0.9     -    -     -     -     -     -     -     -     -
loadStop       -     -     -    -     -     -     -     -     -     -     -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] span[hosts=1] same[nodetype]
Effective: -

comment:5 Changed 6 weeks ago by NoelClancy

The number of processors is set to 16 in the rose-app.conf file.
Is that a problem?

job.err>, 16 Processors Requested (from above)

How would you know if it would be worth waiting longer?

comment:6 Changed 6 weeks ago by pmcguire

Hi Noel:
If you're asking for 16 processors on the par-single queue, then the scheduler is trying to find 16 processors on a single LOTUS batch node of JASMIN.

In the past, I have sometimes had to wait 1 or 2 or 3 days (or longer) for a submitted job to start running. This waiting period was often longer during the week than at the weekend.

It's probably worth waiting longer.

But if all your other par-single jobs are getting properly submitted right now, then maybe you can kill the spinup_05 task and then retrigger it, with a right-click on the icon. You will have to wait in the par-single queue again if you do that, but maybe the queue won't make you wait so long if you retrigger it.

Patrick McGuire

comment:7 Changed 6 weeks ago by NoelClancy

This is the only par-single job that I have running at this moment.
But the first six tasks (fcm_make, RECON, spin-up 1, spin-up 2, spin-up 3, and spin-up 4) on u-bt273 ran without this delay, so I'm not sure.
This time I can wait, as the weekend is coming now; maybe the queues will become less busy.

The other suite, u-bs482 is running on the par-multi queue with the -R = span[hosts=1] option.
For future runs, would you recommend the par-multi queue if I will be using 16 processors?

comment:8 Changed 6 weeks ago by pmcguire

Hi Noel:
Yes, sometimes the queues clog up, so maybe it didn't get clogged up until spinup_05.

Or maybe your priority was lowered after a while due to running a lot of jobs. This could be so, but I don't think this is the case.

It might be worthwhile trying par-multi with 16 processors and -R = span[hosts=1]. That might be faster than par-single.
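
For instance, the relevant batch directives in the suite might end up looking something like this. This is only a sketch: the task section name and exact directive spellings are assumptions, not copied from u-bt273:

[[spinup_05]]
    [[[directives]]]
        -q = par-multi
        -n = 16
        -R = span[hosts=1]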

You can try this right now by making a copy of the suite, making the changes described in the previous paragraph (and changing the output file names or directory), and then rerunning from the beginning. JASMIN can handle multiple suites running from the same user, as long as you have enough disk space and you're not running too many suites at the same time.
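
A minimal sketch of that copy-and-rerun workflow, assuming the standard rosie/rose tools on JASMIN (the new suite ID u-xxNNN is a placeholder for whatever rosie assigns):

rosie copy u-bt273                # create a new suite copied from u-bt273
rose edit -C ~/roses/u-xxNNN      # change the queue directives and output paths
rose suite-run -C ~/roses/u-xxNNN # run the new copy from the beginning
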
Patrick

comment:9 Changed 6 weeks ago by NoelClancy

(base) [nmc@jasmin-cylc nmc]$ pwd
/work/scratch/nmc
(base) [nmc@jasmin-cylc nmc]$ du -sh
3.1T .

This is quite high, isn't it?
What limit do I have on the scratch?

comment:10 Changed 6 weeks ago by NoelClancy

I have 1.8 T in config
This was from the GLOBAL JULES online tutorial, maybe I can delete these output files now.

(base) [nmc@jasmin-cylc nmc]$ ls
config cylc-run fluxnet logs u-bs482 u-bt273
(base) [nmc@jasmin-cylc nmc]$ du -sh
3.1T .
(base) [nmc@jasmin-cylc nmc]$ cd config/
(base) [nmc@jasmin-cylc config]$ du -sh
1.8T .
(base) [nmc@jasmin-cylc config]$ cd ..
(base) [nmc@jasmin-cylc nmc]$ cd cylc-run/
(base) [nmc@jasmin-cylc cylc-run]$ du -sh
6.5G .
(base) [nmc@jasmin-cylc cylc-run]$ cd ..
(base) [nmc@jasmin-cylc nmc]$ cd fluxnet/
(base) [nmc@jasmin-cylc fluxnet]$ du -sh
94G .
(base) [nmc@jasmin-cylc fluxnet]$ cd ..
(base) [nmc@jasmin-cylc nmc]$ cd logs/
(base) [nmc@jasmin-cylc logs]$ du -sh
23G .
(base) [nmc@jasmin-cylc logs]$ cd ..
(base) [nmc@jasmin-cylc nmc]$ cd u-bs482
(base) [nmc@jasmin-cylc u-bs482]$ du -sh
1.2T .
(base) [nmc@jasmin-cylc u-bs482]$ cd ..
(base) [nmc@jasmin-cylc nmc]$ cd u-bt273/
(base) [nmc@jasmin-cylc u-bt273]$ du -sh
6.0G .
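
For reference, the same per-directory breakdown can be had in one go (plain GNU du and sort; this just sorts the top-level directories by size):

du -sh /work/scratch/nmc/* | sort -h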

comment:11 Changed 6 weeks ago by pmcguire

Hi Noel
3.1TB on scratch on JASMIN is not super-super-high. But it is rather high. Maybe at some point somebody from CEDA will ask you to move your stuff off scratch.

You can look to see how much is left on scratch right now for everybody with either pan_df or df:

[pmcguire@jasmin-sci1 ~]$ pan_df -H /work/scratch
Filesystem             Size   Used  Avail Use% Mounted on
panfs://panmanager02.jc.rl.ac.uk/cache/lotus/scratch
                       100T    83T    18T  83% /work/scratch

Right now, there is 18TB left, so maybe there is plenty of room for your work there at the moment.

Patrick

Last edited 6 weeks ago by pmcguire