Opened 3 years ago

Closed 3 years ago

#1946 closed help (completed)

Frustrated by rose stem submission failures

Reported by: ros
Owned by: um_support
Component: UM Model
Keywords: rose stem
Cc:
Platform: MONSooN
UM Version: 10.5

Description

Hello Monsoon,

I have been trying since Friday to complete this rose stem run, and each time it reaches these 4 tests they fail to submit.

$ qstat_snapshot | grep mricha
716123.xcm00    mricha     normal atmos_monsoon_xc40_n48_ga7_amip_10day_8x14.1.r26535_vn10.5_t7_columnarAeroCtl --       7   224    1gb 01:00 Q    --
    Not Running: Insufficient amount of resource ncpus (R: 225 A: 148 T: 3904)
716124.xcm00    mricha     normal atmos_monsoon_xc40_n48_ga7_amip_30day_8x14.1.r26535_vn10.5_t7_columnarAeroCtl --       7   224    1gb 01:30 Q    --
    Not Running: Insufficient amount of resource ncpus (R: 225 A: 148 T: 3904)
716125.xcm00    mricha     normal atmos_monsoon_xc40_n48_ga7_amip_10day_16x8.1.r26535_vn10.5_t7_columnarAeroCtl --       8   256    1gb 01:00 Q    --
    Not Running: Insufficient amount of resource ncpus (R: 257 A: 148 T: 3904)
716128.xcm00    mricha     normal atmos_monsoon_xc40_n48_ga7_amip_30day_16x8.1.r26535_vn10.5_t7_columnarAeroCtl --       8   256    1gb 01:30 Q    --
    Not Running: Insufficient amount of resource ncpus (R: 257 A: 148 T: 3904)

Can you help me understand why these do not get submitted?

Mark

Change History (4)

comment:1 Changed 3 years ago by ros

Hi Mark,

These jobs are being caught by a cylc event handler which says, in effect, "if this job has not started running within 3 hours of being submitted, time out". The rose stem suite then resubmits the "failed" task and the cycle repeats.
See for example /home/mricha/cylc-run/r26535_vn10.5_t7_columnarAeroCtl/log/job/1/atmos_monsoon_xc40_n48_ga7_amip_10day_16x8/01/job-activity.log

You will need to find the suite definition file for the rose-stem suite and change the

submission timeout = PT3H

to something longer. I don't know which directory you're running rose-stem from so can't point you to the exact file to change.
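As a rough sketch of that search-and-edit step: the two helper functions below are hypothetical names made up for illustration, not part of rose or cylc, and PT12H is only an example of "something longer". The sketch assumes the setting appears literally as "submission timeout = PT3H" somewhere under the rose-stem directory.

```shell
# Print every file and line under a directory that sets a submission timeout.
# Usage: find_submission_timeout <rose-stem directory>
find_submission_timeout() {
    grep -rn "submission timeout" "$1"
}

# Replace PT3H with a longer ISO 8601 duration in one file.
# A copy of the original is kept alongside it with a .bak suffix.
# Usage: set_submission_timeout <file> <new duration, e.g. PT12H (assumed value)>
set_submission_timeout() {
    sed -i.bak "s/\(submission timeout[[:space:]]*=[[:space:]]*\)PT3H/\1$2/" "$1"
}
```

Run find_submission_timeout on the rose-stem directory first, then pass the file it reports to set_submission_timeout. Only lines that currently read PT3H are changed.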

Regards,
Ros.

comment:2 Changed 3 years ago by ros

Hi Ros,
I suppose this means I have to run the WHOLE stem suite again.
I cannot just restart, as that change is in:

/home/mricha/DevBranch/r26535_vn10.5_t7_columnarAeroCtl/rose-stem/rose-suite.conf

If I use rose suite-restart it will not process that file (I think).

That means all 113 stem tests again and another day on Monsoon.

I am using rose stem --group=ukca

and need to add the trac.log to my Ticket before it goes to Sci/Review…

Mark

Last edited 3 years ago by ros

comment:3 Changed 3 years ago by ros

Hi Mark,

I think you should be able to run:

rose stem --group=ukca --restart

Which should reload the suite definition and restart from where it left off.

Cheers,
Ros.

comment:4 Changed 3 years ago by ros

  • Resolution set to completed
  • Status changed from new to closed