Opened 4 years ago
Closed 4 years ago
#1946 closed help (completed)
Frustrated by rose stem submission failures
Reported by: | ros | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | rose stem |
Cc: | | Platform: | MONSooN |
UM Version: | 10.5 | | |
Description
Hello Monsoon,
I have been trying since Friday to complete this rose stem test and each time it gets to these 4 tests and fails to submit.
$ qstat_snapshot | grep mricha
716123.xcm00 mricha normal atmos_monsoon_xc40_n48_ga7_amip_10day_8x14.1.r26535_vn10.5_t7_columnarAeroCtl -- 7 224 1gb 01:00 Q -- Not Running: Insufficient amount of resource ncpus (R: 225 A: 148 T: 3904)
716124.xcm00 mricha normal atmos_monsoon_xc40_n48_ga7_amip_30day_8x14.1.r26535_vn10.5_t7_columnarAeroCtl -- 7 224 1gb 01:30 Q -- Not Running: Insufficient amount of resource ncpus (R: 225 A: 148 T: 3904)
716125.xcm00 mricha normal atmos_monsoon_xc40_n48_ga7_amip_10day_16x8.1.r26535_vn10.5_t7_columnarAeroCtl -- 8 256 1gb 01:00 Q -- Not Running: Insufficient amount of resource ncpus (R: 257 A: 148 T: 3904)
716128.xcm00 mricha normal atmos_monsoon_xc40_n48_ga7_amip_30day_16x8.1.r26535_vn10.5_t7_columnarAeroCtl -- 8 256 1gb 01:30 Q -- Not Running: Insufficient amount of resource ncpus (R: 257 A: 148 T: 3904)
Can you help me understand why these do not get submitted?
Mark
Change History (4)
comment:1 Changed 4 years ago by ros
comment:2 Changed 4 years ago by ros
Hi Ros
I suppose this means I have to run the WHOLE stem suite again.
I cannot just restart as that change is in:
/home/mricha/DevBranch/r26535_vn10.5_t7_columnarAeroCtl/rose-stem/rose-suite.conf
If I use rose suite-restart it will not process that file (I think).
That is all 113 stem tests again and another day on Monsoon.
I am using rose stem --group=ukca
and need to add the trac.log to my Ticket before it goes to Sci/Review…
Mark
comment:3 Changed 4 years ago by ros
Hi Mark,
I think you should be able to run:
rose stem --group=ukca --restart
Which should reload the suite definition and restart from where it left off.
Cheers,
Ros.
comment:4 Changed 4 years ago by ros
- Resolution set to completed
- Status changed from new to closed
Hi Mark,
These jobs are being caught by a cylc event handler which says, "If this job hasn't started running within 3 hours of being submitted, time out". The rose stem suite then resubmits the "failed" task and the cycle repeats.
See for example /home/mricha/cylc-run/r26535_vn10.5_t7_columnarAeroCtl/log/job/1/atmos_monsoon_xc40_n48_ga7_amip_10day_16x8/01/job-activity.log
You will need to find the suite definition file for the rose-stem suite and change the
submission timeout = PT3H
to something longer. I don't know which directory you're running rose-stem from so can't point you to the exact file to change.
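One way to track down the file is to search the rose-stem directory for the setting. The sketch below is only an illustration: find_submission_timeout is a hypothetical helper name, not part of rose or cylc, and it assumes the setting is written as plain text (as in the line quoted above) somewhere under the directory you pass in.

```shell
# Hypothetical helper (not part of rose or cylc): list every line under a
# directory that sets a submission timeout, with file name and line number.
find_submission_timeout() {
    # -r: recurse into subdirectories, -n: print line numbers,
    # -E: extended regex. The pattern tolerates variable spacing
    # between the words and before the "=".
    grep -rnE 'submission[[:space:]]+timeout[[:space:]]*=' "$1"
}

# Example call, using the rose-stem directory mentioned earlier in this ticket:
# find_submission_timeout /home/mricha/DevBranch/r26535_vn10.5_t7_columnarAeroCtl/rose-stem
```

The value is an ISO 8601 duration, so PT3H means three hours; something like PT12H would be twelve hours (that figure is just an example, pick whatever suits the current queue times).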
Regards,
Ros.