Opened 10 months ago

Closed 9 months ago

#3139 closed help (completed)

u-bq336: suite clogs my archer processes

Reported by: xd904476 Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version:

Description

Hi,
I am running this suite with 30 ensemble members, with a limit set in suite.rc on the number of processes that can run at the same time:
{% if SITE == 'archer' %}
    [[queues]]
        [[[parallel_queue]]]
            limit = 10
            members = COUPLED
        [[[serial_queue]]]
            limit = 4
            members = PERTURB_RESOURCE, POSTPROC
{% endif %}

This doesn't seem to be enough: the suite submits all the coupled tasks right after the perturbations and even before the recon.
This is generally OK with only a few ensemble members, but with this number all my jobs on archer are now clogged. I have set all the coupled tasks to "failed" so that I can run the perturbations and then the recon manually.
I'll try to sort this out for this case, but could you please tell me how to add some "order" in suite.rc so that the RECON task is only triggered after all the perturbations have succeeded?
Thanks,
Dani

Change History (7)

comment:1 Changed 9 months ago by xd904476

Update: I have manually created all the perturbed initial conditions and I have retriggered the RECON task.
The suite is now not recognising ASTART, and I am also not sure that it is reading the right ice initial condition, judging by the job.out file (it should read /work/n02/n02/dflocco/startdump/be699i.restart.2015-01-01-00000.nc).
Could you help please? Otherwise I'll have to quickly start running the 30 suites manually for the experiment.
Thanks,
Dani

comment:2 Changed 9 months ago by dcase

Dani, I can see that you have 30 start dumps, and your job files for the coupled runs include a variable to pick these up. Could you trigger one of the coupled jobs or point me to a job.err file which shows the ASTART problem?

Thanks.

comment:3 Changed 9 months ago by xd904476

Hi Dave,
The suite stops at the RECON tasks. The job.err tells me that astart is unbound.
/home/xd904476/cylc-run/u-bq336/log/job/20150101T0000Z/recon/10/job.err

I can trigger a coupled task, but I don’t know what it would use as initial conditions.

Shall I trigger the recon again or a coupled task?
Cheers,
Dani

comment:4 Changed 9 months ago by dcase

Ok. The recon step only happens once, and then the ensemble members will do their perturbations and make their own files. In suite.rc, under [[recon]][[[environment]]], you can set:

ASTART = $ROSE_DATA/$RUNID.astart

as was the case for the perturb<ensemble> step.
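
For reference, a minimal sketch of where that setting would sit in suite.rc. It assumes the usual Cylc 7 nesting under [runtime] and that the task is named recon, as in the job log path quoted earlier. Only the ASTART line itself comes from this ticket.

```ini
# Sketch only: the [runtime] nesting is assumed, not copied from u-bq336.
[runtime]
    [[recon]]
        [[[environment]]]
            # Start dump written by the reconfiguration step
            ASTART = $ROSE_DATA/$RUNID.astart
```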


comment:5 Changed 9 months ago by xd904476

Ah ok, I get it. So the recon needs to run before all the perturbations. It was confusing with all the perturbations starting at the same time.
I'll try again, holding the perturbations and the coupled tasks.

thanks
dani
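
The ordering described here (recon once, then each member's perturbation, then its coupled run) could be expressed in the suite.rc graph roughly as below. This is only a sketch, assuming Cylc 7 parameterised tasks. The parameter name ensemble is inferred from perturb<ensemble> above, while coupled<ensemble>, the R1 recurrence and the 1..30 range are illustrative assumptions, not taken from u-bq336.

```ini
# Sketch only: task names other than recon and perturb<ensemble> are hypothetical.
[cylc]
    [[parameters]]
        # One value per ensemble member (assumed range)
        ensemble = 1..30

[scheduling]
    [[dependencies]]
        # Run once, at the initial cycle point
        [[[R1]]]
            graph = """
                # recon must succeed before any perturbation starts;
                # each coupled run waits only for its own perturbation
                recon => perturb<ensemble> => coupled<ensemble>
            """
```

With explicit dependencies like these, holding tasks by hand should not be needed. The queue limits would then only cap how many of the ready tasks are submitted at once.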

comment:6 Changed 9 months ago by xd904476

Hi Dave, no luck.
The only thing I changed compared to last week, when the suite ran properly, is the number of ensemble members.
Shall I try deleting everything on archer rather than only restarting it as new? Perhaps something is stuck?

In any case, I have now also added ASTART to the recon environment and I'll restart it.
thanks

comment:7 Changed 9 months ago by xd904476

  • Resolution set to completed
  • Status changed from new to closed

The ensemble size needs to be smaller than 16. More changes to suite.rc are probably needed to overcome this issue and have the suite handle the queues for the ensemble by itself.

Best,
Dani
