Opened 4 years ago

Closed 4 years ago

#1805 closed help (wontfix)

v8.4 UM-UKCA jobs failing to complete build & compile in 1hour-wallclock on serial queue

Reported by: gmann Owned by: um_support
Component: UM Model Keywords: ukca
Cc: s.s.dhomse@… Platform: ARCHER
UM Version: 8.4

Description

Dear NCAS-CMS helpdesk,
cc: Sandip

I am encountering a problem this evening: the build-and-compile
steps for two of my v8.4 UM-UKCA ARCHER jobs are not completing
within the 1:00:00 wall-clock limit for the serial nodes.

/home/n02/n02/gmann/output/xmhbw000.xmhbw.d16035.t191613.comp.leave

/home/n02/n02/gmann/output/xmhbx000.xmhbx.d16035.t193342.comp.leave

By contrast an equivalent v8.4 UM-UKCA job I submitted this afternoon
completed OK in about 20 minutes or so (that's usually how long it
takes to compile):

/home/n02/n02/gmann/output/xmhbv000.xmhbv.d16035.t132728.comp.leave

The xmhbv is basically the same job as xmhbx but is just a short
2-day job for testing some code-changes I'd carried out.

That test worked successfully so I proceeded to make those code-changes
in the main jobs (xmhbw and xmhbx) but then frustratingly the compile
is timing out.

This is doubly frustrating because I have got a 2-job-width ARCHER
reservation on at the moment — and the jobs I submitted last night
failed overnight with disk quota errors due to the problem Grenville
emailed out about with /work filling.

I re-submitted this morning the set of six 5-year 1990s transient
Pinatubo-perturbed interactive strat-trop aerosol jobs that had
crashed last night. And these are now running OK.

But the 2 jobs I'm submitting this evening are longer 15-year Pre-Industrial
Timeslice control jobs for some Krakatoa simulations we'll do.

The plan with this was to run 2 10+-year control jobs in this ARCHER
reservation (Grenville has helped me out getting this all set up).

We got that together last night but it seems to be failing.

Because of the ARCHER reservation, I've elevated the priority for this
query to be high — that's because the compile time-out is stopping
me submitting these last 2 runs to the 2-job-width reservation.

So I'll lose another overnight run-time period in the reservation.

I'm assuming this is some temporary "system problem" on ARCHER that
is causing the build or compilation to go more slowly than it should.

And I'm hoping the "system problem" may just be temporary.

But hopefully the info I've provided above may help make it easier
to identify the source of the problem.

Thanks for any help you can give.

Cheers
Graham

Change History (8)

comment:1 Changed 4 years ago by luke

Dear Graham,

Personally, I have had lots of problems compiling on the shared nodes, because they are shared. Sometimes you get one all to yourself and compilation is fine; other times it's overloaded with jobs and compilation times out.

I almost always compile on the login nodes now, and for the UKCA training course I advised all the students to do the same, precisely for this reason and previous problems that I've had. Please see the following page to learn how to do it.

http://www.ukca.ac.uk/wiki/index.php/UKCA_Training:_Logging_in_and_Setting_up#Manually_compiling_UM_jobs_on_ARCHER

(it should be noted that these instructions are for vn8.4 jobs, but in principle they should work (with appropriate modifications) for any UMUI-based job).

Thanks,
Luke

comment:2 Changed 4 years ago by grenville

Luke

This is OK for the occasional build, but login nodes are shared too. Make sure you're building in /home; /work is slow at handling lots of small files.

Grenville

comment:3 Changed 4 years ago by gmann

Hi Luke,

Thanks for pointing this out — that's very helpful and much appreciated.

Actually I realise maybe I should have cc'd you on the ticket as I did
for Sandip.

Or are you notified automatically about NCAS-CMS helpdesk tickets with
"ukca" specified in the keywords?

Anyway — the thing is that in this case I don't think the manual compile
would have fixed the problem because I did try running the job setting it
to start from pre-existing model-executables and rcf-executables for an
equivalent 2-day test job that I was able to compile successfully during
the day.

The thing is that when I ran from the pre-existing rcf and model
executables, the job then proceeded to run the reconfiguration
of the dump, and that then timed out too!

So maybe it was that the login nodes were just busy at that particular time?

Grenville, Luke: what should I have done in this case?

Is it possible to run the reconfiguration on the login nodes?
And even if it is possible is that allowed?

Cheers
Graham

comment:4 Changed 4 years ago by grenville

Graham

You can set the umui to run the model only — you can't run the reconfig on the login nodes; they don't support parallel codes.

Grenville

comment:5 Changed 4 years ago by gmann

Hi Grenville,
Yes I know you can set the UMUI to run the model only.
But that's not what I wanted to do.

We designed the experiments so that the initialisation was chosen
from another run, so that it had the chosen phase of the QBO
in the months after the eruption.

And we needed to do the reconfiguration to achieve this.

Well I suppose we could have changed the start date to match
and would then not have needed to run the reconfiguration.

But I didn't expect that to take that long...

Also — re: your reply about the reconfiguration — wouldn't
it also be possible to run a parallel job from the command line?
(As long as the target architecture allowed it.)

comment:6 Changed 4 years ago by luke

Hi Graham,

I'm emailed all the CMS tickets - I try to respond to all the UKCA ones, or others that I can answer.

The timing out of reconfiguration will be a different issue. It's worth trying it again.

Can you try building from scratch manually, then qsub the rcf and run steps? (You need to wait for the rcf to finish before qsub-ing the run step.)
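
One way to handle the wait between the rcf and run steps, rather than watching the queue by hand, is a PBS job dependency. The following is only a minimal sketch under stated assumptions: it assumes the scheduler's qsub accepts `-W depend=afterok:<jobid>` (standard in PBS), and that the rcf and run steps exist as separate submittable scripts. The function name and the script paths passed to it are hypothetical placeholders, not the names the UMUI generates.

```shell
#!/bin/bash
# Sketch only: chain the reconfiguration (rcf) and model run steps with a
# PBS job dependency, so the run step starts only if the rcf step succeeds.
# The two script paths are hypothetical placeholders supplied by the caller.

submit_rcf_then_run() {
    local rcf_script="$1"
    local run_script="$2"
    local rcf_id

    # qsub prints the identifier of the submitted job on stdout.
    # Declaring rcf_id separately keeps qsub's exit status visible here.
    rcf_id=$(qsub "$rcf_script") || return 1

    # afterok: hold the run step until the rcf job exits with status 0.
    qsub -W depend=afterok:"$rcf_id" "$run_script"
}
```

It could be called as `submit_rcf_then_run ./rcf_step.pbs ./run_step.pbs` (both file names illustrative). If the rcf step fails or times out, the dependent run job will not start and would need to be removed from the queue manually.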

L

comment:7 Changed 4 years ago by gmann

Hi Luke & Grenville,
Thanks.
But actually the jobs seem to have built OK today when
I submitted them late morning.

So it looks like it was a particular problem last night.

Grenville contacted the ARCHER helpdesk and they are looking
back at those particular jobs to see if they can work out
what the problem was.

Your suggestions may be helpful though in the future if I
encounter the problem again.

However last night I was finding the running of the rcf step was
also slow (when submitted from the UMUI at least).

So your suggestion above would not have got around the problem in this case.

Thanks anyway,

Cheers
Graham

comment:8 Changed 4 years ago by luke

  • Resolution set to wontfix
  • Status changed from new to closed