#1739 closed help (answered)

Problems with archiving on ARCHER

Reported by: gmann
Owned by: um_support
Priority: normal
Component: UM Model
Keywords: archiving
Cc: j.cole@…, nla27@…, mohit.dalvi@…, m.yoshioka@…, eelrm@…
Platform: ARCHER
UM Version: 8.4
Description

At 9.59am this morning (Fri 20th Nov) I emailed Luke A and Jeff C (with cc: to Mohit Dalvi) to raise a query about problems I was having with automatic archiving.

In a different email Jeff advised that I should not really be contacting him directly, and should instead go via the NCAS-CMS helpdesk ticket-based system (so that queries can be archived and the outcome/issue shared with other users).

That makes absolute sense to me — sorry for contacting them directly.

I've put the email replies to my query below and am including them in this ticket.

In short I was using a version of the vn8.4 hector_monsoon_archiving branch
as used by Masaru Yoshioka at Leeds Univ (cc'd for info):

fcm:um_br/dev/ggxmy/vn8.4_hector_monsoon_archiving_MY/src

Lauren Marshall (PhD student at Leeds, cc'd for info) and I have been having problems with qsserver failures ("T qsserver failure") when using the standard archiving branch from Luke:

fcm:um_br/dev/luke/vn8.4_hector_monsoon_archiving_ff2pp/src

Both Lauren and I had reported those previous problems in separate NCAS-CMS helpdesk tickets:

http://cms.ncas.ac.uk/ticket/1656

and

http://cms.ncas.ac.uk/ticket/1722

In Lauren's ticket she tried various things suggested by the NCAS-CMS helpdesk team, but they did not fix the problem.

After discussing it with other people at Leeds, Masaru Yoshioka explained that he had switched to a different archiving branch from Luke's. I did not know what the difference between the branches was, but Masaru said it seemed to be working OK for him.
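
For reference, one way to see what each archiving branch actually changes relative to the vn8.4 trunk is FCM's branch-diff command; the lines below are a sketch, and the exact invocation may differ on your setup:

fcm branch-diff fcm:um_br/dev/luke/vn8.4_hector_monsoon_archiving_ff2pp/src
fcm branch-diff fcm:um_br/dev/ggxmy/vn8.4_hector_monsoon_archiving_MY/src

Each command lists a branch's changes against the trunk revision it was created from.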

So it sounds like it's a tricky one.

Based on Masaru's experience I decided to switch over to using
Masaru's branch (see above) rather than the suggested one from Luke.

Masaru had chosen to archive to a sub-directory of his /work rather than to the RDF disk, because he'd encountered issues when running a very large number of jobs at the same time (Grenville had advised archiving to /work rather than to the RDF as it places less strain on the system).

I was switching to Masaru's branch at about the time the RDF went down, and that, together with Masaru's decision to do this anyway, led me to also try getting the branch to archive to a directory off my /work rather than to the RDF.

But this is not working for me for some reason.

So I decided to email Luke and Jeff Cole (with cc: to Mohit) to seek their views directly.

Jeff has advised me to switch back to using Luke's branch, and also pointed out that Luke now has a "v2" of this archiving branch, though Luke should advise whether to use that or not.

Luke has advised sticking with the first one (presumably the v2 is still in development/testing).

So this evening I'm going to resubmit the job, going back to the original archiving branch. Maybe the original failures that Lauren and I encountered have been resolved now after the repairs to the RDF?

Anyway — we shall see — I'll submit the jobs and hopefully it
will all work nicely.

Finally, I also noticed just now that there is another ticket reporting this problem from about 6 months ago; there's no reply apparent there (to my eyes at least):

http://cms.ncas.ac.uk/ticket/1584

I've put the email replies from Luke and Jeff below for info.

Cheers
Graham


From: Dr N.L. Abraham nla27@… On Behalf Of Luke Abraham
Sent: 20 November 2015 16:40
To: Jeff Cole
Cc: Graham Mann; Mohit Dalvi
Subject: Re: Failing to archive to /work — memory error (due to too many output streams?)

Hi All,

Don’t use the v2, use the original.

Thanks,
Luke

On 20 Nov 2015, at 16:28, Jeff Cole <j.cole@…> wrote:

Hi Graham

> I'm having a problem with archiving on ARCHER.
>
> The job is running OK and proceeding happily.
>
> But the automatic archiving is failing to archive the different output streams I've added.
>
> The scripts are still deleting superseded climate-means etc. (when requested).
>
> But ff2pp (or whatever script does the conversion to pp; conv2pp?) is failing to do so.
>
> As a consequence the files are being deleted as the job runs and there's nothing in the directory I've told the job to archive across to.
>
> Please can you have a look at my xmbcy job for example.
>
> The archiving to /nerc was working fine for me previously.
>
> So I think it must be something to do with the large number of special output streams I've added?
>
> Perhaps the files the ff2pp/conv2pp script is trying to convert have too much data in them?
>
> The .leave file for that is at:
>
> /work/n02/n02/gmann/UM_output_Files_11Nov2015_to_20Nov2015/xmbcy000.xmbcy.d15324.t014026.leave
>
> I know this is a very big file; I added some debug output to this run, which is why it is unusually large.
>
> In fact this led to my disk quota being exceeded last night and the job crashed (it would otherwise have continued on happily).
>
> So you see the problem I would like you to look at is to do with the archiving (not the crash, which I know is due to the disk quota).
>
> If you open that file with "less" you can navigate through it like in vi (even though it's a giant file).
>
> See it says "Memory fault" and is dumping a core.
>
> I didn't see this error message before in my xmbcz job, but I guessed it might be a memory issue because it was generating a core file.
>
> Please can you help me with this; it's quite urgent as these runs are needed for the VolMIP consensus forcing experiment for Tambora (see xmbcs).
>
> Many thanks for your help
>
> Cheers
>
> Graham
>
> /work/n02/n02/hum/vn8.4/cce/utils/ieee: line 270: 16403: Memory fault(coredump)
>
> /work/n02/n02/gmann/xmbcy/bin/qshector_arch: line 62: 16400: Memory fault
>
> /work/n02/n02/gmann/xmbcy/bin/qshector_arch[63]: /work/n02/n02/hum/bin/convpp: not found [No such file or directory]

Try using

fcm:um_br/dev/luke/vn8.4_hector_monsoon_archiving_ff2pp/src

instead of

fcm:um_br/dev/ggxmy/vn8.4_hector_monsoon_archiving_MY/src

This uses ff2pp instead of ieee and convpp.
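
Schematically, the two conversion routes inside the archiving script differ as below. This is an illustrative sketch, not the actual code in qshector_arch; the options are assumptions and $ff/$dest are placeholder names:

# Two-step route (Masaru's branch): fieldsfile -> 32-bit IEEE -> pp
ieee -32e $ff $ff.ieee         # flags are an assumption; this is the step that memory-faulted
convpp $ff.ieee $dest/$ff.pp   # convpp itself was reported "not found" on ARCHER

# Single-step route (Luke's branch): fieldsfile -> pp directly
ff2pp $ff $dest/$ff.pp

The .leave extract above shows both failure modes for the two-step route: a memory fault in ieee and a missing /work/n02/n02/hum/bin/convpp, so that route could not have worked in any case.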

Luke also has

fcm:um_br/dev/luke/vn8.4_hector_monsoon_archiving_ff2pp_v2/src

But he will have to say whether that's a better branch to use or not.

Jeff.

*
Dr N. Luke Abraham
Senior Research Associate
National Centre for Atmospheric Science
e-mail: luke.abraham@…
*
Centre for Atmospheric Science,
Department of Chemistry,
University of Cambridge,
Lensfield Road,
Cambridge, CB2 1EW, UK.
*
phone : +44 (0)1223 7 48899
fax : +44 (0)1223 763 823
http://www.ch.cam.ac.uk/person/nla27
*

Change History (2)

comment:1 Changed 21 months ago by grenville

Graham

Are you still experiencing archiving problems? If so, please send me a jobid.

NOTE
"(Grenville had advised archiving to /work rather than to the RDF as it places less strain on the system)"

My advice appears to have been misunderstood: there is absolutely no point archiving to /work.

Grenville

comment:2 Changed 19 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed
  • UM Version changed from <select version> to 8.4