Opened 6 years ago

Closed 6 years ago

#1276 closed defect (answered)

Occasionally the model does "nothing" but use up time

Reported by: agt
Owned by: um_support
Component: ARCHER
Keywords: occasional stuck model
Cc:
Platform: ARCHER
UM Version: 8.5

Description

(NB: the Component drop-down menu needs updating from HECToR to ARCHER.)

Hi,

I've been successfully (too successfully!) running a derivative of Karthee's xjlej experiment (N512 GA6) on ARCHER.

My experimental setup is:
An ensemble of 1 May to 1 October runs. Each member uses the same start dump but with the year changed in each case (1 May 1992, 1 May 1993, etc.), so that a different year's SSTs are used for each member. I drive each member from the same compiled code and reconfiguration.

One such member did its reconfiguration over the weekend and then went on to start its run from 1 May 1996. But by the end of its 6 hours of requested time it hadn't even got as far as 11 May (the next dumping time). Normally I would expect it to get through at least 2.5 to 3 model months in this time.

See the leave file at /home/n02/n02/agt/umui_out/xjudl000.xjudl.d14103.t100808.leave (time stamp Apr 13 16:21), which doesn't yield any clues!

The attached screenshot shows the state of the model output directory after the model had stopped and nothing was running. (Time of screenshot ~8pm.) As you can see, it has started filling the various pp streams for May, but none of the dated ones reach as far as 11 May. The leave file suggests the model stopped only once 21000 s was exceeded, so this can potentially waste a lot of resource!

Having now changed from NRUN to CRUN, experiment xjudl has continued fine.

Any suggestions appreciated!

cheers,

Andy

Change History (6)

comment:1 Changed 6 years ago by agt

Screenshot too large to attach!

See /home/sws05agt/ScreenShot2014-04-13at20.06.36.png on Reading system.

cheers,

Andy

comment:2 Changed 6 years ago by grenville

Andy

I believe this is a problem with ARCHER: they run a disc diagnostic check on Sundays, which can interact badly with codes that do a lot of I/O.

We are in discussion with ARCHER about this. I shall forward your ticket to them to stress that their disc maintenance is adversely affecting users.

I shall ask for a refund on your behalf - no guarantees.

Grenville

comment:3 Changed 6 years ago by grenville

Andy

The proposed solution to this problem is effectively not to run on Sundays. That is not acceptable to us, but until we have a proper solution, it appears to be the only workaround.

Grenville
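
As a rough illustration of the workaround above, the check below flags a job whose requested wallclock window would include any part of a Sunday. It is only a sketch: the function name `overlaps_sunday` and the example start times are assumptions made for illustration, not part of any ARCHER or UM tooling, and the actual time of day of the disc check is not stated in this ticket, so the whole of Sunday is treated as risky.

```python
from datetime import datetime, timedelta

SUNDAY = 6  # datetime.weekday(): Monday is 0, Sunday is 6


def overlaps_sunday(start, wallclock_hours):
    """Return True if [start, start + wallclock_hours) includes any part of a Sunday.

    Hypothetical helper: 'start' is when the job is expected to begin
    running, 'wallclock_hours' is the requested run time.
    """
    end = start + timedelta(hours=wallclock_hours)
    # Treat the end as exclusive, so a job finishing at exactly
    # 00:00 on Sunday is not counted as running on Sunday.
    last_day = (end - timedelta(microseconds=1)).date()
    day = start.date()
    while day <= last_day:
        if day.weekday() == SUNDAY:
            return True
        day += timedelta(days=1)
    return False


if __name__ == "__main__":
    # Illustrative start time on 13 April 2014 (a Sunday), 6 hours requested.
    print(overlaps_sunday(datetime(2014, 4, 13, 10, 0), 6))
    # Same request started on the following Monday.
    print(overlaps_sunday(datetime(2014, 4, 14, 10, 0), 6))
```

A check like this only helps if the job actually starts when expected; time spent waiting in the queue can push a Saturday submission into Sunday.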

comment:4 Changed 6 years ago by luke

Hi Grenville,

Just to say that this happens to my runs a lot as well. It happened yesterday, for example.

See /home/n02/n02/luke/output/xjgpx054.xjgpx.d14103.t081751.leave, which was killed at 6pm last night.

Thanks,
Luke

comment:5 Changed 6 years ago by agt

Hi Grenville,

thanks for finding out,

cheers,

Andy

comment:6 Changed 6 years ago by grenville

  • Component changed from HECToR to ARCHER
  • Resolution set to answered
  • Status changed from new to closed