Opened 4 years ago

Closed 4 years ago

#1646 closed help (answered)

UM jobs running very slowly

Reported by: ggxmy
Owned by: um_support
Component: UM Model
Keywords: slow
Cc:
Platform: ARCHER
UM Version: 8.4

Description

Dear CMS,

I submitted an ensemble of 10 UM jobs (vn8.4+UKCA) on ARCHER, and 8 of these jobs ran at a reasonable speed (~3 hours/month), but two of them slowed down after simulating about 7 months. The job ids are tdyu.j and k.

The simulations started from December 2007 and seem to have run in the same way as the other jobs up to June 2008. After I resubmitted the ensemble, these two jobs started to slow down.

It took less than 3 hours to simulate June 2008, but ~4 hours for July and ~12 hours for August. The other jobs then completed, but these two jobs were terminated in the middle of September when they hit the 24-hour walltime limit. I resubmitted them, starting from September 2008, as an ensemble of two jobs. September then took ~7 hours. The jobs are currently simulating October and have taken more than 3 hours to simulate 15 days.
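For reference, the timings above can be expressed as slowdown factors relative to the ~3 hours/month baseline. This is only a rough sketch: the figures are the approximate ones quoted above, the October value assumes the rate over the first 15 days holds for a 30-day month, and the names `BASELINE_HOURS`, `hours_per_month` and `slowdown_factors` are illustrative, not part of the UM.

```python
# Approximate wall-clock hours per simulated month, from the timings quoted above.
BASELINE_HOURS = 3.0  # the "reasonable speed" of ~3 hours/month

hours_per_month = {
    "2008-06": 3.0,            # "less than 3 hours", so this is an upper bound
    "2008-07": 4.0,
    "2008-08": 12.0,
    "2008-09": 7.0,            # after resubmission as a two-job ensemble
    # Assumption: >3 hours for 15 days, extrapolated linearly to a 30-day month
    # (so this is a lower bound).
    "2008-10": 3.0 * 30 / 15,
}


def slowdown_factors(hours, baseline):
    """Return wall-clock time per month as a multiple of the baseline."""
    return {month: h / baseline for month, h in hours.items()}


factors = slowdown_factors(hours_per_month, BASELINE_HOURS)
for month, factor in factors.items():
    print(f"{month}: {factor:.1f}x baseline")
```

On these rough numbers the slowdown peaks in August and is still about twice the baseline in October.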

The current job name for the ensemble is tdyuk000 and job ID on the queuing system is 3131210.sdb.

I checked some of the output diagnostics and they looked OK.

These jobs produce a large volume of output, but so do the other eight jobs. They also write a large amount to the .leave file because I forgot to suppress the print statements. That's my fault and I will be more careful next time. But again, that is the same for the other jobs.

Can you see any problem in these jobs/runs?

Thanks,
Masaru

Change History (2)

comment:1 Changed 4 years ago by grenville

Masaru

ARCHER are running RAID system checks and replacing hardware to fix the problems they had with disc failures earlier in the year.

ARCHER have made several announcements about possible slowdowns and have published the RAID system check schedule.

The latest message is repeated here. Please make sure you are receiving these messages from ARCHER.

Dear Users

As a pro-active measure and a final 'tidy up' after the file system issues experienced
earlier in the summer, Cray have just started replacing a number of disks across
all file systems in ARCHER. This is being done in a carefully planned way to minimise
user impact by swapping out a small number of disks at a time and allowing the OST
to rebuild the disks using the same method used when a disk fails naturally. It
is anticipated that the majority of disk replacements and rebuilds will cause minimal
user impact. However there are two periods of 2 to 3 days, between 1 - 3 September
and 15-17 September when the disk rebuilds have a higher probability to impact on
read performance and you should take this into account when planning any work during
these periods. We are not anticipating that the system will be taken down during
these two periods and we would also like to stress that from past experience that
only jobs using significant IO are likely to be impacted.

Some unplanned rebuilds may fall outside the above dates. This is the system correcting
errors found during normal monitoring and is working as intended to maintain disk
integrity. Depending on the disk concerned, this may cause degraded read performance
which may unfortunately impact some users.

We will remind you 48 hours before the start of one of the planned higher impact
periods and at that point will confirm any minor changes in dates.

If you have any queries or require any help please don't hesitate to get in touch.

Best regards

The ARCHER Helpdesk Team
support@…

Grenville

comment:2 Changed 4 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed