Opened 10 years ago

Closed 10 years ago

#578 closed help (fixed)

Timeout of extract when submitting to HECToR phase2a and phase2b

Reported by: luke
Owned by: ros
Component: HECToR
Keywords: FCM,PUMA,HECToR
Cc:
Platform:
UM Version: 7.3

Description

Hello,

Several people in Cambridge have been having problems with timeouts on FCM extract when sending UM (UKCA) jobs to compile and run on HECToR, on both the phase2a and phase2b machines. This seems to have started after the name-change yesterday. I'm not sure if this is a HECToR, a PUMA or an FCM issue though!

My jobs affected are, e.g. xftie, xftia.

Thanks,
Luke

Change History (3)

comment:1 Changed 10 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Luke,

This is a problem that unfortunately occurs intermittently. When the load on the HECToR login nodes is high, the rsync of the umbase directory from PUMA to HECToR takes a lot longer than usual (it should take around 10s). I monitored this a couple of months ago, saw a very wide range of times, and set the timeout threshold to 10 mins, which allowed all but a couple of extreme cases. Whilst I can raise that limit, I have to be careful not to go too far, as it would mean waiting around longer when a genuine error has occurred.
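As an illustration of the trade-off described above, here is a minimal Python sketch of running a command under a timeout threshold and recording how long it took. It is not how FCM or the UMUI implements the extract timeout. The helper name `run_with_timeout` is hypothetical, and a short sleeping child process stands in for the real rsync transfer.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd, giving up after timeout_s seconds.

    Returns (finished, elapsed_seconds, returncode).
    returncode is None when the command was stopped by the timeout.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
        finished, returncode = True, result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        finished, returncode = False, None
    return finished, time.monotonic() - start, returncode


if __name__ == "__main__":
    # Stand-in for the rsync transfer (assumption): a child that sleeps briefly.
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(0.2)"]
    # 600 s mirrors the 10 min threshold mentioned above.
    print(run_with_timeout(slow_cmd, timeout_s=600))
```

A larger `timeout_s` lets more slow-but-healthy transfers through, but a transfer that has genuinely hung is only reported after the full threshold has elapsed.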

If you try submitting again now it should be OK (it's taking between 26s and 7 mins for me at the moment). At present there is nothing we can do to improve the rsync speed. I have talked with HECToR about this and am looking into the possibility of using something other than rsync.

Regards
Ros.

comment:2 Changed 10 years ago by luke

Hi Ros,

The jobs seem to be going through fine now, for the most part. Since it was happening directly after the name-change I assumed it was related.

Thanks for your help,
Luke

comment:3 Changed 10 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

This ticket is now being closed. A suggested workaround is to compile on /home - details of how to do this are copied below.


(Email sent to puma list on 16.02.2011)

Note: This information applies only to UM versions that use FCM (i.e. UM versions 6.6.3 and 7.x).

The performance issues with the esLustre (/work) filesystem are being actively pursued with HECToR. In the meantime, if you experience slow submission or compilation, we suggest compiling the job on the /home filesystem, which is not affected by filesystem load in the same way as Lustre is.

To compile on /home you will need to make the following change in the UMUI window FCM Configuration -> FCM Extract and Build Directories:

Change Target machine root extract directory (UM_ROUTDIR) to point to a location in your /home directory.

Your job will then compile in /home. The location of the executable, scripts and run output files will remain unchanged; they will be placed in your specified /work directories as usual.
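Before entering a new value for UM_ROUTDIR, it can help to confirm that the path really sits under /home rather than /work. The following is a small sketch of such a check. The helper `is_under` and the example paths are hypothetical and should be replaced with your own directories. The check is purely lexical and does not follow symbolic links.

```python
from pathlib import PurePosixPath


def is_under(path, root):
    """Return True if path is root itself or lies somewhere beneath it."""
    p, r = PurePosixPath(path), PurePosixPath(root)
    return p == r or r in p.parents


# Hypothetical candidate values for UM_ROUTDIR; substitute your own paths.
candidates = ["/home/example/um_extract", "/work/example/um_extract"]
for c in candidates:
    print(c, "-> on /home" if is_under(c, "/home") else "-> not on /home")
```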
