Opened 10 years ago
Closed 10 years ago
#578 closed help (fixed)
Timeout of extract when submitting to HECToR phase2a and phase2b
| Reported by: | luke | Owned by: | ros |
|---|---|---|---|
| Component: | HECToR | Keywords: | FCM,PUMA,HECToR |
| Cc: | | Platform: | |
| UM Version: | 7.3 | | |
Description
Hello,
Several people in Cambridge have been having problems with timeouts on FCM extract when sending UM (UKCA) jobs to compile and run on HECToR, on both the phase2a and phase2b machines. This seems to have started after the name change yesterday. I'm not sure whether this is a HECToR, PUMA or FCM issue, though!
My jobs affected are, e.g. xftie, xftia.
Thanks,
Luke
Change History (3)
comment:1 Changed 10 years ago by ros
- Owner changed from um_support to ros
- Status changed from new to accepted
comment:2 Changed 10 years ago by luke
Hi Ros,
The jobs seem to be going through fine now, for the most part. Since it was happening directly after the name-change I assumed it was related.
Thanks for your help,
Luke
comment:3 Changed 10 years ago by ros
- Resolution set to fixed
- Status changed from accepted to closed
This ticket is now being closed. A suggested workaround is to compile on /home - details of how to do this are copied below.
(Email sent to puma list on 16.02.2011)
Note: This information applies only to UM versions that use FCM (i.e. UM versions 6.6.3 and 7.x).
The performance issues with the esLustre (/work) filesystem are being actively pursued with HECToR. In the meantime, if you experience slow submission or compilation, we suggest compiling the job on the /home filesystem, which is not affected by filesystem load in the same way as Lustre is.
To compile on /home you will need to make the following change in the UMUI window FCM Configuration -> FCM Extract and Build Directories:
Change Target machine root extract directory (UM_ROUTDIR) to point to a location in your /home directory.
Your job will then compile in /home. The location of the executable, scripts and run output files will remain unchanged; they will be placed in your specified /work directories as usual.
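As a quick sanity check before submitting, the workaround above can be sketched as a small test that a chosen UM_ROUTDIR value really sits under /home rather than /work. This is only an illustration: the function names and the example paths are hypothetical and are not part of the UMUI or FCM.

```python
from pathlib import PurePosixPath


def filesystem_of(path: str) -> str:
    """Return the top-level directory name of an absolute POSIX path."""
    parts = PurePosixPath(path).parts
    # An absolute path such as /home/user splits into ('/', 'home', 'user').
    if len(parts) < 2 or parts[0] != "/":
        raise ValueError(f"expected an absolute path, got {path!r}")
    return parts[1]


def routdir_is_on_home(um_routdir: str) -> bool:
    """True if the extract directory is under /home (hypothetical helper)."""
    return filesystem_of(um_routdir) == "home"


if __name__ == "__main__":
    # Both paths are made-up examples, not real HECToR locations.
    for candidate in ("/home/example/um_extract", "/work/example/um_extract"):
        print(candidate, "->", "OK" if routdir_is_on_home(candidate) else "still on /work")
```

The check only looks at the path string; it does not confirm that the directory exists or is writable on HECToR.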
Hi Luke,
This is a problem that unfortunately occurs intermittently. When the load on the HECToR login nodes is high, the rsync of the umbase directory from PUMA to HECToR takes a lot longer than usual (it should take around 10 seconds). I monitored this a couple of months ago, saw a very wide range of times, and set the timeout threshold to 10 minutes, which covered all but a couple of extreme cases. Whilst I can raise that limit, I have to be careful not to go too far, as it would mean waiting longer whenever a legitimate error occurs.
If you try submitting again now it should be OK (it's taking between 26 seconds and 7 minutes for me now). At present there is nothing we can do to improve the rsync speed. I have talked with HECToR about this and am looking into the possibility of using something other than rsync.
Regards
Ros.
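
The timeout trade-off described in this reply can be illustrated with a minimal sketch that runs a command under a wall-clock limit and tells a timeout apart from a genuine failure. This is not the code PUMA uses for the extract; the function name and the status strings are assumptions made for the example, and the actual rsync command is not shown in this ticket.

```python
import subprocess
import sys


def run_with_limit(cmd, limit_s):
    """Run cmd, giving up after limit_s seconds.

    Returns "ok", "error" (non-zero exit) or "timeout".
    """
    try:
        result = subprocess.run(cmd, timeout=limit_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return "timeout"
    return "ok" if result.returncode == 0 else "error"


if __name__ == "__main__":
    # Stand-in for a slow transfer: a child process that sleeps for 5 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_limit(slow, 0.5))  # limit is shorter than the sleep
    print(run_with_limit(slow, 30))   # limit is comfortably longer
```

A larger limit tolerates slow transfers on a busy login node, but a command that has genuinely hung is only reported after the full limit has elapsed, which is the cost mentioned above.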