Opened 7 years ago

Closed 7 years ago

#932 closed help (fixed)

Timed out when submitting jobs to Monsoon

Reported by: cenright Owned by: ros
Component: MONSooN Keywords: rsync time out
Cc: Platform: MONSooN
UM Version:

Description

Hi,

When submitting jobs to Monsoon from puma I am getting the following error:

"Renaming SUBMIT…
Changing SUBMIT permissions…
Running SUBMIT script…
Directory /scratch/localtemp/cenrig.5964032 created

Your job directory on host ibm02 is: /home/cenrig/umui_runs/xhpqa-283084050

Total PEs : 128
NOTE: You are requesting the use of 4 node(s) on the IBM

ERROR: Timed out, puma.nerc.ac.uk not responding while attempting to access account cenrig on host ibm02. Note that repeated failures may result in expiry of password due to security procedures on some machines. Check user id, hostname and password for your account on the host machine."

This started happening mid-day yesterday and since then only one of maybe half a dozen attempts has succeeded.

I am working from home so it is possible that a slow link at my end is causing the problem - though it is difficult to see why this should be.

I enter my Passcode (and when necessary my puma password) without trouble and from the above appear to have got through to ibm02 successfully.
I can also login to lander and thence ibm02 without trouble.

Clare (cenrig on monsoon, cenright on puma)

Change History (3)

comment:1 Changed 7 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Clare,

This is being investigated. Unfortunately, due to its intermittent nature, it's proving rather tricky to track down, as I can't reproduce it on demand.

I'll let you know when we have further information.

On the very odd occasion I've encountered this when submitting a job, I found that an immediate resubmission goes through fine.

Regards,
Ros.

comment:2 Changed 7 years ago by ros

  • Keywords rsync time out added
  • Platform set to MONSooN
  • UM Version <select version> deleted

Hi Clare,

Are you still encountering the problem with timing out when submitting jobs to MONSooN?

We've not found the cause of this yet. However, I'm not aware of it having occurred recently, and I'm wondering whether the MONSooN maintenance or the PUMA reboot has fixed the problem.

Cheers,
Ros.

comment:3 Changed 7 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Update:

The Met Office put a fix in a couple of weeks ago which we believe has resolved the time-out problems with the MONSooN HPC/lander. If you continue to experience this problem, please do let us know. I will close this ticket now; however, you can re-open it should the problem persist.
