Opened 7 years ago

Closed 6 years ago

#988 closed error (fixed)

Network timeout submitting jobs to MONSooN

Reported by: kipling
Owned by: ros
Component: MONSooN
Keywords:
Cc: aj.watling@…
Platform: MONSooN
UM Version: 7.3

Description

I'm currently having problems submitting jobs from PUMA to MONSooN.
The problem appears to be a very high latency establishing ssh connections from ibm02 back to PUMA (as done by the REMCOMMS script).

In the UMUI job submission window I see (after entering my passcode):

  Copying files to directory /projects/ukca/zkipli/build/xhxoy/umbase using rsync...
  See /projects/ukca/zkipli/build/xhxoy/umbase/ext.out for output
  Timed out, lander.monsoon-metoffice.co.uk not responding

  Tidying local directories...
  Job submission failed

and in the "GHUI errors and warnings" dialog:

  ERROR: Timed out, lander.monsoon-metoffice.co.uk not responding while attempting to access account zkipli on host  ibm02. Note that repeated failures may result in  expiry of password due to security procedures on some  machines. Check user id, hostname and password  for your account on the host machine.

while in my umbase/ext.out (output from REMCOMMS) on MONSooN I see:

  mkdir -p /projects/ukca/zkipli/build/xhxoy/umbase/cfg
  rsync -a --exclude=.* --delete-excluded --timeout=1800 --rsh=ssh -v kipling@puma.nerc.ac.uk:/home/kipling/um/um_extracts/xhxoy/umbase/cfg/bld.cfg /projects/ukca/zkipli/build/xhxoy/umbase/cfg

  ***************************** WARNING ******************************
  This is a private computer facility.  Access for any reason must be
  specifically authorized by the owner
  ********************************************************************


  receiving incremental file list

  sent 19 bytes  received 53 bytes  1.21 bytes/sec
  total size is 2107  speedup is 29.26
  mkdir -p /projects/ukca/zkipli/build/xhxoy/umbase/cfg
  rsync -a --exclude=.* --delete-excluded --timeout=1800 --rsh=ssh -v kipling@puma.nerc.ac.uk:/home/kipling/um/um_extracts/xhxoy/umbase/cfg/ext.cfg /projects/ukca/zkipli/build/xhxoy/umbase/cfg
  rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(549) [Receiver=3.0.9]

Manual testing shows that ssh takes a long time (almost a minute) to establish a connection from ibm02 back to PUMA, hence I suspect this may be the problem. I'm not sure whether the fault is at the MONSooN or PUMA end though.
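
For reference, a minimal sketch of how the connection delay could be measured from ibm02. This is an illustration only, not part of REMCOMMS: `time_command` and `ssh_probe_argv` are hypothetical helper names, and it assumes key-based login to PUMA is already set up so that ssh can run unattended.

```python
import subprocess
import time


def time_command(argv, give_up_after_s=120.0):
    """Run argv and return (elapsed_seconds, returncode).

    returncode is None if the command was abandoned after give_up_after_s.
    """
    start = time.monotonic()
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=give_up_after_s,
        )
        returncode = completed.returncode
    except subprocess.TimeoutExpired:
        returncode = None
    return time.monotonic() - start, returncode


def ssh_probe_argv(target):
    """Build an ssh command that logs in, runs `true`, and exits.

    BatchMode=yes makes ssh fail instead of prompting for a password or
    passphrase, so the measurement never waits on user input.
    """
    return ["ssh", "-o", "BatchMode=yes", target, "true"]
```

Calling `time_command(ssh_probe_argv("kipling@puma.nerc.ac.uk"))` on ibm02 would give the time taken to connect, run a no-op command and disconnect, together with the ssh exit status.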

Change History (22)

comment:1 Changed 7 years ago by willie

Hi Zak,

There have been occasional problems with submission to MONSooN timing out - we're investigating. Often the problem has been resolved simply by resubmitting. Try this and let us know how it goes.

Regards,

Willie

comment:2 Changed 7 years ago by kipling

Thanks — this repeated quite a number of times, but after many attempts it has now submitted successfully (and the latency in ssh-ing back to puma has gone).

comment:3 Changed 7 years ago by luke

Hi Willie,

I was wondering if there was any further news on this problem? I've seen this four times today and have been unable to submit so far.

Many thanks,

Luke

comment:4 Changed 7 years ago by kipling

Indeed, I've had a very low success rate submitting since yesterday afternoon.

Interestingly, it doesn't appear to be ssh or rsync itself which fails, but something else which times out too quickly and kills it (is there a separate submission timeout mechanism within the UMUI, FCM or one of the PUMA-side submit scripts perhaps?).

If I run the appropriate "REMCOMMS" script by hand on MONSooN then (provided I have the necessary SSH key loaded) everything copies across from PUMA fine and the job submits. (It's still slow, but no timeouts occur.)
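
The situation described here (the transfer itself not failing, but being killed by something that gives up sooner) can be reproduced in miniature. The sketch below is an illustration only and is not the UMUI, FCM or REMCOMMS code: `run_with_outer_timeout` is a hypothetical helper, and a sleeping Python child stands in for a slow rsync.

```python
import subprocess
import sys


def run_with_outer_timeout(argv, outer_timeout_s):
    """Run argv, terminating it if it outlives outer_timeout_s.

    Returns (status, returncode), where status is "finished" or
    "killed by outer timeout".
    """
    proc = subprocess.Popen(argv)
    try:
        returncode = proc.wait(timeout=outer_timeout_s)
        return ("finished", returncode)
    except subprocess.TimeoutExpired:
        # The child was still working; the supervisor simply lost patience.
        proc.terminate()  # sends SIGTERM on POSIX systems
        proc.wait()
        return ("killed by outer timeout", proc.returncode)


if __name__ == "__main__":
    # Stand-in for a slow but otherwise healthy transfer.
    slow_child = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_outer_timeout(slow_child, outer_timeout_s=0.5))
```

On POSIX the stand-in child dies from SIGTERM and its return code is the negative signal number. rsync instead catches the signal and exits with code 20, which is what the ext.out above shows.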

comment:5 Changed 7 years ago by luke

Interesting. Running REMCOMMS manually does work and the job is submitted. It doesn't seem to be that slow either.

comment:6 Changed 7 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi,

The Met Office are still trying to track this one down. I believe they made a change on Wednesday which we hoped would have a positive impact, but it doesn't seem to have. 90% of the time the problem with submission does indeed coincide with problems ssh'ing from MONSooN to PUMA. I have a script that's been running over the past week checking the ssh timings out of MONSooN. There is a pattern, which I have passed on to the MONSooN team. Usually ssh out takes around 1 or 2 seconds; when the submission problem happens it's of the order of 60+ seconds. I've already modified the UMUI to set its timeout to 10 minutes, which hasn't helped, so there is still something else causing the timeout. Unfortunately that's all I can say right now; obviously this will continue to be looked into as a priority.

Regards,
Ros.
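
The timing script and its log are not attached to this ticket. As a rough sketch of how such a log could be summarised, the code below groups consecutive slow samples into episodes. It assumes each sample is a (time, seconds) pair in time order. `slow_episodes` is a hypothetical name, the 30 s threshold is an assumption picked between the 1-2 s and 60+ s figures quoted above, and the sample values in the demo are invented.

```python
from datetime import datetime


def slow_episodes(samples, slow_threshold_s=30.0):
    """Group consecutive slow samples into (start, end, worst_seconds) tuples.

    samples: iterable of (datetime, seconds) pairs in time order.
    """
    episodes = []
    current = None  # [start, end, worst_seconds] of the episode in progress
    for when, seconds in samples:
        if seconds >= slow_threshold_s:
            if current is None:
                current = [when, when, seconds]
            else:
                current[1] = when
                current[2] = max(current[2], seconds)
        elif current is not None:
            # A fast sample closes the episode in progress.
            episodes.append(tuple(current))
            current = None
    if current is not None:
        episodes.append(tuple(current))
    return episodes


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    demo = [
        (datetime(2000, 1, 1, 11, 0), 1.4),
        (datetime(2000, 1, 1, 11, 5), 63.0),
        (datetime(2000, 1, 1, 11, 10), 71.5),
        (datetime(2000, 1, 1, 11, 15), 1.9),
    ]
    for start, end, worst in slow_episodes(demo):
        print(f"slow from {start:%H:%M} to {end:%H:%M}, worst {worst:.1f} s")
```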

comment:7 Changed 7 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Hi Zak,

The Met Office put a fix in a couple of weeks ago which we believe has fixed the timeout problems with the MONSooN HPC/lander. If you continue to experience this problem, please do let us know. I will close this ticket now; however, you can re-open it should the problem persist.

Regards,
Ros

comment:8 Changed 7 years ago by kipling

  • Resolution fixed deleted
  • Status changed from closed to reopened

I'm getting this problem again (yesterday and today), although it doesn't seem to be as persistent as before — i.e. most submissions go through but occasionally one still fails.

comment:9 Changed 7 years ago by ros

Hi Zak,

When you encounter the timeout problem again, can you please send me the precise times at which it occurs? This will enable us to look in the logs and try to figure out what was happening at the time.

If you know the exact time you encountered the problem yesterday and today please let me know.

Thanks for letting us know.

Regards,
Ros.

comment:10 Changed 7 years ago by kipling

Thanks. This morning's instance happened just after 11:31:43 (that's the timestamp on the REMCOMMS file on MONSooN; it was the umbase rsync which failed) for job xibce. I'm afraid I don't have the times for yesterday, but I'll keep a note if it happens again.
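
A small sketch of how a file's modification timestamp can be read programmatically, for anyone wanting to record failure times in the same way. `file_timestamp` is a hypothetical helper, and the path passed to it would be wherever the REMCOMMS file for the job lives (not stated in this ticket).

```python
import os
from datetime import datetime


def file_timestamp(path):
    """Return the file's last-modification time as 'YYYY-MM-DD HH:MM:SS'.

    The time is given in the local time zone of the machine running this.
    """
    modified = datetime.fromtimestamp(os.path.getmtime(path))
    return modified.strftime("%Y-%m-%d %H:%M:%S")
```

Including the date as well as the time makes it easier to match a report against system logs later.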

comment:11 Changed 7 years ago by kipling

And again today (14 Feb), three times in a row:

xigbb 15:20:41
xigbc 15:21:13
xigbd 15:22:58

ssh from ibm02 to puma looks fine, although the rsync step still seems to be quite slow even when I run REMCOMMS by hand.

comment:12 Changed 7 years ago by kipling

And again at 11:38:26 this morning (20 Feb).

comment:13 Changed 7 years ago by kipling

Scratch that last one — it appears it failed because the UKCA /projects quota on MONSooN is exhausted.

comment:14 Changed 7 years ago by kipling

And again at 15:21:55 and 15:23:49 this afternoon.

comment:15 Changed 7 years ago by kipling

And again today, three times in a row:

11:33:21
11:35:31
11:38:40

comment:16 Changed 7 years ago by kipling

And again today, at 15:48:19.

comment:17 Changed 7 years ago by kipling

And four more, at 11:28:36, 11:30:04, 12:33:56 and 12:35:10 today.

comment:18 Changed 7 years ago by kipling

And again at 16:53:20 this afternoon.

comment:19 Changed 6 years ago by kipling

And again at 10:59:04 this morning.

comment:20 Changed 6 years ago by ros

  • Cc aj.watling@… added

Hi Zak,

Thanks for letting us know.

Regards,
Ros.

P.S. I've added AJ to the cc: for this ticket so he gets alerted as soon as this ticket is updated.

comment:21 Changed 6 years ago by kipling

And again, at 14:15:12 today.

comment:22 Changed 6 years ago by ros

  • Resolution set to fixed
  • Status changed from reopened to closed

We are unaware of any further occurrences of this problem. I will now close this ticket as fixed. If you encounter this problem again please let us know.

Regards,
Ros.
