Opened 6 months ago

Closed 6 months ago

#3355 closed help (fixed)

Unable to run JULES on JASMIN

Reported by: mtodt Owned by: pmcguire
Component: JULES Keywords: JULES rose suite fail JASMIN, bsub
Cc: Platform: JASMIN
UM Version:

Description

Hi

Since yesterday evening, I can't run JULES suites on JASMIN anymore. The same suites worked previously, but yesterday evening they failed well before exceeding the wall time, and when I try to submit them again I get the following in job.out:

Parsing application description...
Identifying hosts...
Spawning processes...
(gnome-ssh-askpass:27684): Gtk-WARNING **: cannot open display: jasmin-cylc.ceda.ac.uk:13.0
(gnome-ssh-askpass:27683): Gtk-WARNING **: cannot open display: jasmin-cylc.ceda.ac.uk:13.0
Host key verification failed.
Host key verification failed.
(gnome-ssh-askpass:27685): Gtk-WARNING **: cannot open display: jasmin-cylc.ceda.ac.uk:13.0
(gnome-ssh-askpass:27686): Gtk-WARNING **: cannot open display: jasmin-cylc.ceda.ac.uk:13.0
(gnome-ssh-askpass:27687): Gtk-WARNING **: cannot open display: jasmin-cylc.ceda.ac.uk:13.0
(gnome-ssh-askpass:27688): Gtk-WARNING **: cannot open display: jasmin-cylc.ceda.ac.uk:13.0

In job.err, I only see:

mpirun: Warning one or more remote shell commands exited with non-zero status, which may indicate a remote access problem.
[FAIL] rose-jules-run <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=255

I've also noticed that the .bsub scripts I use for post-processing are now taking about 4 hours instead of 1 hour; this started at roughly the time my suites failed yesterday evening.

Many thanks in advance for your help!

Cheers
Markus

Change History (12)

comment:1 Changed 6 months ago by pmcguire

Hi Markus
What is the ID number of the JULES suite that is failing on jasmin-cylc?

What is the path to the failing .bsub script?
Patrick

comment:2 Changed 6 months ago by pmcguire

  • Keywords JASMIN, bsub added; JASMIN removed
  • Status changed from new to accepted

comment:3 Changed 6 months ago by mtodt

Hi Patrick

Thanks for picking this up! One of the suites is u-bv463. The bsub scripts are working, just more slowly than in the past few days, although my last one finished in about 1.5 hours, so faster than yesterday evening or last night.

Cheers
Markus

comment:4 Changed 6 months ago by pmcguire

Hi Markus:

You're more than welcome! It's my responsibility to handle all tickets with the component set to JULES or Land Surface Modelling here on the CMS Helpdesk. If I can't figure something out myself, I can ask other people for help or guidance.

I have started running a copy of your u-bv463 suite. It finished the fcm_make compile stage and started running the jules task, and it hits the same problem that you describe.

I have also tried restarting the failed jules task dynamically with different settings so that the queueing time is shorter: 0:30 (30 minutes) of wall clock time instead of 48:00 hours. Then I did a rose suite-run --reload and retriggered the failed jules task in the cylc GUI. This speeds things up a bit, and I can also skip the already-succeeded fcm_make task.

But to speed up the queueing even further, I wanted to use 2 processors instead of 32. To do that, I switched from reading a start dump file to spinning up from an idealized state; I'm not sure whether it would work without this switch. Regardless, with this switch and 2 processors it doesn't need nearly as long to queue. It also crashes much more quickly, but with a different error. Maybe you can use this queue-speedup technique (or something similar) to iterate faster and sort out the first error that we're both seeing?
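
For reference, here's a minimal sketch of the kind of settings change I made. The directive names assume the LSF scheduler on JASMIN, and the exact file and section names may differ in your suite:

    # in the copied suite's suite.rc (section and task names are indicative):
    [runtime]
        [[jules]]
            [[[directives]]]
                -W = 0:30    # wall clock limit, reduced from 48:00
                -n = 2       # number of processors, reduced from 32

    # pick up the change without restarting the suite, then retrigger
    # the failed task (or retrigger it from the cylc GUI instead):
    rose suite-run --reload
    cylc trigger <suite-id> 'jules.*'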


You didn't say which bsub scripts you're having problems with, but I saw some bsub log files from August 18 in the directory of your copy of the u-bv463 suite. Some notes, in case these are the log files for the bsub scripts in question:

1) As you are probably aware, sometimes the nopw GWS hard drives are slow, or can't handle lots of small files quickly.

2) It appears that you are using mv in the bsub script to move files to the nopw GWS. You might consider using:

    mkdir destdir
    bcopy -s srcdir -d destdir

This will use multiple processors to do the copying, so it can be faster. Sometimes you have to wait in the queue for this, though.

3) In my personal experience, using mv across file systems (from one drive to another) can be risky, with files getting lost or corrupted, especially big files. You might consider using cp -p instead, and then rm only after the copy has succeeded.
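
A minimal sketch of that safer pattern, with hypothetical placeholder paths:

    # copy first, preserving mode and timestamps; remove the source
    # only if the copy exited successfully
    cp -p /work/scratch/myuser/output.nc /gws/nopw/j04/mygws/myuser/ \
        && rm /work/scratch/myuser/output.nc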

Patrick

comment:5 Changed 6 months ago by pmcguire

Hi Markus
I sent you comment:4 above.

But with regards to the ssh issue that we both encounter, maybe see this: https://github.com/cedadev/ceda-notebooks/issues/12

Also, I should note that recently I could not use ssh -AX to access some or all computers from my Mac (possibly only when on the VPN). I had to use ssh -AY instead, which has lower security. Maybe you had to do the same thing? This might be related to the GitHub issue by Ag Stephens linked above.
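
For concreteness, the two variants look like this (username as a placeholder); -Y requests trusted X11 forwarding, which bypasses the X11 security-extension restrictions that -X imposes:

    ssh -AX myuser@jasmin-cylc.ceda.ac.uk    # agent + untrusted X11 forwarding
    ssh -AY myuser@jasmin-cylc.ceda.ac.uk    # agent + trusted X11 forwarding (lower security)
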
Patrick

comment:6 Changed 6 months ago by mtodt

Hi Patrick

Thanks a lot for your help! I'll have a look at the link you sent me. I don't quite understand the speeding-up procedure, though, since I never have to queue for more than a minute or so.

I'm still puzzled by the suites not working anymore. As I said at the top, I hadn't changed anything important, only the dump file. I don't understand why the suites stopped working, or why, if something has changed on JASMIN, neither of us has heard about it.

I submitted my suites again yesterday evening, and they failed after about 15 minutes with the same error message as before. Usually they start producing output after 1 or 2 minutes, but this time there was no output at all. So they fail before producing any output, yet take about 15 minutes to get there. I just don't understand what's going on here.

The bsub scripts I mentioned were running cdo commands. They've been running almost as fast as before since yesterday afternoon, so they're not really an issue anymore. I mentioned them at the start because I was wondering whether something was being done on or to JASMIN (maintenance work, for example) that would cause both the scripts slowing down and the suites crashing.

Cheers
Markus

comment:7 Changed 6 months ago by pmcguire

Hi Markus:
When I was running your suite with 48:00 hours of wall clock time and 32 processors, I sometimes had to queue for 15-30 minutes before the run started. That's why I was talking about speeding up the queueing.

Do you use ssh -AX or ssh -AY?

Also, I noticed that your suite wouldn't create any log file output unless the directory you were redirecting the log files to already existed.
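
A minimal illustration, with hypothetical paths; mkdir -p creates the directory if needed and is harmless if it already exists:

    # ensure the log directory exists before redirecting output into it
    mkdir -p /home/users/myuser/jules_logs
    my_jules_command > /home/users/myuser/jules_logs/jules.log 2>&1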

Patrick

comment:8 Changed 6 months ago by mtodt

Hi Patrick

That's weird; as I said, I never have to wait in the queue for long.

I use ssh -AY. I can try ssh -AX, but then I get this when logging in:

  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/apps/contrib/metomi/rose-2019.01.0/lib/python/rosie/ws_client_cli.py", line 25, in <module>
    from rosie.ws_client import (
  File "/apps/contrib/metomi/rose-2019.01.0/lib/python/rosie/ws_client.py", line 36, in <module>
    from rosie.ws_client_auth import RosieWSClientAuthManager
  File "/apps/contrib/metomi/rose-2019.01.0/lib/python/rosie/ws_client_auth.py", line 38, in <module>
    import gtk
  File "/usr/lib64/python2.6/site-packages/gtk-2.0/gtk/__init__.py", line 64, in <module>
    _init()
  File "/usr/lib64/python2.6/site-packages/gtk-2.0/gtk/__init__.py", line 52, in _init
    _gtk.init_check()
RuntimeError: could not open display
Error: Unable to access Rosie with given password

Cheers
Markus

comment:9 Changed 6 months ago by mtodt

Hi Patrick

It seems like the suite runs successfully when I log in via ssh -AX, but then I can't open the control window with rose sgc; for that I have to log in via ssh -AY. So I suppose I have to log in with -AX, submit the run, log out, and log in with -AY if I want to control the run. That's fine with me, but I imagine it's not how it's supposed to work.
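
For the record, the workaround amounts to something like this (username as a placeholder, submitting from the suite directory):

    ssh -AX myuser@jasmin-cylc.ceda.ac.uk    # session 1: submit the suite
    rose suite-run
    exit

    ssh -AY myuser@jasmin-cylc.ceda.ac.uk    # session 2: monitor/control it
    rose sgc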

Anyway, thanks a lot for your help!

Cheers
Markus

Last edited 6 months ago by mtodt

comment:10 Changed 6 months ago by mtodt

Addendum:

I just submitted another suite while I was logged in via -AY to check on the first suite I had submitted, and that second suite seems to run successfully as well. Maybe I just had to log in via -AX once to reset something? I'm glad it seems to work again now, but I'm still puzzled, maybe even more than before.

Cheers
Markus

comment:11 Changed 6 months ago by pmcguire

Thanks for your feedback, Markus.
It looks like it's working now (as of August 28).
So I will close the ticket.
If you have new issues, don't hesitate to open a new ticket.
Patrick

comment:12 Changed 6 months ago by pmcguire

  • Resolution set to fixed
  • Status changed from accepted to closed