Opened 4 years ago

Closed 4 years ago

#1670 closed help (fixed)

Can't get any UKCA job to run on ARCHER

Reported by: jsabrooke Owned by: ros
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 8.4

Description

Hi,

I'm really sorry that I'm asking a question that probably has a very simple answer. I did the UM/FCM and UKCA tutorials successfully in January, and need to do some real runs now but it isn't working anymore. When I run any job (I've tried the actual one I want, and both of the above tutorial jobs), I get:

Calling MAIN_SCR - local...
(This may take several minutes.)

MAIN_SCR: Calling UMSUBMIT ...

Your job directory on host login.archer.ac.uk is: /home/n02/n02/jsab500/umui_runs/xltqa-268133613

umui_runs/xltqa-268133613/SUBMIT[60]: .: /home/n02/n02/jsab500/umui_runs/xltqa-268133613/COMP_SWITCHES: cannot open [No such file or directory]
MAIN_SCR: Submit failed



Your job directory on host login.archer.ac.uk is: /home/n02/n02/jsab500/umui_runs/xltqa-268133452

umui_runs/xltqa-268133452/SUBMIT[60]: .: /home/n02/n02/jsab500/umui_runs/xltqa-268133452/COMP_SWITCHES: cannot open [No such file or directory]

Based on answers to other tickets, I suspect that it may be something in the .profile that needs changing. I've tried the minimal .profile in the "Setting Up" section of the UM/FCM tutorial, and also the version that I had before. The older version has a bit more in it, and I don't remember where I got it from…

I've checked that the umui_runs directory does exist, but the job subdirectory never gets created in it. I've also deleted a lot of stuff to make sure that there's enough space. I tried changing the UM_OUTDIR in FCM extract options, but whatever I do gets the same result. As far as I can tell, the ssh seems to be set up correctly.

Please let me know what stupid thing I've done wrong!
Thanks very much for your help,
James

Change History (6)

comment:1 Changed 4 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi James,

The job you reference, xltqa, still has a lot of settings that are for the Met Office machine, so it's not calling the correct submission scripts, etc.

The following page documents the generic changes that are required to take a Met Office job and run it on ARCHER: http://cms.ncas.ac.uk/wiki/Faq/ConvertMonsoonJobToArcher

Can you please check that you have done all of these and then try submitting again.

I am concerned that even the umui_runs directory hasn't made it across, which can indicate an ssh problem. Can you confirm that you are able to log in to ARCHER from PUMA without needing to enter either a password or passphrase?
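
One way to make that check non-interactively, as a minimal sketch: BatchMode=yes is a standard OpenSSH client option that makes ssh fail rather than prompt for a password or passphrase. The function name and the SSH_CMD override are hypothetical, added here only so the check can be dry-run without a network connection.

```shell
# Hypothetical helper: succeeds only if ssh can log in without prompting.
# SSH_CMD is an assumed override for dry runs; it defaults to the real ssh client.
passwordless_login_ok() {
    # $1: destination, e.g. <username>@login.archer.ac.uk
    # BatchMode=yes disables password and passphrase prompts, so a login that
    # would need one fails instead of waiting for input.
    ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}
```

Run from PUMA, something like `passwordless_login_ok <username>@login.archer.ac.uk && echo ok` should print ok only when the key and agent are set up so that no prompt is needed.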

Regards,
Ros.

comment:2 Changed 4 years ago by jsabrooke

Hi Ros, thanks for your response. Sorry, I shouldn't have given that example as I hadn't yet put any effort into changing it from MONSOON to ARCHER. I'll do that later, but I do get exactly the same problem for the ARCHER tutorial jobs, which I should have used as the example.

Logging into ARCHER from PUMA was working perfectly fine on Friday afternoon (i.e. it didn't ask me for the password or key passphrase, though I did have to do "ssh-add"), but the error I asked about occurred. I tried removing everything in the PUMA and ARCHER .ssh directories (except setup), and recreating the key several times, with the same result.

Then later on Friday, ssh-add stopped working too, giving the "Could not open a connection to your authentication agent" error. Removing environment.puma and relogging in does say it's initialising a new ssh agent, but I still get the same error with ssh-add.

I tried this by logging in from Graham Mann's Linux machine, and also from PuTTY and MobaXterm on my Windows computer, all of which now give the authentication agent error. I've tried deleting and recreating the key (including deleting the authorized_keys file on ARCHER), and using various combinations of the ssh options in MobaXterm (like "Use SSH-Agent" and "Forward SSH-Agent"). If I have "Forward SSH-Agent" ticked, it now does give me the option to enter my passphrase, but then says "SSH_AGENT_FAILURE; Could not add identity: /home/jsabrooke/.ssh/id_dsa".

I don't know if this could cause any issues, but I had previously been using Pageant to log into ARCHER directly, with a different ssh key, but I deleted that key very early in my attempts on Friday.

Thanks again for your help,
James

comment:3 Changed 4 years ago by ros

Hi James,

ssh problems can be difficult to diagnose, so if we start again from a clean slate hopefully I can help.

Please don't forward ssh-agent when you log into PUMA as this will likely confuse matters.

Can you please remove everything again under your ~/.ssh directories on both PUMA and ARCHER and then follow the instructions in sections 2.7 & 2.8 on the Tutorial setup page: http://cms.ncas.ac.uk/documents/training/March2015/UM_practicals/getting_setup.html (Login with your help desk login or the generic umdoc)

Let me know how you get on and we'll take things from there.

Cheers,
Ros.

comment:4 Changed 4 years ago by jsabrooke

Hi Ros,

Thanks. I hadn't remembered that there was a different set of instructions; I was using http://puma.nerc.ac.uk/trac/UM_TUTORIAL/wiki/UmTutorial/SettingUp. I don't know what was different, but the ssh works again. Maybe it was replacing the .profiles again? Also, I normally had a separate direct connection to ARCHER open, which I closed before doing this. Could that have been a problem?

I don't know if this matters, but when I ran the install-ssh-keys script, it reported that it had worked, yet it didn't actually create the authorized_keys file, so I made it with "cat ~/.ssh/id_rsa.pub | ssh jsab500@… 'mkdir -p .ssh ; cat - >> ~/.ssh/authorized_keys'" from the other instructions.
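
For reference, the remote-side steps of that manual install can be sketched as below. This is not the install-ssh-keys script, whose contents aren't shown in this ticket; the function name is hypothetical, and the duplicate check and permission settings are additions (sshd with its default StrictModes setting can ignore an authorized_keys file whose permissions are too open).

```shell
# Hypothetical helper illustrating what a manual public-key install has to do.
install_pubkey() {
    # $1: public key file, $2: target .ssh directory (created if missing)
    mkdir -p "$2" && chmod 700 "$2" || return 1
    key=$(cat "$1") || return 1
    # Append the key only if that exact line is not already present.
    grep -qxF -- "$key" "$2/authorized_keys" 2>/dev/null ||
        printf '%s\n' "$key" >> "$2/authorized_keys"
    # Keep the file private to the owner.
    chmod 600 "$2/authorized_keys"
}
```

Calling it twice with the same key leaves a single entry, so it is safe to re-run after recreating the .ssh directory.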

Anyway, I seemed to be back where I was on Friday afternoon. I submitted the UM FCM tutorial job and it failed with the same error. But then I removed the only change that I could think of that I'd made from the basic setup, which was adding a cd to my work directory in my .bashrc file on ARCHER, as I rarely need to look in the home one. I've confirmed that this was the problem, and it now submits ok! I thought that this would be a totally innocuous command… I moved it to .profile and submitting still works.
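
A rough way to check for this kind of problem is to compare where a non-interactive ssh command lands with the expected home directory. Bash typically reads ~/.bashrc, but not ~/.profile, when sshd starts it to run a remote command, which would be consistent with the behaviour above. The helper name below is hypothetical, and the command that reports the landing directory is passed in as arguments so the sketch can be tried without a connection.

```shell
# Hypothetical helper: compare the directory a command lands in with the expected one.
check_landing_dir() {
    # $1: expected directory; remaining arguments: command that prints the landing directory
    expected_dir=$1
    shift
    landed_dir=$("$@") || return 2
    if [ "$landed_dir" = "$expected_dir" ]; then
        echo "ok: lands in $expected_dir"
    else
        echo "mismatch: lands in $landed_dir, expected $expected_dir"
        return 1
    fi
}

# Intended use from PUMA (placeholders left as-is):
#   check_landing_dir /home/n02/n02/<username> ssh <username>@login.archer.ac.uk pwd
```

With a cd in the remote .bashrc, the second form would be expected to report a mismatch.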

Thanks very much for your help,
James

comment:5 Changed 4 years ago by ros

Hi James,

Glad to hear you have got it working now. Not sure why the script didn't create the authorized_keys file for you automatically. Thanks for letting me know, as I will double-check this before we run the next course in case something on ARCHER has changed.

Yes, the UMUI submission scripts assume that when you ssh to the remote machine you land in your $HOME directory, so unfortunately any change of directory in the startup files will cause a problem. If you haven't already done so, you could do what a lot of people do and add a symbolic link from your /home to your /work directory (cd $HOME ; ln -s /work/n02/n02/<username> work) so that you can then just do cd work once logged in. This will be safer than adding a cd to any .profile/.bashrc file.
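
The symbolic-link approach can be tried out safely as in this small sketch. The two temporary directories are stand-ins (assumptions for illustration) for $HOME and /work/n02/n02/<username>, so nothing in a real home directory is touched.

```shell
# Stand-in directories so the sketch runs anywhere; on ARCHER these would be
# $HOME and /work/n02/n02/<username> respectively.
demo_home=$(mktemp -d)
demo_work=$(mktemp -d)

# Create a link called "work" inside the stand-in home directory.
ln -s "$demo_work" "$demo_home/work"

# Following the link ends up in the work directory; pwd -P prints the
# physical path with symbolic links resolved.
cd "$demo_home/work" && pwd -P
```

Because the link lives in the home directory and no startup file changes directory, an ssh login still lands in $HOME as the submission scripts expect.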

Cheers,
Ros.

comment:6 Changed 4 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed