Opened 5 years ago

Closed 5 years ago

#1377 closed help (fixed)

CRUN run not re-submitting

Reported by: webber24
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.4

Description

Dear Helpdesk,

I am having difficulties with the resubmission of a job (xkctj). I have compiled the model and run it successfully for a month. Following this I selected the continuous model run (CRUN), with a long resubmission time (easily long enough for the one month I require). Unfortunately the job overruns the resubmission period and eventually runs out of time after generating 3 months of output (2 months more than the resubmission period), and then does not resubmit.

Hopefully you can help with this issue,

Many Thanks,

Chris

Change History (15)

comment:1 Changed 5 years ago by ros

Hi Chris,

Could you please give us permission to read your ARCHER directories.

  chmod -R g+rX /home/n02/n02/webber24
  chmod -R g+rX /work/n02/n02/webber24
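
For anyone unsure what `g+rX` grants, here is a small sketch that can be run in a scratch directory rather than on the real ARCHER paths. The directory and file names are made up for the demonstration, as is the `mode_of` helper. The capital `X` adds group execute/search permission only to directories and to files that are already executable, so ordinary data files end up group-readable but not executable.

```shell
# Scratch demonstration; nothing here touches /home or /work.
demo=$(mktemp -d)
mkdir -p "$demo/job"
printf 'data\n' > "$demo/job/output.txt"
printf '#!/bin/sh\n' > "$demo/job/run.sh"

# Start from owner-only permissions.
chmod 700 "$demo/job" "$demo/job/run.sh"
chmod 600 "$demo/job/output.txt"

# Same form as the command above: group read everywhere, group
# execute/search only on directories and already-executable files.
chmod -R g+rX "$demo"

# Hypothetical helper: print the ten-character mode string of a path.
mode_of() {
    ls -ld "$1" | cut -c1-10
}

mode_of "$demo/job"             # drwxr-x---
mode_of "$demo/job/output.txt"  # -rw-r-----
mode_of "$demo/job/run.sh"      # -rwxr-x---
```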

Cheers,
Ros.

comment:2 Changed 5 years ago by webber24

Hi Ros,

That's been done for you.

Cheers,

Chris

comment:3 Changed 5 years ago by ros

Hi Chris,

When you set the CRUN going on 24th Sept it picked up from the previous run and started at 02/02/2003. The last dump it wrote was for 01/03/2003, so it did only run for 1 month as you specified. However, I can see from the output that all PEs exited successfully but the job just hung until the walltime expired. I have seen this behaviour before but with UM vn8.2 on ARCHER. On Monday I will look to see if this is the same problem or not.

Regards,
Ros.

comment:4 Changed 5 years ago by webber24

Hi Ros,

Just wondered whether there had been any breakthroughs on this one as yet?

Many Thanks,

Chris

comment:5 Changed 5 years ago by ros

Hi Chris,

Sorry, yes I was just trying something out, but the ARCHER queues are being incredibly slow. Could you try turning off post-processing in window Post Proc > Main Switch and see if that solves the problem?

Cheers,
Ros.

comment:6 Changed 5 years ago by webber24

Hi Ros,

Thanks for your help, that seems to have solved the issue!

Cheers,

Chris

comment:7 Changed 5 years ago by webber24

Hi Ros,

Before you close this ticket I have a quick question and a new issue. The question is about archiving data: is this possible with the above switch turned off? I ask because switching it off greys out the field for archiving data on HECTOR etc.

The new issue appears to concern the ssh-keys between PUMA and ARCHER. When I submit a job, the password is requested after every submission and the password I enter is rejected. I know it shouldn't ask for my password at this stage if the ssh-keys have been added correctly, but I have tried re-adding them using your script and still have no luck.

Many Thanks,

Chris

comment:8 Changed 5 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Chris,

If you wish to do archiving then you need to turn the above switch back on and include a branch to add in the archiving code. See Archiving on ARCHER for instructions.

Regarding the ssh-keys, I would guess that the ssh-agent is not running properly. Try removing the file $HOME/.ssh/environment.puma on PUMA and then log out and log back in again. You will hopefully then see a message along the lines of "re-initialising ssh-agent". You should then be able to run ssh-add successfully, which should allow connections from PUMA to ARCHER without prompting for a password/passphrase every time.
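
As a rough way of telling which state the agent is in, the exit status of `ssh-add -l` can be inspected: OpenSSH documents 0 for success, 1 when the command fails (for `-l`, the agent has no identities) and 2 when the agent cannot be contacted. The sketch below wraps that in a function; the name `describe_agent_status` and the wording of the messages are just illustrative.

```shell
# Map the exit status of `ssh-add -l` to a short diagnosis.
describe_agent_status() {
    case "$1" in
        0) echo "agent running with at least one key loaded" ;;
        1) echo "agent running but no keys loaded: run ssh-add" ;;
        2) echo "cannot contact agent: log out and back in again" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Ask the agent to list its keys; only the exit status is of interest here.
ssh-add -l >/dev/null 2>&1
describe_agent_status "$?"
```

If this reports that the agent cannot be contacted, running ssh-add will not help until the agent itself has been restarted.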

Cheers,
Ros.

comment:9 Changed 5 years ago by webber24

Hi Ros,

I deleted the file in question, but to no avail. I feel that I may have to reset this completely at this point? Also, I now cannot log into ARCHER at all.

Regarding the archiving, my question referred to the fact that we originally turned off archiving to get the resubmission working. If I wanted both features, would the job manage this? I have found the archiving on ARCHER guide at http://cms.ncas.ac.uk/wiki/Archer/NercArchiving and have set up the passwordless ssh-key for archiving since the issues started. Could this be the cause?

Cheers,
Chris

comment:10 Changed 5 years ago by ros

Hi Chris,

Did you then log out of PUMA and back in again? What error message do you get when you run "ssh-add" on PUMA?

Yes, you can do both automatic resubmission and archiving as long as you have the relevant archiving branch included. If you are not recompiling, you will also need to switch on "Enable UM Scripts Build" in window compilation & run options → UM Scripts build the first time you resubmit the run.

As long as you have appended the um_arch.pub to the authorized_keys file this will not have any impact on your login from PUMA to ARCHER.

Cheers,
Ros.

comment:11 Changed 5 years ago by webber24

Hi Ros,

Thanks for clarifying my question regarding archiving. I get no error message (see below) when I run ssh-add and subsequently enter my ARCHER password, but ARCHER will not let me use this password to log in. Normally ARCHER does not ask me for a password once this ssh key has been added, but now ARCHER does not even allow me entry with my original password.

ssh-add
Enter passphrase for /home/webber24/.ssh/id_dsa:
Identity added: /home/webber24/.ssh/id_dsa (/home/webber24/.ssh/id_dsa)

Cheers,

Chris

comment:12 Changed 5 years ago by ros

Hi Chris,

Ok. So ssh-add appears to have worked correctly. If you run "ssh-add -l" you should see something like

puma$ ssh-add -l
2048 77:92:01:36:df:25:54:94:1a:9d:f1:63:76:5f:be:9e /home/test/.ssh/id_dsa (RSA)

If ARCHER now asks for a password (which to be clear is not the passphrase you just entered above) then I would guess that your PUMA public key is not in your authorized_keys file on ARCHER.

To add your public key to that file in a secure manner, do the following:

puma$ cat ~/.ssh/id_dsa.pub | ssh <username>@login.archer.ac.uk 'mkdir -p .ssh ; cat - >> ~/.ssh/authorized_keys'
[Enter your ARCHER password]
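
Note that the one-liner appends the key every time it is run, so repeated attempts leave duplicate lines in authorized_keys. A minimal sketch of an append that checks first is given below. The function name `add_key_once` is made up, the key is placeholder text, and the demonstration runs on scratch files; on ARCHER it would be pointed at ~/.ssh/authorized_keys instead. The final `chmod go-w` is there because sshd, with its default StrictModes setting, ignores key files that are writable by group or others.

```shell
# add_key_once PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Append the public key only if that exact line is not already present.
add_key_once() {
    pubkey_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -F fixed strings, -x whole-line match, -q quiet, -f read patterns from file
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod go-w "$auth_file"
}

# Scratch demonstration with a placeholder key (not a real key).
scratch=$(mktemp -d)
printf 'ssh-dss AAAAB3-placeholder-key-material user@puma\n' > "$scratch/id_dsa.pub"
add_key_once "$scratch/id_dsa.pub" "$scratch/ssh/authorized_keys"
add_key_once "$scratch/id_dsa.pub" "$scratch/ssh/authorized_keys"

# Count the stored lines; the second call should not have added a duplicate.
grep -c . "$scratch/ssh/authorized_keys"
```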

If it still doesn't work, then I can only suggest going through the setup instructions from scratch. See https://puma.nerc.ac.uk/trac/UM_TUTORIAL/wiki/Ros/sshAgent

Cheers,
Ros.

comment:13 Changed 5 years ago by webber24

Hi Ros,

Thanks for that, the suggested solution(s) worked and I was able to submit a continuation run that archived! One final issue I have is that the run, in its initial 1 month configuration phase, stops after 13 days and displays the error:


file_op: Copy: "history_archive/temp_hist.0024" to "/work/n02/n02/webber24/xkctt/xkctt.thist"
file_op: Copy: Complete, 47909 bytes
MPPIO: file op completed

U_MODEL:Failure writing main restart file
Check for problems and restart from temporary file
by overwriting main file with temporary file

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: U_MODEL
? Error Code: 26
? Error Message: Temphist: Failed in OPEN of history file
? Error generated from processor: 0
? This run generated 962 warnings
????????????????????????????????????????????????????????????????????????????????

Do you have any idea what this error relates to? It has only cropped up since I added some stash, which leaves me thinking that the error is quota related. I checked the quota and, although I have used a lot of disk space, I have only used 32% of my quota.

I checked this using df -h . in my work directory, i.e. /work/n02/n02/webber24
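
A caveat on that check: `df` reports usage of the whole file system rather than a per-user quota, so the 32% figure may not reflect my own allowance. The sketch below shows the kind of basic checks I mean: block usage, inode usage and a simple write test. The function name `can_write_here` is made up and the scratch directory stands in for the real work directory. On a Lustre file system the per-user figure would come from something like `lfs quota -u <username> /work`, but whether that applies here is an assumption on my part.

```shell
# Report whether a small file can be created in the given directory.
can_write_here() {
    probe="$1/.write_probe.$$"
    if ( printf 'probe\n' > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable"
    else
        echo "not writable"
        return 1
    fi
}

# Stand-in for the real work directory.
target=$(mktemp -d)

df -P "$target"                        # block usage of the file system
df -i "$target" 2>/dev/null || true    # inode usage, where df supports -i
can_write_here "$target"
```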

The .leave file for this run is:
xkctt000.xkctt.d14288.t181446.leave

within my output file.

Also, you may notice that the job is now xkctt and no longer xkctj.

Any Thoughts?

Cheers,

Chris

comment:14 Changed 5 years ago by webber24

Problem solved, this ticket can now be closed!

Many Thanks for your help,

Chris

comment:15 Changed 5 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed