Opened 4 years ago

Closed 4 years ago

#1547 closed help (fixed)

authentication error and job not resubmitting

Reported by: lboljka
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.6

Description

ae month and then does not resubmit as it is supposed to. I do not know why this is happening, as the .leave files do not seem to report any errors. I ran it with time limits of 5000 s and 60000 s, but this made no difference. I then created a new job without automatic resubmission (jobs xkwjk and xkwjc, which are identical), but the model only ran for about 2 months. This time I did not receive any .rfc.leave or .leave files, only .comp.leave.

Now, for some reason, I cannot open any files on ARCHER because of an authentication fault (I did not have this problem before Thursday). For example, when using emacs to open a .comp.leave file:

X11 connection rejected because of wrong authentication.
X11 connection rejected because of wrong authentication.
Display localhost:19.0 unavailable, simulating -nw

Today I also had to enter my passphrase to log in to ARCHER.

I am clueless as to what is happening at the moment and would therefore appreciate your help.

Thank you in advance.

Best wishes

Lina

Change History (8)

comment:1 Changed 4 years ago by ros

  • Owner changed from lboljka to um_support
  • Status changed from new to assigned

comment:2 Changed 4 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from assigned to accepted

Hi Lina,

The problem you have opening files on ARCHER is due to running out of disk space. I have increased your space on /home, although it will take a little while to come into effect. You may still wish to clear out any files you no longer need.

If you need to enter your passphrase to log in to ARCHER from PUMA, this indicates that your ssh key has become detached from the ssh-agent. Try running ssh-add on PUMA. If you get the error message "Could not open a connection to your authentication agent.", please follow the instructions in the FAQ: http://cms.ncas.ac.uk/wiki/FAQ_T4_F5
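
For reference, the state of the agent can also be checked from a script. The following is only a minimal sketch and is not part of the PUMA setup or the FAQ: the function names describe_agent_status and check_agent are made up for illustration. It relies on the exit statuses documented in the OpenSSH ssh-add manual page (0 on success, 1 if the command fails, 2 if the agent cannot be contacted).

```python
import subprocess

# Exit statuses as documented in the OpenSSH ssh-add manual page.
STATUS_MEANING = {
    0: "agent reachable and holding at least one key",
    1: "agent reachable but the command failed (for -l: no keys loaded)",
    2: "could not contact the authentication agent",
}


def describe_agent_status(returncode):
    """Translate an `ssh-add -l` exit status into a short diagnosis."""
    return STATUS_MEANING.get(returncode, f"unexpected exit status {returncode}")


def check_agent():
    """Run `ssh-add -l` and describe the result (needs OpenSSH on PATH)."""
    result = subprocess.run(["ssh-add", "-l"], capture_output=True, text=True)
    return describe_agent_status(result.returncode)
```

Calling check_agent() in the login session would then report which of the three cases applies. A status of 2 corresponds to the "Could not open a connection to your authentication agent" message mentioned above.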

I will have a look at your .leave files shortly.

Regards,
Ros.

comment:3 Changed 4 years ago by ros

Hi Lina,

The problems with missing .leave files for xkwjk and xkwjc will be due to the disk space issue.

I have taken a look at the most recent .leave file for xkwjr (xkwjr000.xkwjr.d15109.t210438.leave) and this indicates that it was for an NRUN. To continue with the CRUN you will need to go to Compilation & Run options → Compile & Run options for atmosphere and switch off the compilation of the model and reconfiguration. Then switch on Continuation Run and Save, Process and Submit the job.

Hope this helps.
Regards,
Ros.

comment:4 Changed 4 years ago by lboljka

Hi Ros

I will try those out and let you know if I have any more problems (it will take some time for jobs to run). As for deleting files, I think I do it from time to time anyway, so for now I believe I have only those files I need.

Thank you for everything.

Best wishes

Lina

comment:5 Changed 4 years ago by lboljka

Hi Ros

I have tried running the jobs xkwjk and xkwjr again. xkwjr seemed to run OK, but it was running out of space in the output files (.pa, .pb, …). I tried increasing the size of the files in the UM's postprocessing section, but they are now at the maximum. I need 6-hourly data for 1 year (later on I might need it for 10 years), so I am not sure what to do or how to separate the fields.
After increasing the size of the files I still got an error for xkwjk (code 400; Failure writing out field) and no error for xkwjr, but clearly it did not run properly either.

Thank you.

Best wishes

Lina

comment:6 Changed 4 years ago by ros

Hi Lina,

You have run out of disk space on /work. See the .leave file for xkwjk: "BUFFOUT: Write Failed: Disk quota exceeded". I have increased your disk space on /work; this will be applied in a few hours. You can check the details in the ARCHER SAFE.
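
To see which job directories take up the most room before clearing out, something like the following could be used. This is a rough sketch only: the function names are illustrative, and it sums apparent file sizes, which may differ from the usage the file system's quota accounting reports.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; leave it out of the total.
                pass
    return total


def largest_subdirectories(root, top=5):
    """Return (name, bytes) for the largest immediate subdirectories of root."""
    sizes = [
        (entry.name, directory_size_bytes(entry.path))
        for entry in os.scandir(root)
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Calling largest_subdirectories on the directory that holds the job output would list the biggest job directories first, which are the natural candidates for deletion.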

I notice that you are writing restart dumps every 10 days; these will mount up and are ~1.5 GB each. Do you really need to be writing them at this frequency, especially if you will be doing much longer runs? If it is necessary, that's fine. I just wanted to make sure you have thought about the amount of data you will be generating and whether it is all needed.
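
To put rough numbers on this, the arithmetic can be sketched as below. The dump size and interval are the figures quoted above. The 360-day calendar and the assumption that no dumps are deleted or archived during the run are illustrative guesses, not details taken from the job, and the function name is made up for this example.

```python
DUMP_SIZE_GB = 1.5       # approximate size per dump, as quoted above
DUMP_INTERVAL_DAYS = 10  # dump frequency, as quoted above
DAYS_PER_YEAR = 360      # ASSUMPTION: 360-day model calendar


def dump_volume_gb(years, interval_days=DUMP_INTERVAL_DAYS,
                   dump_size_gb=DUMP_SIZE_GB, days_per_year=DAYS_PER_YEAR):
    """Estimate total restart-dump volume in GB, assuming every dump is kept."""
    n_dumps = (years * days_per_year) // interval_days
    return n_dumps * dump_size_gb


for years in (1, 10):
    print(f"{years:>2} year(s): {dump_volume_gb(years):.0f} GB every 10 days, "
          f"{dump_volume_gb(years, interval_days=30):.0f} GB every 30 days")
```

Under these assumptions, 10-day dumps come to about 54 GB for a 1-year run and about 540 GB for 10 years, while 30-day dumps would be a third of that.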

Regards,
Ros.

comment:7 Changed 4 years ago by lboljka

Hi Ros

Ah yes, the dumps do not have to be written every 10 days; I guess I am still learning how to deal with all this. I had it set to 10 days because I was doing 11-day runs at first to get the model working.

I will try rerunning with a few small changes later and remove all the files from /work that I do not need (from old jobs). Hopefully I will not have problems with the .pa, .pb etc. files…

Thank you again!

Best wishes

Lina

comment:8 Changed 4 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed