#1656 closed help (answered)

qsserver errors

Reported by: eelrm Owned by: willie
Priority: highest Component: UM Model
Keywords: archiving, SSH Cc: gmann
Platform: ARCHER UM Version: 8.4

Description

Hi,

I’ve been trying to complete a 10-year run (xlkpe) on ARCHER but keep getting qsserver failures. The first time this happened (xlkpd, after 4 years), I restarted the job and got a couple more years of output before hitting the same error:

T qsserver failure at Thu Sep 10 21:51:27 BST 2015
qscasedisp: return code after calling qshector_arch RCARC=2

T qsserver failure at Thu Sep 10 22:12:58 BST 2015
0+1 records in
0+1 records out
1810 bytes (1.8 kB) copied, 0.00150279 s, 1.2 MB/s
Archiving failure:restart file moved to
mv: missing destination file operand after `/xlkpe*'

Is this something that will keep happening due to the archiving or is there something else I should do?

Thank you,

Lauren

Change History (12)

comment:1 Changed 20 months ago by willie

  • Keywords archiving added

Hi Lauren,

It looks like something's gone wrong with archiving earlier:

/work/n02/n02/eelrm/xlkpe/bin/qshector_arch[56]: : cannot open

Ensure that you have followed the instructions at http://cms.ncas.ac.uk/wiki/Archer/NercArchiving, especially those on setting up SSH.

Later in the leave file you get

ERROR EXPPXI: INVALID row VALUE:     0
Im_ident,Sec,Item:     1   34   50

which looks like some problem with PP file conversion.

Regards,

Willie

comment:2 Changed 19 months ago by eelrm

Hi Willie,

I redid the archiving and SSH setup and resubmitted the job, but it has failed again. I have the ff2pp archiving branch included; could that be relevant?

Thanks,

Lauren

comment:3 Changed 19 months ago by willie

Hi Lauren,

Looking at xlkpf, nothing has been archived, but then there is nothing in the output PP files. I think this is due to the "missing" STASH item in section 34, item 50. This doesn't exist at UM8.4 and I can't see where it is being included. Do you have any ideas?

Regards

Willie

comment:4 Changed 19 months ago by mdalvi

Hi Lauren,

The request for 34050 comes from my hand-edit /home/mdalvi/umui_jobs/hand_edits/vn8.4/add_ukca_eval1_diags_l85.ed.
However, this hand-edit is used in all vn8.4 jobs (IBM, Cray, ARCHER) and I have not come across the archiving/qsserver issue before.

In any case, a copy of the hand-edit without this item is now available at /home/mdalvi/umui_jobs/hand_edits/vn8.4/add_ukca_eval1_diags_l85_fix.ed, so you can use this and see if the problem is resolved.

Regards


Mohit

comment:5 Changed 19 months ago by willie

Hi Lauren,

Do you still need help with this?

Regards

Willie

comment:6 Changed 19 months ago by eelrm

Hi,

Unfortunately, the new hand-edit did not resolve the problem (xlkpf). I've tried running without the ff2pp branch, but the only way I can run anything is by turning automatic post-processing off. As I am planning a large number of simulations, this doesn't seem like a good option.

Graham Mann is able to run xlrfk fine, but my identical copy (xlwbb) failed after 5 mins. Could there be another reason why this is crashing so early for me?

Thank you,

Lauren

comment:7 Changed 19 months ago by willie

  • Keywords archiving, SSH added; archiving removed
  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Lauren,

As you say, xlrfk and xlwbb are identical. Therefore the problem is in the SSH setup for archiving. Try the following on ARCHER:

  cd ~
  mv .ssh .ssh_old
  mkdir .ssh
  chmod og-srx .ssh
  cd .ssh

  ssh-keygen -f um_arch

  cat um_arch.pub >> authorized_keys

Remember to hit return twice at the keygen stage, which leaves the passphrase empty. This will give a clean authorized_keys file. If this works, you can delete .ssh_old.
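If you want to confirm the key was installed before resubmitting, here is a minimal sketch. The helper name key_is_authorized is hypothetical (it is not part of the UM scripts); it only checks that the public key appears as a complete line in the authorized_keys file.

```shell
# Hypothetical helper (not part of the UM scripts): succeeds if the
# public key stored in file $1 appears as a complete line in file $2.
key_is_authorized() {
    pub_file=$1
    auth_file=$2
    # The public key file must exist and be non-empty.
    [ -s "$pub_file" ] || return 1
    # -F fixed string, -x whole-line match, -q quiet, -f read pattern from file.
    grep -qxF -f "$pub_file" "$auth_file"
}
```

For the setup above it could be called as: key_is_authorized ~/.ssh/um_arch.pub ~/.ssh/authorized_keys && echo "key installed"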

Regards,

Willie


comment:8 Changed 19 months ago by willie

Hi Lauren,

You can test whether you've set it up correctly by running the following on ARCHER:

ssh -i ~/.ssh/um_arch espp1 ls /nerc

This should give a list of files and directories that includes n02.
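The same check can be scripted so it does not rely on reading the listing by eye. This is only a sketch: the helper name listing_has_entry is hypothetical, and the use of -o BatchMode=yes (which stops ssh from prompting for a password, much as a batch job would behave) is my assumption about what is useful here, not something the archiving scripts are known to do.

```shell
# Hypothetical helper: reads a directory listing on stdin and succeeds
# if one entry is exactly the given name (default: n02).
listing_has_entry() {
    grep -qxF -- "${1:-n02}"
}

# Possible use on ARCHER (not run here):
#   ssh -i ~/.ssh/um_arch -o BatchMode=yes espp1 ls /nerc | listing_has_entry n02 \
#       && echo "archive access OK"
```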

Regards

Willie

comment:9 Changed 19 months ago by eelrm

Hi Willie,

Okay, thank you. Before changing anything, I checked and n02 was already listed when I ran ssh -i ~/.ssh/um_arch espp1 ls /nerc. I followed your instructions anyway and can confirm that n02 is listed. However, as I had then lost my puma-archer SSH key, I redid the instructions here: http://puma.nerc.ac.uk/trac/UM_TUTORIAL/wiki/Ros/sshAgent. Would this have messed things up?

I'll try submitting another job.

Thanks,

Lauren

comment:10 Changed 19 months ago by eelrm

Hi Willie,

I resubmitted a job (xlkph) identical to one that had previously completed its NRUN (xlkpd). This time the job ran for 16 hours, but again hit a qsserver error and so did not complete. Any ideas?

Thank you,

Lauren

comment:11 Changed 19 months ago by willie

Hi Lauren,

It has run for 16560 time steps (230 days) and been archiving nicely up to about 221 days. Then we have:

cp: cannot stat `/work/n02/n02/eelrm/tmp/tmp.mom2.2868/xlkpha.pb19910710.pp': Input/output error

This takes the model down and removes $UM_TMPDIR, which then results in the "cannot open" messages (because the qshector_arch script is trying to create an email there).
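Since the later messages are consequences of the first one, it helps to find the earliest failure in the leave file rather than the last. A minimal sketch, assuming GNU grep: the helper name first_failure is hypothetical, and the patterns are just the messages quoted in this ticket, not a complete list of UM errors.

```shell
# Hypothetical helper: print the earliest line (prefixed with its line
# number) in leave file $1 that matches one of the messages seen in this ticket.
first_failure() {
    # -n show line number, -m 1 stop after the first match, -E extended regex.
    grep -n -m 1 -E 'Input/output error|cannot open|qsserver failure' "$1"
}
```

Called as first_failure path/to/leave_file, it prints the first matching line; anything reported after that line is likely to be a knock-on effect.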

I notice that the pb stream files are always empty, so there is no need to generate them. This particular file isn't in the archive and it is not on /work either. Do you know if anything special happens at this date in the model?

I can only advise that you take a copy of xlkph and repeat the run to see if it fails in the same place.
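To compare the repeat run against this one, the figures above can be converted between time steps and model days. This is a rough sketch that assumes a constant time step and an 86400-second day; the variable names are only illustrative.

```shell
steps=16560      # time steps completed, as reported above
days=230         # model days those steps cover, as reported above
steps_per_day=$((steps / days))            # 72
step_seconds=$((86400 / steps_per_day))    # 1200 s, i.e. 20 minutes
last_good_step=$((221 * steps_per_day))    # 15912, roughly where archiving last worked
echo "steps/day=$steps_per_day step=${step_seconds}s last_good_step=$last_good_step"
```

If the copy fails near the same step count, that points to something in the model at that date; if it fails elsewhere, a transient file system problem is more likely.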

Regards,

Willie

comment:12 Changed 18 months ago by willie

  • Resolution set to answered
  • Status changed from accepted to closed