Opened 4 years ago
Closed 4 years ago
#2006 closed help (fixed)
Restart error after archiving failure
Reported by: | ucfaako | Owned by: | um_support |
---|---|---|---|
Component: | Archiving | Keywords: | restart; qshector_arch |
Cc: | | Platform: | ARCHER |
UM Version: | 6.6.3 | | |
Description
Hi,
My simulation xmfcl stopped following an archiving failure, similar to #1949 (these still happen, even with Grenville's patch, although less frequently now). When I tried to restart, the run crashed immediately. The .leave file /home/n02/n02/akncas/um/umui_out/xmfcl000.xmfcl.d16295.t134733.archive reported:
/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch[55]: : cannot open
/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch[56]: : cannot open
/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch[57]: : cannot open
cp: cannot create regular file `/xmfcl-295134723': Read-only file system
and further down
Archiving failure:restart file moved to
mv: missing destination file operand after `/xmfcl*'
I could start it from the last dump (I'd only lose 2 months), but maybe it's something easy to fix.
The previous .leave file is /home/n02/n02/akncas/um/umui_out/xmfcl003.xmfcl.d16294.t020012.archive, in case it's relevant.
Many thanks,
Alex
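For readers triaging a similar failure, the following is a minimal sketch of how a .leave file could be scanned for the error strings quoted in the description above. The function name `scan_leave` is hypothetical and not part of the UM scripts, and the pattern list is an assumption built only from the messages shown in this ticket, so it may well miss other failure modes.

```shell
#!/bin/sh
# Hypothetical helper (not part of the UM scripts): print, with line numbers,
# every line of a .leave file that matches one of the archiving error strings
# quoted in this ticket. Returns 0 if at least one match is found, 1 otherwise.
scan_leave() {
    leave_file=$1
    grep -n -E \
        'cannot open|Read-only file system|Archiving failure|missing destination file operand' \
        "$leave_file"
}

# Example call (path taken from the description above):
#   scan_leave /home/n02/n02/akncas/um/umui_out/xmfcl000.xmfcl.d16295.t134733.archive
```

An empty result only means that none of these particular strings appear; it does not show that archiving succeeded.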
Change History (10)
comment:1 Changed 4 years ago by ucfaako
comment:2 Changed 4 years ago by ros
Hi Alex,
ARCHER has been down since yesterday and is not due to return to service until tomorrow (Wednesday) evening at the earliest, so we have not yet been able to look into this for you. There have also been problems with the RDF recently, which may well have been the cause of your problem. Once ARCHER is back we will look into this for you.
Regards,
Ros.
comment:3 Changed 4 years ago by ucfaako
Hi Ros,
Sorry, I wasn't aware of these issues. Many thanks for letting me know and for looking into this.
All the best,
Alex
comment:4 Changed 4 years ago by ros
Hi Alex,
If you haven't received the ARCHER mailings in the last couple of days I would recommend that you check your ARCHER email list settings. In ARCHER SAFE go to Your details → email list settings. I suggest subscribing to the Major Announcements, Service News and System Status notifications.
Cheers,
Ros.
comment:5 Changed 4 years ago by ros
- Status changed from new to pending
Hi Alex,
Please try submitting your job again. The error messages above indicate that it was unable to ssh to the RDF. There were problems with the RDF in the latter part of last week, and it was taken down on Friday if not before (I can't remember exactly when they took it offline), which is why your restart failed immediately.
Regards,
Ros.
comment:6 Changed 4 years ago by ucfaako
Hi Ros,
Submitted it and it's now sitting in the queue, I'll let you know about the outcome.
Many thanks,
Alex
comment:7 Changed 4 years ago by ucfaako
Hi Ros,
The job ran successfully, but unfortunately only until another server failure. I resubmitted it, but this server issue seems to be very persistent. It really slows down the experiment because of the long queuing time for each resubmission after every failure (the jobs are in 1-year chunks, ~8.5h requested, and usually run at night). HadGEM2ES is basically making less than 1 year/day.
Is there anything you can do? The .leave file is /home/n02/n02/akncas/um/umui_out/xmfcl000.xmfcl.d16300.t152037.archive.
Any help is greatly appreciated!
Many thanks,
Alex
comment:8 Changed 4 years ago by ros
Hi Alex,
A couple of things:
Could you check that you can still ssh to espp1 without being prompted for a password/passphrase? Assuming you did the standard setup, try:
ssh -i ~/.ssh/um_arch espp1 ls /nerc
Also, the qshector_arch for xmfcl doesn't have Grenville's patch. Please take a copy of /work/n02/n02/grenvill/xglaz/bin/qshector_arch to replace the one you are currently using (/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch).
Remember, if you do a full rebuild at any point or copy the job, you'll need to copy the script again (until we get it in the archiving branch).
Cheers,
Ros.
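The two steps in the comment above could be sketched as shell helpers along the following lines. Both function names are hypothetical, and the sketch makes assumptions the thread does not state: that OpenSSH's `BatchMode` option is an acceptable way to make the login fail instead of prompting, and that keeping a `.orig` backup of the existing script is wanted.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if key-based ssh works with no prompt.
# BatchMode=yes makes ssh fail rather than ask for a password/passphrase.
check_ssh() {
    key=$1
    host=$2
    ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=10 "$host" ls /nerc >/dev/null 2>&1
}

# Hypothetical helper: replace a job's qshector_arch with a patched copy,
# saving the existing script as <target>.orig first (an assumption, see above).
install_patched_arch() {
    patched=$1
    target=$2
    [ -r "$patched" ] || { echo "cannot read $patched" >&2; return 1; }
    if [ -e "$target" ]; then
        cp -p "$target" "$target.orig" || return 1
    fi
    cp -p "$patched" "$target" && chmod u+x "$target"
}

# Example calls, using the key, host and paths quoted in the comment above:
#   check_ssh ~/.ssh/um_arch espp1 || echo "ssh to espp1 needs attention"
#   install_patched_arch /work/n02/n02/grenvill/xglaz/bin/qshector_arch \
#       /work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch
```

As noted in the comment, the copy has to be repeated after a full rebuild or after copying the job.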
comment:9 Changed 4 years ago by ucfaako
Hi Ros,
Many thanks for your quick reply. ssh to espp1 works; I must have forgotten to copy Grenville's patch (I thought I had already copied it…)! Many thanks for pointing that out! Let's hope it runs smoothly now.
Cheers,
Alex
comment:10 Changed 4 years ago by ros
- Resolution set to fixed
- Status changed from pending to closed
Hello again,
Has anyone had a chance to look at the .leave file? Am I potentially missing something?
Cheers,
Alex