Opened 3 years ago

Closed 3 years ago

#2006 closed help (fixed)

Restart error after archiving failure

Reported by: ucfaako Owned by: um_support
Component: Archiving Keywords: restart; qshector_arch
Cc: Platform: ARCHER
UM Version: 6.6.3

Description

Hi,

My simulation xmfcl stopped following an archiving failure, similar to #1949 (these still happen, even with Grenville's patch, although less frequently now). When I tried to restart, the run crashed immediately. The .leave file /home/n02/n02/akncas/um/umui_out/xmfcl000.xmfcl.d16295.t134733.archive contained:

/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch[55]: : cannot open
/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch[56]: : cannot open
/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch[57]: : cannot open
cp: cannot create regular file `/xmfcl-295134723': Read-only file system

and further down

Archiving failure:restart file moved to
mv: missing destination file operand after `/xmfcl*'

I could restart it from the last dump (I'd only lose 2 months), but maybe it's something easy to fix.

The previous .leave file is /home/n02/n02/akncas/um/umui_out/xmfcl003.xmfcl.d16294.t020012.archive, in case it's relevant.

Many thanks,
Alex

Change History (10)

comment:1 Changed 3 years ago by ucfaako

Hello again,

Has anyone had a chance to look at the .leave file? Am I potentially missing something?

Cheers,
Alex

comment:2 Changed 3 years ago by ros

Hi Alex,

ARCHER has been down since yesterday and is not due to return to service until tomorrow (Wednesday) evening at the earliest, so we have not yet been able to look into this for you. There have also been problems with the RDF recently, which may well have been the cause of your problem. Once ARCHER is back we will look into this for you.

Regards,
Ros.

comment:3 Changed 3 years ago by ucfaako

Hi Ros,

Sorry, I wasn't aware of these issues. Many thanks for letting me know and for looking into this.

All the best,
Alex

comment:4 Changed 3 years ago by ros

Hi Alex,

If you haven't received the ARCHER mailings in the last couple of days I would recommend that you check your ARCHER email list settings. In ARCHER SAFE go to Your details → email list settings. I suggest subscribing to the Major Announcements, Service News and System Status notifications.

Cheers,
Ros.

comment:5 Changed 3 years ago by ros

  • Status changed from new to pending

Hi Alex,

Please try submitting your job again. The error messages above indicate that it was unable to ssh to the RDF. There were problems with the RDF in the latter part of last week and it was taken down on Friday, if not before (I can't remember exactly when they took it offline), which is why your restart failed immediately.
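
To illustrate how a failed remote lookup could produce a path like `/xmfcl-295134723`, here is a minimal sketch. It assumes the destination is built as "directory/name" and that the directory part came back empty; the contents of qshector_arch are not shown in this ticket, so the helper name `archive_target` and this logic are hypothetical, not the script's actual code.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from qshector_arch: build an archive
# target path and refuse to continue when the directory part is empty.
archive_target() {
    dir="$1"     # assumed to come from a remote lookup that may fail
    name="$2"
    if [ -z "$dir" ]; then
        echo "archive_target: empty destination directory" >&2
        return 1
    fi
    printf '%s/%s\n' "$dir" "$name"
}

# Without a guard, an empty directory makes the target land directly in "/":
unguarded=""
echo "${unguarded}/xmfcl-295134723"   # prints /xmfcl-295134723
```

A guard of this kind would turn the confusing "Read-only file system" message into an explicit error about the missing destination.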

Regards,
Ros.

comment:6 Changed 3 years ago by ucfaako

Hi Ros,

I've submitted it and it's now sitting in the queue. I'll let you know the outcome.

Many thanks,
Alex

comment:7 Changed 3 years ago by ucfaako

Hi Ros,

The job ran successfully, but unfortunately only until another server failure. I resubmitted it, but this server issue seems to be very persistent. It really slows down the experiment because of the long queuing time for each resubmission after a failure. The jobs are in 1-year chunks (~8.5h requested, usually run at night), so HadGEM2ES is basically making less than 1 year/day.

Is there anything you can do? The .leave file is /home/n02/n02/akncas/um/umui_out/xmfcl000.xmfcl.d16300.t152037.archive.

Any help is greatly appreciated!

Many thanks,
Alex

comment:8 Changed 3 years ago by ros

Hi Alex,

A couple of things:

Could you check that you can still ssh to espp1 without being prompted for a password/passphrase? Assuming you did the standard setup, try:

ssh -i ~/.ssh/um_arch espp1 ls /nerc
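
If it helps, the same check can be wrapped so that it fails outright instead of waiting at a prompt. This is only a sketch: the function name `check_rdf_login` is made up for illustration, and the 10-second timeout is an arbitrary choice. `BatchMode=yes` is a standard OpenSSH option that disables password/passphrase prompts.

```shell
#!/bin/sh
# Sketch: check that key-based login works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password/passphrase.
check_rdf_login() {
    host="${1:-espp1}"                 # host from the command above
    key="${2:-$HOME/.ssh/um_arch}"     # key from the standard setup
    if ssh -o BatchMode=yes -o ConnectTimeout=10 -i "$key" "$host" ls /nerc >/dev/null 2>&1; then
        echo "ok: $host reachable without prompting"
    else
        echo "FAILED: cannot reach $host non-interactively" >&2
        return 1
    fi
}
```

Running `check_rdf_login` with no arguments uses espp1 and ~/.ssh/um_arch; a non-zero exit status means the archiving step would not be able to log in unattended either.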

Also, the qshector_arch for xmfcl doesn't have Grenville's patch. Please take a copy of /work/n02/n02/grenvill/xglaz/bin/qshector_arch to replace the one you are currently using (/work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch).

Remember, if you do a full rebuild at any point or copy the job, you'll need to copy the script again (until we get it in the archiving branch).
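
Since the copy has to be repeated after every rebuild or job copy, a small helper may save some typing. This is a sketch, not an official tool: the function name `install_patched_arch` and the `.orig` backup suffix are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: replace a job's qshector_arch with the patched copy,
# keeping a backup of the version it replaces.
install_patched_arch() {
    src="$1"     # patched script
    dest="$2"    # script currently used by the job
    if [ ! -f "$src" ]; then
        echo "missing source: $src" >&2
        return 1
    fi
    # Nothing to do if the job already has the patched version.
    if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
        echo "already up to date: $dest"
        return 0
    fi
    # Keep the old script so the change can be undone.
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.orig" || return 1
    fi
    cp -p "$src" "$dest" && echo "installed: $dest"
}
```

For this job the call would be `install_patched_arch /work/n02/n02/grenvill/xglaz/bin/qshector_arch /work/n02/n02/akncas/xmfcl/dataw/bin/qshector_arch`. The "already up to date" message is also a quick way to check whether a job has the patch.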

Cheers,
Ros.

comment:9 Changed 3 years ago by ucfaako

Hi Ros,

Many thanks for your quick reply. ssh to espp1 works; I must have forgotten to copy Grenville's patch (I thought I had already copied it…)! Many thanks for pointing that out. Let's hope it runs smoothly now.

Cheers,
Alex

comment:10 Changed 3 years ago by ros

  • Resolution set to fixed
  • Status changed from pending to closed