Opened 3 years ago

Closed 3 years ago

#1949 closed help (answered)

Archiving failure at timestep

Reported by: ucfaako
Owned by: ros
Component: Archiving
Keywords: archiving
Cc:
Platform: ARCHER
UM Version: 6.6.3

Description

Hello,

My run xmfck runs successfully up to a certain timestep (April 2015) and then stops with a server failure when it attempts to archive the output of this timestep to /nerc/.
I have tried resubmitting it, and also starting the simulation from an earlier point (using start dumps created by the model), but the problem persists.
Archiving works up until this point, so I don't expect the archiving setup itself to be the problem, and I have checked that sshing into /nerc/ works fine.

The output is too large to post here, but it can be found in /home/n02/n02/akncas/um/umui_out/xmfck000.xmfck.d16229.t175957.archive.

Many thanks for any help!

Best,
Alex

Change History (12)

comment:1 Changed 3 years ago by ros

Hi Alex,

There are 3 error messages at the top of the output:

/work/n02/n02/akncas/xmfck/dataw/bin/qshector_arch[55]: : cannot open
/work/n02/n02/akncas/xmfck/dataw/bin/qshector_arch[56]: : cannot open
/work/n02/n02/akncas/xmfck/dataw/bin/qshector_arch[57]: : cannot open

Lines 55-57 of the qshector_arch script output error messages when there is a problem ssh'ing to the /nerc archive. I'm not sure why it couldn't open the message file (that's another matter), but the indications are that it couldn't get to the RDF. Can I just confirm that when you say "sshing into /nerc works fine" you are NOT prompted for any password or passphrase?

Regards,
Ros.

comment:2 Changed 3 years ago by ucfaako

Hi Ros,

Correct, I am not prompted. I tried ssh -i ~/.ssh/um_arch espp1 ls /nerc as suggested in #1656, which returned a list of directories (n01, n02, etc.). The failure also seems to occur only at this timestep. Everything prior to April 2015 archives fine.

Thanks,
Alex
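
The manual check in comment:2 can be made stricter so that it cannot pass by silently falling back to a prompt. The following is only a sketch, not part of qshector_arch: the function name check_noninteractive_ssh and the SSH_CMD override are hypothetical, introduced here for illustration. It relies on the standard OpenSSH options BatchMode (fail instead of asking for a password or passphrase) and ConnectTimeout.

```shell
# Sketch only: check that key-based ssh works with no interactive prompt.
# check_noninteractive_ssh and SSH_CMD are hypothetical names, not part of
# the UM archiving scripts. SSH_CMD lets the ssh binary be swapped out.
check_noninteractive_ssh() {
    key="$1"     # private key file, e.g. ~/.ssh/um_arch
    host="$2"    # remote host, e.g. espp1
    ssh_cmd="${SSH_CMD:-ssh}"

    # BatchMode=yes makes ssh fail rather than prompt for a password or
    # passphrase; ConnectTimeout bounds the wait for an unreachable host.
    if "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=10 \
            -i "$key" "$host" ls /nerc >/dev/null 2>&1; then
        echo "ok"
    else
        echo "failed"
        return 1
    fi
}
```

Called as check_noninteractive_ssh ~/.ssh/um_arch espp1, it prints "ok" only if the listing succeeds without any prompt, which is the condition the archiving script depends on when it runs in batch.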

comment:3 Changed 3 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Alex,

It's very difficult to see what's gone on here. I think some changes were made to the job between the NRUN and the CRUN, and your NRUN was a totally different length from the CRUN chunks you've asked for, which can cause problems.

Could you please try submitting the NRUN again, leaving the resubmission period as a year, and see if it gets past the problem area?

Let us know what happens and if it fails please don't make any changes to the UMUI until we've taken a look.

Regards,
Ros.

comment:4 Changed 3 years ago by ucfaako

Hi Ros,

The NRUN completed successfully and I've just submitted the CRUN, thank you for your help. I will let you know if there are any further archiving problems.

I thought that it would somehow work to use a shorter NRUN and longer CRUNs - I think we did something like that in our climate modelling summer school, but having NRUN and CRUNs of the same length makes more sense to me too.

Many thanks,
Alex

comment:5 Changed 3 years ago by ucfaako

Hi Ros,

The CRUN failed again while archiving, this time at a timestep in June. The output can be found in /home/n02/n02/akncas/um/umui_out/xmfck000.xmfck.d16232.t080225.archive.

Many thanks,
Alex

comment:6 Changed 3 years ago by ucfaako

Sorry to bug you again. Have you or anyone else had a chance to look at the archiving issue yet?

Many thanks,
Alex

comment:7 Changed 3 years ago by ros

Hi Alex,

Not been able to work this one out yet I'm afraid. I'll probably need to run your job, which will have to wait until tomorrow as the RDF is down today.

Regards,
Ros.

comment:8 Changed 3 years ago by ucfaako

Hi Ros,

Thank you for the update. That's totally fine; let me know if I can be of any help.

Best,
Alex

comment:9 Changed 3 years ago by grenville

Alex

Please take a copy of /work/n02/n02/grenvill/xglaz/bin/qshector_arch to replace the one you are currently using (/work/n02/n02/akncas/xmfck/dataw/bin/qshector_arch). This version retries a few times if the ssh to espp1 fails for any reason. I'm not sure this will solve the problem you are seeing but it's worth a try.

Remember, if you do a full rebuild at any point, you'll need to copy the script again (until we get it in the archiving branch).

Grenville
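
The retry behaviour described in comment:9 can be sketched as follows. This is not the code in the modified qshector_arch, which is not shown in this ticket. It is a minimal illustration of the idea, and the function name retry and its arguments (maximum attempts, delay in seconds) are assumptions made for the example.

```shell
# Sketch only: retry a command a fixed number of times before giving up.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# The name and interface are hypothetical, not taken from qshector_arch.
retry() {
    max="$1"
    delay="$2"
    shift 2
    attempt=1
    while :; do
        # Run the command; stop at the first success.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example call (commented out, needs access to espp1):
# retry 3 30 ssh -o BatchMode=yes -i ~/.ssh/um_arch espp1 ls /nerc
```

A wrapper of this kind only masks transient failures of the connection to espp1. If the ssh fails on every attempt, it still returns a non-zero status so the archiving step can report the error.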

comment:10 Changed 3 years ago by ucfaako

Thank you, Grenville. I've copied the script and restarted the run. I'll let you know if it works.

Best,
Alex

comment:11 Changed 3 years ago by ucfaako

Everything worked well, thanks again Grenville.

One quick question: I now need to change one ancillary file path (one of the ancils is split into 20-year chunks). What would be the best practice for this? Should I create a copy of the run and build a new executable and a new reconfiguration, or is it possible to continue with the existing run and executables?

Many thanks,
Alex

comment:12 Changed 3 years ago by ros

  • Resolution set to answered
  • Status changed from accepted to closed

Hi Alex,

If it were me, I would probably take a copy of the run and change the ancillary file path, and rather than rebuild the executables I would point the new job to the executables from the previous run. (See Sub-Model Indep → Compilations and modifications → Compile options for model, and similarly the modifications for the reconfiguration.)

Regards,
Ros.
