Opened 18 months ago

Closed 18 months ago

Last modified 6 months ago

#2422 closed help (fixed)

RDF maintenance and data transfer from /work

Reported by: mvguarino Owned by: um_support
Component: Coupled model Keywords:
Cc: Platform: ARCHER
UM Version: 10.7

Description

Hello,

ARCHER announced that next week (3 - 9 March) the RDF, DAC and JASMIN services will be unavailable for maintenance.
I have a couple of simulations running on ARCHER that transfer files to the RDF; however, the automatic data transfer from /work to /nerc won't work during the coming days.
What will happen to the files produced during the maintenance period? Will they be transferred automatically once the RDF is back online,
or should I do something to prevent data loss?

Thank you,

Vittoria

Change History (23)

comment:1 Changed 18 months ago by grenville

Vittoria

Your suite should handle the RDF downtime - the post-processing will fail at some point; the model will continue to run, but Cylc will prevent it from running too far ahead. When the RDF comes back, you need only retrigger the failed tasks and the suite will continue.
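
As a note on the retriggering step: you can either right-click the failed task in the Cylc GUI and select "Trigger", or do it from the command line on PUMA. A sketch, with the suite, task name and cycle point as placeholders:

cylc trigger <suite-id> <task>.<cycle-point>

e.g. cylc trigger u-au022 postproc_atmos.18920101T0000Z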

Grenville

comment:2 Changed 18 months ago by mvguarino

Hi Grenville,

Thanks.

I'm not sure whether I can use this ticket for a second problem that has just come up, but I will try:

Both of the suites I am running stopped yesterday with the following error:

2018-02-27T15:54:34Z ERROR - [jobs-poll cmd] cylc jobs-poll --host=login5.archer.ac.uk --user=vittoria -- '$HOME/cylc-run/u-au022/log/job' 18920101T0000Z/coupled/01
	[jobs-poll ret_code] 1
	[jobs-poll err]
	--------------------------------------------------------------------------------
	This is a private computing facility. Access to this service is limited to those
	who have been granted access by the operating service provider on behalf of the
	contracting authority and use is restricted to the purposes for which access was
	granted. All access and usage are governed by the terms and conditions of access
	agreed to by all registered users and are thus subject to the provisions of the
	Computer Misuse Act, 1990 under which unauthorised use is a criminal offence.
	
	If you are not authorised to use this service you must disconnect immediately.
	--------------------------------------------------------------------------------

I thought it was an ssh-agent problem, so I followed http://cms.ncas.ac.uk/wiki/FAQ_T4_F5 and http://cms.ncas.ac.uk/wiki/RoseCylc/Hints#Settinguprosehost-selectarcher .

When I run the setup-archer-hosts script everything seems to be fine:

mvguarino@puma:/home/mvguarino> ~um/um-training/setup-archer-hosts
Connecting to ARCHER hosts...
Connected to login1.archer.ac.uk
Connected to login2.archer.ac.uk
Connected to login3.archer.ac.uk
Connected to login4.archer.ac.uk
Connected to login5.archer.ac.uk
Connected to login6.archer.ac.uk
Connected to login7.archer.ac.uk
Connected to login8.archer.ac.uk
Connected to login.archer.ac.uk

But rose host-select archer doesn’t work:

mvguarino@puma:/home/mvguarino> rose host-select archer           
[WARN] login8.archer.ac.uk: (ssh failed)
[WARN] login7.archer.ac.uk: (ssh failed)
[WARN] login4.archer.ac.uk: (ssh failed)
[WARN] login1.archer.ac.uk: (ssh failed)
[WARN] login2.archer.ac.uk: (ssh failed)
[WARN] login6.archer.ac.uk: (ssh failed)                                     
login.archer.ac.uk 

As a result, the suites keep trying to submit tasks without succeeding.

Thanks,

Vittoria

comment:3 Changed 18 months ago by grenville

Vittoria

I don't know why rose host-select has stopped working. You could try changing

host = $(rose host-select archer)

to

host = login.archer.ac.uk

in site/archer.rc

until we figure out what is happening.
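
For context, the relevant lines in site/archer.rc usually sit in the suite's runtime section and look something like the sketch below (the family name varies between suites):

[runtime]
    [[HPC]]
        [[[remote]]]
            host = login.archer.ac.uk    # instead of: host = $(rose host-select archer)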

Grenville

comment:4 Changed 18 months ago by mvguarino

Hi,

With that change I could submit my runs.
Please let me know when rose host-select is working again,

thank you

Vittoria

comment:5 Changed 18 months ago by ros

Hi Vittoria,

rose host-select is working for me on PUMA. Please check whether it is still failing for you; if so, it must be something in your setup that we will need to figure out.

Can you please try logging into the failed nodes above and check that you are not being prompted for any user input?

E.g.
ssh <username>@login8.archer.ac.uk

Cheers,
Ros.

comment:6 Changed 18 months ago by mvguarino

Hi Ros,

If I run the rose host-select command today it seems to work, but not consistently:

mvguarino@puma:/home/mvguarino> ~um/um-training/setup-archer-hosts
Connecting to ARCHER hosts...
Connected to login1.archer.ac.uk
Connected to login2.archer.ac.uk
Connected to login3.archer.ac.uk
Connected to login4.archer.ac.uk
Connected to login5.archer.ac.uk
Connected to login6.archer.ac.uk
Connected to login7.archer.ac.uk
Connected to login8.archer.ac.uk
Connected to login.archer.ac.uk
mvguarino@puma:/home/mvguarino> rose host-select archer           
login5.archer.ac.uk
mvguarino@puma:/home/mvguarino> rose host-select archer
[WARN] login3.archer.ac.uk: (ssh failed)
[WARN] login7.archer.ac.uk: (ssh failed)                                                                     
[WARN] login1.archer.ac.uk: (ssh failed)
login5.archer.ac.uk

I logged into login3.archer without needing to type a password or passphrase, though I was asked whether I was sure I wanted to continue connecting.

Also, although I could submit the suites, the coupled task is failing for both of them with the following error:

Rank 678 [Thu Mar  1 04:52:30 2018] [c7-2c2s15n0] application called MPI_Abort(comm=0xC4000009, 1) - process 672
Rank 695 [Thu Mar  1 04:52:30 2018] [c7-2c2s15n0] application called MPI_Abort(comm=0xC4000003, 1) - process 689
_pmiu_daemon(SIGCHLD): [NID 04604] [c7-2c2s15n0] [Thu Mar  1 04:52:30 2018] PE RANK 695 exit signal Aborted
Rank 726 [Thu Mar  1 04:52:30 2018] [c0-3c0s1n3] application called MPI_Abort(comm=0xC4000009, 1) - process 720
[NID 04604] 2018-03-01 04:52:30 Apid 30147031: initiated application termination
[FAIL] run_model # return-code=137
Received signal ERR
cylc (scheduler - 2018-03-01T04:52:41Z): CRITICAL Task job script received signal ERR at 2018-03-01T04:52:41Z
cylc (scheduler - 2018-03-01T04:52:41Z): CRITICAL failed at 2018-03-01T04:52:41Z

And this is what the log viewer shows:

ERROR: remote command failed 255
2018-03-01T09:05:48Z ERROR - [jobs-poll cmd] cylc jobs-poll --host=login.archer.ac.uk --user=vittoria -- '$HOME/cylc-run/u-au022/log/job' 18920101T0000Z/coupled/04
	[jobs-poll ret_code] 1
	[jobs-poll err]
	Host key verification failed.
	ERROR: remote command failed 255



Last edited 18 months ago by mvguarino

comment:7 Changed 18 months ago by ros

I logged into login3.archer without needing to type a password or passphrase, though I was asked whether I was sure I wanted to continue connecting.

Yes, I think this is why rose host-select is failing on some nodes. Once you've answered "yes" to the connection question for each node, it should then be OK.

Cheers,
Ros.

comment:8 Changed 18 months ago by mvguarino

Hi Ros,

I went through all the nodes but, unfortunately, I still get ‘ssh failed’ for some of them.
Something must have changed because everything was working fine until the day before yesterday.

At the same time, there is a coupled task submitted 2 days ago (when all of this started) whose status is still ‘running’ on rose-bush (http://puma.nerc.ac.uk/rose-bush/taskjobs/mvguarino/u-au022?cycles=18920101T0000Z ). This task is not displayed among my jobs on ARCHER (qstat -u), and indeed it shouldn't be, as I deleted that job. Should this be a concern, or is it just a rose-bush issue?

Thanks,

Vittoria

comment:9 Changed 18 months ago by ros

Hi Vittoria,

As long as rose host-select is finding an OK node, it doesn't really matter.

However, I've got a problem with login3.archer.ac.uk. I assume the connection message is the same as yours:

ros@puma$ ssh login3.archer.ac.uk
Warning: the ECDSA host key for 'login3.archer.ac.uk' differs from the key for the IP address '193.62.216.44'
Offending key for IP in /home/ros/.ssh/known_hosts:66
Matching host key in /home/ros/.ssh/known_hosts:63
Are you sure you want to continue connecting (yes/no)? 

You will need to remove the offending key from your known_hosts file, as indicated, and then you will be able to connect to the node without any further input.
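
A minimal sketch of that step, using the standard OpenSSH tools (the IP address is the one from the warning above; your known_hosts line numbers may differ):

ssh-keygen -R 193.62.216.44
ssh <username>@login3.archer.ac.uk

The first command removes the stale entry for that IP address from ~/.ssh/known_hosts; the second checks that the connection now completes without prompting.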

For the suite still showing a task as running, cd to the ~/roses/suiteid directory and run rose sgc to bring up the Cylc GUI. If the task is still showing as running, try polling it to get the status to update. If the suite is no longer running, don't worry about it - it's doing no harm.
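
Polling can also be done from the command line on PUMA; a sketch, with the suite, task and cycle point as placeholders:

cylc poll <suite-id> <task>.<cycle-point>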

Cheers,
Ros.

comment:10 Changed 18 months ago by mvguarino

Hi Ros,
I didn’t see the warning, thanks.
I removed the offending keys for all the nodes and now I can log in without any user input.
One of the suites is still queuing; the other one (u-au022), however, started to run and then failed:

2018-03-01T14:59:15Z WARNING - suite stalled
2018-03-01T14:59:15Z WARNING - Unmet prerequisites for postproc_atmos.18920101T0000Z:
2018-03-01T14:59:15Z WARNING -  * coupled.18920101T0000Z succeeded
2018-03-01T14:59:15Z WARNING - Unmet prerequisites for postproc_cice.18920101T0000Z:
2018-03-01T14:59:15Z WARNING -  * coupled.18920101T0000Z succeeded
2018-03-01T14:59:15Z WARNING - Unmet prerequisites for postproc_nemo.18920101T0000Z:
2018-03-01T14:59:15Z WARNING -  * coupled.18920101T0000Z succeeded
2018-03-01T14:59:15Z WARNING - Unmet prerequisites for coupled.18920201T0000Z:
2018-03-01T14:59:15Z WARNING -  * coupled.18920101T0000Z succeeded

Do you have any advice on how to overcome this issue?

Thanks,
Vittoria

comment:11 Changed 18 months ago by ros

It's crashed with an MPI error (see job.err), but as yet I don't know why.

Cheers,
Ros.

comment:12 Changed 18 months ago by mvguarino

Yes, I saw that, but the error doesn't tell me much.
The suite crashed with the same (vague) error when I was having the host problem, but I assume that has been resolved.

Vittoria

comment:13 Changed 18 months ago by mvguarino

Hi Ros,
I realised that both of the simulations I was running (u-as245 and u-au022) were at the end of a cycle when they crashed because of the host problem.
Most of the expected model outputs (as far as I can see) were created and copied to the RDF: for u-as245 all the outputs seem to be there, but for u-au022 the NEMO outputs are missing from the RDF.

As restarting the suites from the cycle points where they failed produced an MPI abort error, I tried the following:

  • Restarting u-as245 from 19200301T0000Z (assuming cycle 19200201T0000Z ran successfully, since all the outputs are there). This produces the following error:
Traceback (most recent call last):
  File "./link_drivers", line 183, in <module>
    envinsts, launchcmds = _run_drivers(common_envars, mode)
  File "./link_drivers", line 66, in _run_drivers
    '(common_envars,\'%s\')' % (drivername, mode)
  File "<string>", line 1, in <module>
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200301T0000Z/coupled/nemo_driver.py", line 648, in run_driver
    exe_envar = _setup_executable(common_envar)
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200301T0000Z/coupled/nemo_driver.py", line 568, in _setup_executable
    controller_mode)
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200301T0000Z/coupled/top_controller.py", line 370, in run_controller
    nemo_dump_time)
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200301T0000Z/coupled/top_controller.py", line 248, in _setup_top_controller
    % top_dump_time % nemo_dump_time)
TypeError: not enough arguments for format string
[FAIL] run_model # return-code=1
Received signal ERR
cylc (scheduler - 2018-03-05T11:51:21Z): CRITICAL Task job script received signal ERR at 2018-03-05T11:51:21Z
cylc (scheduler - 2018-03-05T11:51:21Z): CRITICAL failed at 2018-03-05T11:51:21Z
  • Setting the status of coupled.18920101T0000Z to ‘succeeded’ and running the post-processing.

postproc_atmos and postproc_cice run fine, but the problem seems to be the NEMO post-processing:

[WARN]  [SUBPROCESS]: Command: python2.7 /work/n02/n02/vittoria/cylc-run/u-au022/share/fcm_make_pp/build/bin/icb_pp.py -t /work/n02/n02/vittoria/cylc-run/u-au022/share/data/History_Data/NEMOhist/trajectory_icebergs_18920101-18920201_ -n 72 -o /work/n02/n02/vittoria/cylc-run/u-au022/share/data/History_Data/NEMOhist/au022o_trajectory_icebergs_18920101-18920201.nc
[SUBPROCESS]: Error = 1:
	Traceback (most recent call last):
  File "/work/n02/n02/vittoria/cylc-run/u-au022/share/fcm_make_pp/build/bin/icb_pp.py", line 82, in <module>
    icu = np.concatenate(icu)
ValueError: need at least one array to concatenate

[ERROR]  icb_pp: Error=1
	Traceback (most recent call last):
  File "/work/n02/n02/vittoria/cylc-run/u-au022/share/fcm_make_pp/build/bin/icb_pp.py", line 82, in <module>
    icu = np.concatenate(icu)
ValueError: need at least one array to concatenate

 -> Failed to rebuild file: trajectory_icebergs_18920101-18920201
[FAIL]  Command Terminated
[FAIL] Terminating PostProc...
[FAIL] main_pp.py nemo # return-code=1
Received signal ERR
cylc (scheduler - 2018-03-03T22:16:49Z): CRITICAL Task job script received signal ERR at 2018-03-03T22:16:49Z
cylc (scheduler - 2018-03-03T22:16:49Z): CRITICAL failed at 2018-03-03T22:16:49Z

I don't know what is causing these problems, so I was thinking of restarting both suites as NRUNs using archived restart files, in the hope that this would solve everything.
As that is quite drastic, is there anything else you would suggest I try before proceeding with an NRUN?

Thanks,

Vittoria

comment:14 Changed 18 months ago by grenville

Vittoria

Did you do what it says at the end of

/home/n02/n02/vittoria/cylc-run/u-as245/log/job/19200301T0000Z/coupled/01/job.out ?

There are instructions on how to restart coupled jobs here:

http://collab.metoffice.gov.uk/twiki/bin/view/Project/HiResCL/HiResCLPRIMSimulations

Grenville

comment:15 Changed 18 months ago by mvguarino

Hi Grenville,

I followed those instructions to restart the suite from January, as I thought that might be a way around the problem (but as a result I got the generic MPI error again). As a consequence, I deleted all NEMO restart files later than 1900101 … which makes me realise that what I was trying to do in this last attempt couldn't have worked anyway.

Any idea of what is causing this?
http://puma.nerc.ac.uk/rose-bush/view/mvguarino/u-au022?&no_fuzzy_time=0&path=log/job/18920101T0000Z/postproc_nemo/01/job.err

It says it can't concatenate, but all the trajectory_icebergs_* files are there…

Vittoria

comment:16 Changed 18 months ago by ros

Hi Vittoria,

Having consulted some NEMO folks, we think the most likely cause is that there are no icebergs for that period, and the Python script fails in that case. There is a fix for this, which I'm just trying to extract so that you can simply copy the file into place.
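
For reference, np.concatenate raises "need at least one array to concatenate" when it is handed an empty list, which is what happens if no processor wrote any iceberg trajectories for the period. The snippet below is only a sketch of the kind of guard the fixed script needs (it is not the actual postproc code, and the function name is illustrative):

import numpy as np

def join_trajectories(chunks):
    # Join per-processor iceberg trajectory arrays, tolerating the
    # no-icebergs case instead of letting np.concatenate raise ValueError.
    chunks = [c for c in chunks if len(c) > 0]
    if not chunks:
        return np.empty(0)
    return np.concatenate(chunks)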

Cheers,
Ros.

comment:17 Changed 18 months ago by mvguarino

Hi Ros,

Thank you.
I was just looking at the size of the icebergs.stat_* files in ~/cylc-run/u-au022/work/18920101T0000Z/coupled. They look quite small compared to the iceberg files of the previous cycle.

Vittoria

comment:18 Changed 18 months ago by mvguarino

Hi,

While waiting to resolve the issue with u-au022:

u-as245 is now running; the problem was solved as follows:

  • restarting the simulation from cycle 19200101T0000Z - as245.xhist, ice.restart_file and the NEMO dumps were adjusted accordingly.
  • overwriting the NEMO namelist_cfg file with the namelist_cfg of the previous cycle:

cp ~/cylc-run/u-as245/work/19191201T0000Z/coupled/namelist_cfg ~/cylc-run/u-as245/share/data/History_Data/NEMOhist/namelist_cfg

The wrong namelist_cfg produced a generic MPI error in job.err and a more detailed error in ~/cylc-run/u-as245/work/19200101T0000Z/coupled/ocean.output:

 ===>>> : E R R O R
         ===========

  ===>>>> : problem with nittrc000 for the restart
  verify the restart file or rerun with nn_rsttr = 0 (namelist)
.
.
.
 ===>>> : E R R O R
         ===========

 STOP
 Critical errors in NEMO initialisation
 huge E-R-R-O-R : immediate stop  

This may also be the reason why I couldn’t re-run the coupled task of u-au022.

Vittoria

comment:19 Changed 18 months ago by ros

Hi Vittoria,

Sorry for the delay.

Please copy the file ~ros/temp/icb_pp.py into your ~/cylc-run/u-au022/share/fcm_make_pp/build/bin directory on ARCHER, then re-trigger the failed postproc task; hopefully that will fix the problem.

If you want to include this fix in future suites, you will need to set the following in fcm_make_pp → Configuration:

config_base: fcm:moci.xm-br/dev/davestorkey/postproc_2.2_iceberg_update
config_rev: @2477
pp_rev: 2477

Regards,
Ros.

Last edited 6 months ago by ros

comment:20 Changed 18 months ago by mvguarino

Hi Ros,

Thanks for this. I will try it and let you know.
In the meantime, u-as245 ran fine for 3 cycles and then crashed with this error:

Traceback (most recent call last):
  File "./link_drivers", line 183, in <module>
    envinsts, launchcmds = _run_drivers(common_envars, mode)
  File "./link_drivers", line 66, in _run_drivers
    '(common_envars,\'%s\')' % (drivername, mode)
  File "<string>", line 1, in <module>
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200501T0000Z/coupled/nemo_driver.py", line 648, in run_driver
    exe_envar = _setup_executable(common_envar)
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200501T0000Z/coupled/nemo_driver.py", line 568, in _setup_executable
    controller_mode)
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200501T0000Z/coupled/top_controller.py", line 370, in run_controller
    nemo_dump_time)
  File "/fs2/n02/n02/vittoria/cylc-run/u-as245/work/19200501T0000Z/coupled/top_controller.py", line 248, in _setup_top_controller
    % top_dump_time % nemo_dump_time)
TypeError: not enough arguments for format string
[FAIL] run_model # return-code=1
Received signal ERR
cylc (scheduler - 2018-03-07T09:49:59Z): CRITICAL Task job script received signal ERR at 2018-03-07T09:49:59Z
cylc (scheduler - 2018-03-07T09:49:59Z): CRITICAL failed at 2018-03-07T09:49:59Z

It can't find the NEMO restart files, and indeed in NEMOhist the last restart files written out are the 19200301 ones. However, looking at the job.out of coupled.19200401T0000Z, the 19200501 restart files seem to have been produced (http://puma.nerc.ac.uk/rose-bush/view/mvguarino/u-as245?&no_fuzzy_time=0&path=log/job/19200401T0000Z/coupled/01/job.out). Could this be a memory problem?

Thanks,

Vittoria

comment:21 Changed 18 months ago by ros

Issue with u-as245 moved to new ticket #2427

comment:22 Changed 18 months ago by mvguarino

u-au022 is now running.

Thank you,

Vittoria

comment:23 Changed 18 months ago by ros

  • Resolution set to fixed
  • Status changed from new to closed
  • UM Version changed from <select version> to 10.7

That's great. Thanks for letting us know. I'll close this ticket now.

Cheers,
Ros.
