Opened 9 days ago

Closed 3 days ago

#2759 closed help (fixed)

CMIP6 Eocene coupled suite working (yay!) but postprocessing/archiving not

Reported by: charlie
Owned by: ros
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi,

You'll be pleased to hear that my fully-coupled Eocene simulation (u-be195) is now working, and has run for a year. I had to slow down the ocean timestep because of a blowup around month 3, so our thinking was to run it for a year and then restart (in a new suite) from the resulting start dump, in the hope that this might stabilise the ocean and allow the faster timestep.

Before I do that, though, I have two questions about the postprocessing and archiving:

Firstly, my current postprocessing. I have run this suite for a year, with 3 month cycling. However, the last cycle hasn’t been sent to NEXCS (at /projects/nexcs-n02/cwilliams/sweet/gc31/u-be195) - only the first 3 cycles are here (18500101T0000Z, 18500401T0000Z and 18500701T0000Z) but the last (18501001T0000Z) isn’t. Instead, the remaining 3 months (October, November and December) still appear to be in: /home/d05/cwilliams/cylc-run/u-be195/share/data/History_Data. A Met Office colleague told me that in older versions of postproc, it was necessary to run one more cycle than you wanted in order to trigger the postprocessing, e.g. if I wanted it to postprocess one year, I would have to run 1 year and 3 months. Is this what's going on here? I thought we had already upgraded the suite to version 2.2 (I thought we did that a while ago to get it working), but perhaps not? Is there anything we can do about this?
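A quick way to see which cycles have reached the archive directory is to compare the expected cycle points with what is on disk. The following is a minimal sketch, assuming the archive directory holds one entry per cycle point (as the listing above suggests, though this is not confirmed) and that cycle points always fall on the first of a month; `expected_cycles` and `missing_cycles` are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: print COUNT cycle points starting at 1 January of YEAR,
# advancing STEP months each time (e.g. 3 for 3-month cycling).
expected_cycles() {
    local year=$1 step=$2 count=$3 month=1 i
    for ((i = 0; i < count; i++)); do
        printf '%04d%02d01T0000Z\n' "$year" "$month"
        month=$((month + step))
        while ((month > 12)); do
            month=$((month - 12))
            year=$((year + 1))
        done
    done
}

# Hypothetical helper: print each given cycle point that has no entry in ARCHIVE_DIR.
missing_cycles() {
    local archive_dir=$1 point
    shift
    for point in "$@"; do
        [ -e "$archive_dir/$point" ] || echo "$point"
    done
}

# Example call (not run here):
#   missing_cycles /projects/nexcs-n02/cwilliams/sweet/gc31/u-be195 $(expected_cycles 1850 3 4)
```

For the situation described above, this would be expected to report only the final cycle point of the year.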

Secondly, my current archiving. My output appears to be going to /projects/nexcs-n02/cwilliams/sweet/gc31/u-be195/ on NEXCS, as it should (because I set this in the suite at postproc > Archer archiving), but then doesn't appear to be automatically transferring to JASMIN. It should be going automatically from NEXCS to JASMIN (at /group_workspaces/jasmin2/nexcs/cwilliams2011/sweet/gc31), which is set in postproc > JASMIN transfer, but I have just checked this directory, and it's empty. I thought I had followed all of Ros' instructions correctly, but perhaps I missed something?

Please can you help?

Charlie

Change History (28)

comment:1 Changed 8 days ago by charlie

Hi again,

Further to this, I have now copied u-be195 to another suite (after committing), u-bf814. I have started u-be195 (which is slow due to a slowed down ocean, but works) going again, and I'm about to start u-bf814 (which restarts from u-be195 and is using the usual ocean timesteps, but might become unstable again). Both suites, therefore, have the above problem when postprocessing/archiving, so please can we fix both, even if it means restarting?

Charlie

comment:2 Changed 6 days ago by ros

Hi Charlie,

How did you run the suite for 1 year? Did you have the run target length set to 1 year, so that the suite stopped itself, or did you have the run length set to, say, 20 years and then stop it manually after 1 year? The final cycle files will only be archived when the suite reaches its target run length and you have ARCHIVE_FINAL=true.

I've looked at your suite regarding pptransfer and it doesn't contain any of the pptransfer-related code in the suite.rc or archer.rc files. I suspect you haven't followed the ARCHER-specific instructions that were linked to at the bottom of the page I gave you. Namely: http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppArcherSetup - you are in Scenario 1, having had to add the transfer app, so you need to follow the Scenario 1 instructions.

Once you have done this setup correctly you will see a pptransfer task when you start your suite running.

Regards,
Ros.

comment:3 Changed 6 days ago by ros

Of course I meant NEXCS instructions not ARCHER!!! Please follow the instructions on the NEXCS page: http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppNexcsSetup

comment:4 Changed 5 days ago by charlie

Thanks so much Ros, I'm just about to do this to both my suites. Does it matter that one of my suites (u-be195) is currently running, or rather queueing, when I do this? Will it pick up my changes at the next cycle, or do I need to shut down, clean and restart from the beginning?

Regarding your first question: yes, I ran the suite for 1 year only, which it successfully completed and so it stopped itself. Nevertheless, it only transferred the first 3 cycles, not the last. The last was only transferred once I restarted it (it is now set to go for 10 years).

Charlie

comment:5 Changed 5 days ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Charlie,

You don't need to stop the suite; however, you will need to do a couple of things to force it to pick up and insert the new task.

1) Reload the suite: rose suite-run --reload
2) Hold the whole suite, or just the next postproc task
3) In the Cylc GUI: Control -> Insert Task(s)...
4) Set TASK-NAME.CYCLE-POINT=pptransfer.<YYYYMMDDT0000Z>, where <YYYYMMDDT0000Z> is an active cycle point
5) Leave stop-point=POINT blank
6) Check the "Do not check if a cycle point is valid or not" box
7) Insert, and the pptransfer task should appear in the GUI
8) If nothing happens: You probably typed something incorrectly! Try again.
9) Release the held suite/postproc task

If the pptransfer task still doesn't appear, you may have made a mistake editing the suite.rc/MONSooN.rc file.
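The GUI steps above can probably also be done from the command line. The following is a minimal sketch, assuming Cylc 7 (the logs in this ticket report CYLC_VERSION=7.8.1), where `cylc hold`, `cylc insert --no-check` and `cylc release` are available. The function names `task_id` and `insert_pptransfer` are hypothetical helpers, and the cycle point in the example call is only a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: join a task name and a cycle point into TASK-NAME.CYCLE-POINT.
task_id() {
    printf '%s.%s\n' "$1" "$2"
}

# Hypothetical wrapper around steps 2-9. Run it only after a successful
# "rose suite-run --reload" (step 1) from the suite directory.
insert_pptransfer() {
    local suite=$1 point=$2 status
    cylc hold "$suite" || return 1            # step 2: hold the whole suite
    # steps 3-7: insert the task without checking that the cycle point is valid;
    # no stop point is given, matching step 5
    cylc insert --no-check "$suite" "$(task_id pptransfer "$point")"
    status=$?
    cylc release "$suite"                     # step 9: release even if the insert failed
    return "$status"
}

# Example call (not run here); the cycle point must be an active one:
#   insert_pptransfer u-be195 18510101T0000Z
```

Holding the whole suite rather than a single postproc task keeps the sketch short; holding only the next postproc task, as step 2 allows, would need the task's own identifier.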

Not sure what happened re the archiving. Now that the suite is running again I am unable to track that problem down. Hopefully it will work as expected at the end of the 10 years.

Cheers,
Ros.

comment:6 Changed 5 days ago by charlie

Okay, many thanks. I will do all that now. Do I need to reload the suite, even though it is currently queueing the coupled stage?

comment:7 Changed 5 days ago by charlie

Okay Ros, I have now followed those instructions and got to the part where it says I need to contact you to help set up the ssh-key to connect to JASMIN.

comment:8 Changed 5 days ago by ros

Hi Charlie,

I have sent you instructions for ssh-key setup by email.

Cheers,
Ros.

comment:9 Changed 5 days ago by charlie

Thanks Ros.

NOTE (on behalf of Ros): these instructions for ssh-key setup are not suite dependent, so if they have been done once (e.g. for other suites), they don't need to be done again.

comment:10 Changed 5 days ago by charlie

Hi Ros,

Right, I'm clearly not doing the reloading properly. My suite is currently queueing at the coupled stage, so I didn't reload it first. Or at least, I tried to, but got:

cwilliams@xcslc0:~/roses/u-be195> rose suite-run --reload
[INFO] export CYLC_VERSION=7.8.1
[INFO] export ROSE_ORIG_HOST=xcslc0
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.0
[INFO] delete: log/rose-suite-run.conf
[INFO] symlink: rose-conf/20190211T161521-reload.conf <= log/rose-suite-run.conf
[INFO] delete: log/rose-suite-run.version
[INFO] symlink: rose-conf/20190211T161521-reload.version <= log/rose-suite-run.version
[INFO] delete: suite.rc
[INFO] install: suite.rc
[FAIL] cylc validate -o /working/d05/cwilliams/jtmp/tmp.PZRjOIQVmi/tmpo8vEla --strict u-be195 # return-code=1, stderr=
[FAIL] WARNING - naked dummy tasks detected (no entry under [runtime]):
[FAIL] 	+	fcm_make2_pptransfer
[FAIL] 	+	pptransfer
[FAIL] 	+	fcm_make_pptransfer
[FAIL] 'ERROR: strict validation fails naked dummy tasks'
cwilliams@xcslc0:~/roses/u-be195>

Assuming that doesn't matter, I went ahead with the other steps, but nothing happens at step 7. I have double-checked for typos and tried again, but still nothing appears.

I don't think I made a mistake editing my suite.rc, at least I followed the instructions precisely, but perhaps I did?

Charlie

comment:11 Changed 5 days ago by ros

Hi Charlie,

Ok. Your suite is slightly different as it is set to split up postproc into multiple tasks, so the instructions are slightly different. :-( It's very difficult to cater for all suites, as you know they can vary, but I will modify the instructions to try and cater for the splitpp setup.

Around about line 282 please replace the POSTPROC if block with the following:

{% if POSTPROC %}
  {% if SPLIT_PP %}
    [[postproc_atmos]]
        inherit = None, POSTPROC

    [[postproc_nemo]]
        inherit = None, POSTPROC

    [[postproc_cice]]
        inherit = None, POSTPROC
        
  {% else %}
    [[postproc]]
        inherit = None, POSTPROC
  {% endif %}

  {% if PPTRANSFER %}
    [[pptransfer]]
        inherit = None, POSTPROC
        [[[environment]]]
            ROSE_TASK_APP = postproc
  {% endif %}    
{% endif %}   

Cheers,
Ros.

comment:12 Changed 5 days ago by ros

P.S. You must do the reload before trying to insert the pptransfer task. The reloading reloads the suite modifications into the running suite. Without doing this the suite will not know anything about your changes.
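One way to confirm that the reload has taken effect before attempting the insert is to ask Cylc which tasks the suite now defines. This is a minimal sketch, assuming Cylc 7's `cylc list` prints one task name per line and that the suite lives under `~/roses`; `has_task` and `reload_and_check` are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: succeed if stdin contains a line exactly equal to the task name.
has_task() {
    grep -qx -- "$1"
}

# Hypothetical wrapper: reload the running suite, then confirm that the
# reloaded definition knows about pptransfer.
reload_and_check() {
    local suite=$1
    (cd "$HOME/roses/$suite" && rose suite-run --reload) || return 1
    if cylc list "$suite" | has_task pptransfer; then
        echo "pptransfer is defined in $suite"
    else
        echo "pptransfer is NOT defined in $suite; check suite.rc" >&2
        return 1
    fi
}

# Example call (not run here):
#   reload_and_check u-be195
```

If the reload itself fails validation, as in comment:10, the function stops there and the insert should not be attempted.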

comment:13 Changed 5 days ago by charlie

Okay, I have now done that and reloaded, but different error this time:

cwilliams@xcslc0:~/roses/u-be195> rose suite-run --reload
[INFO] export CYLC_VERSION=7.8.1
[INFO] export ROSE_ORIG_HOST=xcslc0
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.0
[INFO] delete: log/rose-suite-run.conf
[INFO] symlink: rose-conf/20190211T165427-reload.conf <= log/rose-suite-run.conf
[INFO] delete: log/rose-suite-run.version
[INFO] symlink: rose-conf/20190211T165427-reload.version <= log/rose-suite-run.version
[INFO] delete: suite.rc
[INFO] install: suite.rc
[FAIL] cylc validate -o /working/d05/cwilliams/jtmp/tmp.PZRjOIQVmi/tmpIVQ2U7 --strict u-be195 # return-code=1, stderr=
[FAIL] ERROR - POSTPROC:succeed-all => pptransfer
[FAIL] 'ERROR, self-edge detected: pptransfer:succeed => pptransfer'
cwilliams@xcslc0:~/roses/u-be195>
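
The self-edge appears to arise because pptransfer inherits from POSTPROC in the block given in comment:11, so a family trigger on POSTPROC also covers pptransfer itself. The following is a minimal sketch for listing which namespaces in a suite.rc inherit from a given family; `family_members` is a hypothetical helper that reads the raw file line by line, ignores Jinja2 logic, and assumes each `inherit` setting sits on a single line.

```shell
#!/bin/bash
# Hypothetical helper: family_members FILE FAMILY
# Prints every [[namespace]] whose "inherit = ..." line names FAMILY.
family_members() {
    awk -v fam="$2" '
        # Remember the most recent [[name]] header (exactly two brackets,
        # so [[[environment]]] sub-sections are skipped).
        /^[ \t]*\[\[[^][]+\]\][ \t]*$/ {
            name = $0
            gsub(/[^A-Za-z0-9_-]/, "", name)
            next
        }
        # Split the inherit list on commas and compare each parent with FAMILY.
        /^[ \t]*inherit[ \t]*=/ {
            line = $0
            sub(/^[^=]*=/, "", line)
            n = split(line, parts, ",")
            for (i = 1; i <= n; i++) {
                gsub(/[ \t]/, "", parts[i])
                if (parts[i] == fam) print name
            }
        }
    ' "$1"
}

# Example call (not run here):
#   family_members ~/roses/u-be195/suite.rc POSTPROC
```

If pptransfer shows up in the output, any graph line of the form `POSTPROC:succeed-all => pptransfer` makes pptransfer depend on itself.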

comment:14 Changed 5 days ago by ros

Hi Charlie,

Further complication of the splitpp. Rather than try and tell you which bits need changing I've taken your suite.rc file and modified it. Please copy my ~rhatcher/roses/u-be195/suite.rc. Do the rose suite-run --reload and then follow the instructions to insert the pptransfer task.

Cheers,
Ros.

CMS Note: Details on the changes required are documented in ticket #2690

comment:15 Changed 5 days ago by charlie

Thanks Ros.

The first problem, however, is that I can't get my suite to run, after whatever happened to the machine yesterday afternoon (despite an email this morning saying it was all back to normal):

cwilliams@xcslc0:~/roses/u-be195> rose suite-run --reload
[FAIL] u-be195: does not appear to be running
cwilliams@xcslc0:~/roses/u-be195> rose suite-run --restart
[INFO] export CYLC_VERSION=7.8.1
[INFO] export ROSE_ORIG_HOST=xcslc0
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.0
[INFO] delete: log/rose-suite-run.conf
[INFO] symlink: rose-conf/20190212T114813-restart.conf <= log/rose-suite-run.conf
[INFO] delete: log/rose-suite-run.version
[INFO] symlink: rose-conf/20190212T114813-restart.version <= log/rose-suite-run.version
[FAIL] bash -ec H=$(rose\ host-select\ xcs-c);\ echo\ $H # return-code=1, stderr=
[FAIL] [WARN] xcs-c: (ssh failed)
[FAIL] [FAIL] No hosts selected.

I haven't changed anything since the reboot.

Charlie

comment:16 Changed 5 days ago by ros

Hi Charlie,

I'm not entirely convinced everything is hunky-dory with Monsoon still...

Can you try logging out of xcs and back in again? If that doesn't clear it can you try running rose host-select xcs-c on the command line. You should get back the following:

rhatcher@xcs-c$ rose host-select xcs-c
xcs-c

Also try running ssh xcs-c - you may get a couple of keysign messages but you should be logged in ok.

I can submit your suite fine as well as stop & restart it.

Give those few things a go and let me know how you get on.
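
The ssh check above can also be run non-interactively, so that a password prompt shows up as a failure instead of hanging. This is a minimal sketch, assuming OpenSSH's `BatchMode` and `ConnectTimeout` options; `check_host` is a hypothetical helper, and the `SSH` variable is only a convenience for substituting another command when trying the function out.

```shell
#!/bin/bash
# Hypothetical helper: report whether passwordless ssh to HOST works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_host() {
    local host=$1
    if "${SSH:-ssh}" -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
        echo "$host: passwordless ssh OK"
    else
        echo "$host: passwordless ssh FAILED"
        return 1
    fi
}

# Example call (not run here):
#   check_host xcs-c
```

A failure here would be consistent with `rose host-select xcs-c` reporting "ssh failed", since host selection relies on being able to reach the host over ssh.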

Cheers,
Ros.

comment:17 Changed 5 days ago by charlie

I have just tried all of those, including logging out and back in again, and whatever they did yesterday has properly messed up my account, at least!

cwilliams@xcslc0:~> rose host-select xcs-c
[WARN] xcs-c: (ssh failed)
[FAIL] No hosts selected.
cwilliams@xcslc0:~> ssh xcs-c
could not open any host key
ssh_keysign: no reply
key_sign failed

Charlie

comment:18 Changed 5 days ago by ros

Hi Charlie,

With the keysign failure does it actually go on to log you into xcs-c or does it just get stuck - for comparison I get:

rhatcher@xcs-c$ ssh xcs-c
could not open any host key
ssh_keysign: no reply
key_sign failed
Last login: Tue Feb 12 12:22:15 2019 from 10.168.5.6

    This computer is provided for the processing of Official Information.
    Unauthorized access may constitute a criminal offence. All activity
    on the system is liable to monitoring.


rhatcher@xcs-c$ 

Please confirm and then I'll get in contact with the Met Office.

Cheers,
Ros.

comment:19 Changed 5 days ago by charlie

It just asks for a password:

cwilliams@xcslc0:~> ssh xcs-c
could not open any host key
ssh_keysign: no reply
key_sign failed
Password: 

Which password is this? I didn't think I had any passwords here, other than my MOSRS password and of course the key fob?

comment:20 Changed 4 days ago by ros

Hi Charlie,

I think something may have changed on Monsoon; Willie also sees the same symptoms as you. I've contacted the Met Office. It works ok for me because I happen to have an extra entry in my authorized_keys file from when I was helping set up XCS.

I'll get back to you when I have any news.

Cheers,
Ros.

comment:21 Changed 4 days ago by ros

Hi Charlie,

The password issue should be fixed now. Can you try submitting again?

Cheers,
Ros.

comment:22 Changed 4 days ago by charlie

Great, I have now restarted, reloaded and have successfully inserted pptransfer (which has now appeared in my current cycle). Will this appear in the next cycle, in due course?

Also, the modification you made to my suite.rc - is that suite specific? I ask because I need to repeat the entire process in my 2nd suite - can I just repeat the above, obviously following the above process and pasting your new lines into the new suite.rc?

Charlie

comment:23 Changed 4 days ago by ros

Hi Charlie,

That's great. Yes, the pptransfer task should appear automatically in subsequent cycles.

If your 2nd suite is a copy of, or is set up similarly to, the current suite you should be able to follow the process above and then mirror the changes I made to the suite.rc file in your new suite's suite.rc file.

Cheers,
Ros.

comment:24 Changed 3 days ago by charlie

Thanks Ros, I will make the same changes to my other suite (which is indeed a direct copy).

However, I'm not sure it has properly worked - looking at my sgc, 185110 has completed but the pptransfer and housekeeping apps are just "waiting". Likewise, 185201 has completed, but there is no pptransfer at all. Lastly, 185204 is now running the coupled stage, but again there is no pptransfer at all. And nothing has appeared in the appropriate directory on JASMIN.

Charlie

comment:25 Changed 3 days ago by ros

Hi Charlie,

Have you tried triggering the pptransfer task?

Regarding the subsequent cycles: tasks only appear in the next cycle when the previous cycle's task has been submitted, i.e. when the pptransfer task in 185110 is submitted, the next pptransfer task should appear in the 185201 cycle.
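
Triggering a waiting task can probably also be done from the command line. This is a minimal sketch, assuming Cylc 7's `cylc trigger` and assuming the shorthand 185110 stands for a cycle point in the same YYYYMM01T0000Z form used in the ticket description; `full_point` and `trigger_pptransfer` are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: expand a YYYYMM shorthand into a full cycle point,
# assuming cycles start at 00:00Z on the first of the month.
full_point() {
    printf '%s01T0000Z\n' "$1"
}

# Hypothetical wrapper: manually trigger pptransfer for one cycle of a running suite.
trigger_pptransfer() {
    local suite=$1 yyyymm=$2
    cylc trigger "$suite" "pptransfer.$(full_point "$yyyymm")"
}

# Example call (not run here):
#   trigger_pptransfer u-be195 185110
```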

Cheers,
Ros.

comment:26 Changed 3 days ago by charlie

Ah yes, that seems to be working now, and yes the next one has appeared in the next cycle.

I'll just try making the changes to my other suite and, assuming this also works, will close this ticket.

Many thanks

Charlie

comment:27 Changed 3 days ago by charlie

Yep, the above process appears to have worked in my other suite as well. Many thanks again.

Charlie

comment:28 Changed 3 days ago by charlie

  • Resolution set to fixed
  • Status changed from accepted to closed