Opened 3 months ago

Closed 4 weeks ago

#3009 closed help (completed)

setup of suite ensemble

Reported by: xd904476
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 10.7

Description (last modified by ros)

Hi, I am setting up a suite with 5 ensemble members (u-bm680), starting from David Case's suite u-bd149.
I have a problem in the suite.rc file. The graph lines originally (from u-bl536) look like this:

            graph = """
{{ 'validate_suite_info => install_ancil'  + ( ' & fcm_make2_pp' if POSTPROC else '') + (' & '  + FCMUM_LAST if BUILD_UM else '') + (' & fcm_make2_ocean' if BUILD_OCEAN else '') + (' & recon' if RECON else '') }}
{{ 'fcm_make_pp => fcm_make2_pp' + (' => POSTPROC' if RUN else '') if POSTPROC else '' }}
{{ 'fcm_make_pptransfer => fcm_make2_pptransfer' + (' => pptransfer' if RUN else '') if PPTRANSFER else '' }}  
{{ 'fcm_make_ocean => fcm_make2_ocean' + (' => recon' if RECON else ' => coupled' if RUN else '') if BUILD_OCEAN else '' }}
{{ FCMUM_GRAPH + (' => recon' if RECON else ' => coupled' if RUN else '') if BUILD_UM else '' }}
{{ 'fcm_make_drivers => fcm_make2_drivers' + (' => coupled' if RUN else '') if BUILD_DRIVERS else '' }}
{{ 'install_ancil => recon ' if RECON else ('install_ancil => coupled' if RUN else '')}}
{{ 'recon' + (' => coupled' if RUN else '') if RECON else '' }}
{{ 'clearout' + (' => coupled' if RUN else '') if CLEAROUT else '' }}
"""

and should include the extra ensemble and perturbation changes that are in suite.rc of u-bd149 which in these lines looks like this:

        [[[ R1 ]]]
            graph = """
{{ 'fcm_make_pp => fcm_make2_pp' + (' => POSTPROC_GROUP' if RUN else '') if POSTPROC else '' }}
{{ 'fcm_make_ocean => fcm_make2_ocean' + (' => recon' if RECON else ' => perturb<ensemble> => coupled<ensemble>' if RUN else '') if BUILD_OCEAN else '' }}
{{ FCMUM_GRAPH + (' => recon' if RECON else ' => perturb<ensemble> => coupled<ensemble>' if RUN else '') if BUILD_UM else '' }}
{{ 'fcm_make_drivers => fcm_make2_drivers' + (' => coupled<ensemble>' if RUN else '') if BUILD_DRIVERS else '' }}
{{ 'install_ancil => recon ' if RECON else ('install_ancil => perturb<ensemble> => coupled<ensemble>' if RUN else '')}}
{% if RUN %}
   {{ 'recon => perturb<ensemble> => coupled<ensemble>' if RECON else '' }}
   {{ 'clearout => coupled<ensemble>' if CLEAROUT else '' }}
   {{ 'coupled<ensemble> => plot_loadbalance' if PLOT_LOAD_BALANCE else '' }}
{% endif %}
"""

I have tried making the changes that I thought were required but there is clearly a bug that I can't see in these lines:

            graph = """
{{ 'validate_suite_info => install_ancil'  + ( ' & fcm_make2_pp' if POSTPROC else '') + (' & '  + FCMUM_LAST if BUILD_UM else '') + (' & fcm_make2_ocean' if BUILD_OCEAN else '') + (' & recon' if RECON else '') }}
{{ 'fcm_make_pp => fcm_make2_pp' + (' => POSTPROC_GROUP' if RUN else '') if POSTPROC else '' }}
{{ 'fcm_make_pptransfer => fcm_make2_pptransfer' + (' => pptransfer' if RUN else '') if PPTRANSFER else '' }}  
{{ 'fcm_make_ocean => fcm_make2_ocean' + (' => recon' if RECON else ' => coupled' if RUN else '') if BUILD_OCEAN else '' }}
{{ FCMUM_GRAPH + (' => recon' if RECON else ' => perturb<ensemble> => coupled<ensemble>' if RUN else '') if BUILD_UM else '' }}
{{ 'fcm_make_drivers => fcm_make2_drivers' + (' => coupled<ensemble>' if RUN else '') if BUILD_DRIVERS else '' }}
{{ 'install_ancil => recon ' if RECON else ('install_ancil => perturb<ensemble> => coupled<ensemble>' if RUN else '')}}
{% if RUN %}
   {{ 'recon => perturb<ensemble> => coupled<ensemble>' if RECON else '' }}
   {{ 'clearout => coupled<ensemble>' if CLEAROUT else '' }
{{ 'coupled<ensemble> => plot_loadbalance' if PLOT_LOAD_BALANCE else '' }}
{% endif %}
"""

the error I get is a syntax error and I tried a few things, but I can't get it to work.
Could you help pls?

thanks,
Dani

Attachments (1)

Screenshot 2019-09-24 at 16.43.45.png (51.7 KB) - added by xd904476 3 months ago.
Error_compiling

Change History (32)

comment:1 Changed 3 months ago by dcase

I've copied this suite, and made a number of changes which should get past the stage of processing the suite.rc . If you look at puma:/home/dcase/roses/u-bm680/suite.rc then you will have an idea, but to summarise:

  • There is a missing } in the clearout line, which should end with two braces: {{ 'clearout => coupled<ensemble>' if CLEAROUT else '' }}
  • [[POSTPROC_GROUP]] needs to be defined (it's just a dummy in mine for now)
  • [[ recon ]] is defined twice (I removed the one which inherits itself)
  • TEST_NRUN_CRUN is undefined, so I set {% set TEST_NRUN_CRUN = False %}
  • I changed the logic around in the graph. Basically I took some of the one liner if statements and unrolled them to make it easier to debug
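
A missing closing brace like the one in the first bullet is easy to overlook in long one-line expressions. The following is a minimal Python sketch, not part of Rose or Cylc, that flags lines where the number of `{{` and `}}` delimiters differs. The function name is hypothetical, and the check assumes each `{{ ... }}` expression sits on a single line, as in the graph shown in the description.

```python
def unbalanced_jinja_lines(text):
    """Return 1-based numbers of lines whose '{{' and '}}' counts differ."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.count("{{") != line.count("}}"):
            bad.append(number)
    return bad


# Excerpt of the graph from the ticket description, including the faulty line.
sample = """\
{% if RUN %}
   {{ 'recon => perturb<ensemble> => coupled<ensemble>' if RECON else '' }}
   {{ 'clearout => coupled<ensemble>' if CLEAROUT else '' }
{{ 'coupled<ensemble> => plot_loadbalance' if PLOT_LOAD_BALANCE else '' }}
{% endif %}
"""

print(unbalanced_jinja_lines(sample))  # [3]
```

This only catches delimiter mismatches on a single line; it does not validate the Jinja2 expressions themselves.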

I can't promise that this is doing the thing that you want, but compare changes and check the logic of the new suite carefully. It may be easier to get it working from this point.

Dave

comment:2 Changed 3 months ago by xd904476

Hi Dave,
thanks for your help, but I need some extra guidance. The model runs up to the point of getting to the 'perturb-couple' tasks.
I can already see in the rosie GUI that the 'perturb' graph, which should be created in suite.rc, is not there, so there is nowhere for me to add the line that links to the .py file that performs the perturbations.
I have done a full diff between the suites and I believe I've done something wrong in the statements from line 88 onwards. I think some of the "ifs" are not allowing the 'perturb' graph to be created.

Could you help pls?
Thanks,
Dani

comment:3 Changed 3 months ago by dcase

Excuse me if this appears twice:

  • you should have the perturb app referenced in the suite, but you need to copy app/perturb from u-bd149 so that it's available
  • If you can't see it in the GUI, then you can edit the command to run in app/perturb/rose-app.conf and the script in app/perturb/bin once you've copied them over

comment:4 Changed 3 months ago by xd904476

Hi Dave,
I had tried this already, and again now, with the same result. Attached you can find a screenshot of the available menus in the rose GUI.
The suite on the left in the screenshot is a copy of u-bd149 and the one on the right is u-bm680.
You can see that the "perturb" menu is missing, together with "plot load balance", which I am not worrying about at the moment.
I thought that this kind of thing was set either in the main rose-app.conf or in suite.rc, but I can't get it to work.

Thanks,
dani

comment:5 Changed 3 months ago by dcase

I can't see your upload, but that may be because our website is being a bit annoying today.

More importantly, copy the app/perturb directory from u-bd149 into your new suite. Then you can see the app in the GUI, and also you can run the script.

comment:6 Changed 3 months ago by xd904476

Hi Dave,
I am still trying to get the ensemble suite running but I often get stuck.
I have changed the 'stashs' variable in perturb_ini.py to air temperature only rather than 4 parameters, but I can't get the model to call the routine because I don't know where to set the parameter 'CYLC_TASK_PARAM_ensemble', which appears in various calls. I can't find it set in u-bd149 either. How can I solve this?
thanks
dani

comment:7 Changed 3 months ago by dcase

The CYLC_TASK_PARAM_ensemble variable is under the [[[environment]]] section in the suite.rc. You do set this (look for it in the job file to be sure).

In your job.out for ensemble 0, there is the line:

[FATAL ERROR] Output dump already exists, will not overwrite: /work/n02/n02/dflocco/cylc-run/u-bm680/share/data/bm680.astart_0

so it looks as though you are running this and have previously made a file.

comment:8 Changed 3 months ago by xd904476

Hi Dave,
I have managed to pass the "perturb" stage and the initial conditions files are created, but I keep getting the same error as before:

CYLC_TASK_PARAM_ensemble: unbound variable

which I find in

/work/n02/n02/dflocco/cylc-run/u-bm680/log/job/20150101T0000Z/coupled_ensemble0/job.err

This is also the task where the model fails. Am I looking at the wrong error log again?

thanks,
dani

comment:9 Changed 3 months ago by dcase

I've looked in your job, and you'll see that the variable is being set, but is actually being used before this point (look in the job file, and you'll see this). I expect that this wasn't a problem for me because I was using a more recent version of Cylc (you have 6.11.4). If I were you, I would either try to run with a newer version of Cylc (I think that versions up to 7.6.1 are available on Puma), or, if you don't want to change Cylc, you could hack away at the suite.rc.

To try the hacking approach, I would first look at the variable DATAM: this is set in runtime-root-environment and is causing the crash, but is also set in coupled<ensemble>-environment. In the case of this app you may be better off setting it at the coupled<ensemble> stage, as this should work. You'll have to think about what this variable is doing, though, as taking it out of the runtime-root-environment will probably cause issues for other apps.
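
One way to confirm the ordering problem described above is to scan the job file for the line where the variable is first defined and the line where it is first used. The following is a minimal Python sketch, not part of Cylc or Rose; the function name and the two-line job excerpt are hypothetical and only illustrate the check.

```python
import re


def definition_and_first_use(script, name):
    """Return (definition_line, first_use_line), 1-based; None where not found."""
    define = re.compile(r"^\s*(?:export\s+)?" + re.escape(name) + r"=")
    use = re.compile(r"\$\{?" + re.escape(name) + r"\b")
    defined_at = used_at = None
    for number, line in enumerate(script.splitlines(), start=1):
        if defined_at is None and define.search(line):
            defined_at = number
        if used_at is None and use.search(line):
            used_at = number
    return defined_at, used_at


# Hypothetical job-file excerpt: the variable is referenced on line 1
# but only exported on line 2, which a shell run with "set -u" rejects.
sample_job = """\
export DATAM="$ROSE_DATA/History_Data/ensemble_${CYLC_TASK_PARAM_ensemble}"
export CYLC_TASK_PARAM_ensemble="0"
"""

defined_at, used_at = definition_and_first_use(sample_job, "CYLC_TASK_PARAM_ensemble")
if used_at is not None and (defined_at is None or used_at < defined_at):
    print(f"used on line {used_at} before definition on line {defined_at}")
```

To apply it to a real job, read the job file into a string and pass it in place of `sample_job`.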

comment:10 Changed 3 months ago by xd904476

Hi Dave,
I was looking into upgrading the version of Cylc, but if I do this, is there a chance of other suites not working?

I understand the double setup of the variable. The variable DATAM is set as 'History_Data':
./rose-suite.conf:DATAM='History_Data'

Perhaps the issue depends on where all the ensemble members are stored under History_Data? Should I set two different variables? If so, where?
thanks,
dani

comment:11 Changed 3 months ago by dcase

Dani,

I see that you've restarted the suite with a comment. I think that doing this kind of thing is a better idea than changing the Cylc version, as apparently some of the versions behave strangely on Puma. I was suggesting this as it's possible that the new ones will export the CYLC_TASK_* variables before using them, but hopefully you can find a workaround by changing the environment, as you're trying.

I think that at the moment you have commented the environment variable out in recon as well as in the runtime root, hence the variable is unbound at the recon stage. If you set the environment variables in the recon environment, hopefully you can get back to the coupled stage.

comment:12 Changed 3 months ago by xd904476

Hi Dave, thanks.
The DATAM bit in recon was already commented out.
Should I uncomment all the lines in that bit? as in:

    [[recon]]
        inherit = None, RECON
        [[[environment]]]
            DATAM = $ROSE_DATA/{{DATAM}}
            ASTART = $ROSE_DATA/$RUNID.astart
            ENS_MEMBER = 0

Or shall I only uncomment the DATAM declaration and leave the ASTART and ENS_MEMBER commented out?

thanks,
dani

comment:13 Changed 3 months ago by dcase

I think that debugging suites involves a lot of trial and improvement. Luckily it'll crash quickly and not waste resources.

I'd uncomment the section, as it won't corrupt other apps. Presumably you're intending to only run recon once, and then perturb this file many times?

I'll try to run the suite myself too as it's impossible to work everything out a priori.

comment:14 Changed 3 months ago by dcase

Also, if you just uncomment the recon section, there will now be 2. I'd move the variables from the commented section, such as DATAM, into the first one.

Changed 3 months ago by xd904476

Error_compiling

comment:15 Changed 3 months ago by xd904476

I tried uncommenting it all: it does not compile with the attached error.
sorry, I'm not experienced with this level of suite handling

comment:16 Changed 3 months ago by xd904476

trying with moving just DATAM in the previous recon… let's see how it goes

comment:17 Changed 3 months ago by dcase

I ran the suite myself. The variable is set in two places, and it appears that it takes the value from the second instance, but declares it in the position of the first instance. This is annoying, and I don't know why it happens: possibly a newer Cylc would not have it, but possibly there is something arising from the nature of your suite being two different suites put together. Anyway, with your comment the coupled run is working (for me), but to run everything from the start you will need to put the variable into the [[[environment]]] sections of the apps that need it (as it's not in the runtime-root section anymore).

comment:18 Changed 3 months ago by xd904476

Hi Dave,
I have added the DATAM declaration in a few places but I am stuck at the 'clearout' stage: I get various errors whether I declare it or not. I have now tried moving it down under recon, but I can't get to the 'recon' stage anymore.
The suite fails before creating the perturbed initial conditions.
Is your suite running further than this? Can I look at your suite.rc?
dani

comment:19 Changed 3 months ago by dcase

I reran the suite this afternoon, until the main jobs started running. You can copy my version from here:

/home/dcase/roses/u-bm680 on puma.

comment:20 Changed 3 months ago by xd904476

I had an extra DATAM declared somewhere.
It is running the coupled task now, thanks!!

dani

comment:21 Changed 2 months ago by xd904476

Hi, another hiccup…
The first coupled task runs successfully, while all the others fail because they can't find the partial sum files:
/work/n02/n02/dflocco/cylc-run/u-bm680/share/data/History_Data/ensemble_NN/bm680a_s1b, or s1a etc.
Do you think this may be because the suite stopped running for lack of space yesterday and I restarted it, or is something missing in how the suite is told where to write the partial sum files?
Do you think it's worth trying to rerun it perhaps with 2 or 3 ensembles?

thanks,
dani

comment:22 Changed 2 months ago by dcase

If you can't see files, and you've hit file space issues and had to restart, then this is the first place to look for errors. Hopefully if you clean up and restart it'll work.

Are you suggesting that 5 ensembles is creating too much data? If so you could go down to 2 (which I think is what I did), but you'll have to scale this up eventually so be as clean as possible, or perhaps you'll have to consider running shorter jobs?? I killed my jobs after a few minutes to save resources, so won't have made the same files as you.

comment:23 Changed 2 months ago by xd904476

Thanks Dave. I now have an increased quota. Trying with 2 ensemble members.

comment:24 Changed 2 months ago by xd904476

Hi Dave, me again… the coupled task finally runs, but I am now getting issues in postproc and pptransfer. I don't think I have the paths quite right for postproc, which now needs to go into the coupled ensemble directories rather than collecting everything together.
I have changed the pptransfer and postproc DATAM paths in suite.rc. I have then reloaded the suite but I keep getting unbound variable errors.
Could you guide me some more pls?
dani

comment:25 Changed 2 months ago by dcase

At the moment it looks as though you are splitting postproc into components (atmos, cice, nemo) but you are not running a separate postproc per ensemble. Then when you look for a parameter to denote the ensemble, for example, one may not be defined, as it doesn't make sense there. If you want to run postproc once for all the ensembles you'll have to set it up to do this.

One thing that you might do is run postproc for each ensemble separately, so in your log you would see postproc_atmos0, postproc_nemo1 etc. This may be easier.

I'm sorry if I've misunderstood your suite (this is quite possible), but it looks as though changing the logic of the original suite.rc file so that it runs postproc for each ensemble would be the thing to try.
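
To picture what one postproc per ensemble member would look like in the log, the short Python sketch below lists the task names that might appear. It assumes the naming pattern seen in the job path earlier in this ticket (coupled_ensemble0) carries over to postproc; the function name is hypothetical, and the real names depend on how the tasks and the <ensemble> parameter are defined in suite.rc.

```python
def per_ensemble_tasks(components, n_members):
    """List one postproc task name per component and ensemble member."""
    return [
        f"postproc_{component}_ensemble{member}"
        for member in range(n_members)
        for component in components
    ]


# Two ensemble members and the three postproc components mentioned above.
names = per_ensemble_tasks(["atmos", "nemo", "cice"], 2)
for name in names:
    print(name)
```

With 2 members and 3 components this gives 6 tasks, so the number of postproc jobs grows with the ensemble size.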

comment:26 Changed 2 months ago by xd904476

Hi Dave,
I have tried to merge the ensemble changes from suite u-bi318 into mine, but I am missing some declaration somewhere and I can't find it.
The suite compiles but it stops at the fcm_make_pp and fcm_make_um tasks with the usual declaration error. Could you shed some light pls?
dani

comment:27 Changed 2 months ago by dcase

I see that you're trying to combine this suite with yours, but the first thing to note is that you don't need to do things like define ensemble variables for a second time. You already have <ensemble> defined in a [[parameters]] section, so making another variable called <ens> will not help and may cause trouble.

As for your make tasks failing: one of these seems to have succeeded, and the other has an unbound variable, but you don't need ensemble variables for the make. You only need to build the code once, no matter how many ensembles you have.

I would go back a step and only implement the changes to the graph and the environment. What I mean by this is to not include the stuff at the top of the suite.rc with these extra ENS variables, but do include pptransfer<ensemble> in your graph, and also the [[pptransfer<ensemble>]] directive section (and the same for postproc). Your version of this suite previously ran the make stuff correctly, so rolling back will get it to work again and these changes will not break the things above them.

comment:28 Changed 2 months ago by xd904476

Hi Dave,
I have the suite only working up to fcm_make_pptransfer at the moment. I am stuck when merging the lines from 110 onwards in u-bm494 with the pptransfer and postproc. The two suites are very different at that point and I can't get the fcm_make_pp and fcm_make_pptransfer tasks to work.
Could you help pls?

thanks,
dani

comment:29 Changed 2 months ago by dcase

For your immediate point, you have the same "unbound variable" error in your fcm_make_pptransfer job.err file. This points you to line 121 in the job file, which uses a variable that you have added at line 423 of the suite.rc, in the [[PPTRANSFER_BUILD]] section. If you comment out this line, you should get past this error.

In more general terms, I understand the difficulties in combining two suites together to get more functionality, but as was discussed above, it's much easier to make edits to a suite than to copy and paste new things in. I haven't run this suite in the current state, but it does look as though a lot has been added which may not be needed.

comment:30 Changed 4 weeks ago by ros

  • Description modified (diff)
  • Platform set to ARCHER
  • UM Version set to 10.7

comment:31 Changed 4 weeks ago by ros

  • Resolution set to completed
  • Status changed from new to closed

Continued in #3063
