Opened 3 years ago

Closed 3 years ago

#1896 closed help (fixed)

RosePopenError / Jinja2Error on job submission

Reported by: marcus Owned by: ros
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 10.4

Description

Hi, this is my first attempt to submit a 10.4 suite on ARCHER using Rose. When submitting suite u-ae213 (which is a copy of u-ad575) I am getting the following error message:

RosePopenError: cylc validate -v --strict u-ae213 # return-code=1, stderr=Jinja2Error:
File "/usr/local/python/lib/python2.6/site-packages/Jinja2-2.7.3-py2.6.egg/jinja2/loaders.py", line 178, in get_source
raise TemplateNotFound(template)
TemplateNotFound: site/archer.rc

What do I need to do to fix this please?

CYLC_VERSION=6.91
ROSE_VERSION=2016.03.0

Change History (24)

comment:1 Changed 3 years ago by marcus

  • priority changed from normal to high

comment:2 Changed 3 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Marcus,

This suite has not been set up to run on Archer and thus doesn't have any of the Archer configuration files. Is this the exact UM configuration that you wish to run or were you just taking this suite as one to try out? We do have a few other standard GA7.0 (UM10.4) suites available on Archer.

If this is the configuration you require, I can port this suite over for you. We haven't quite yet got a set of definitive instructions on how to do this.

Regards,
Ros.

comment:3 Changed 3 years ago by ros

Hi Marcus,

Apologies, this suite has been ported to ARCHER but the main archer.rc config file hasn't been committed to the repository. You can either grab it from ~grenville/roses/u-ad575/site/archer.rc on PUMA and copy it into the site directory of your suite working copy or wait until Grenville can check it in next week and then update your suite.
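
The copy step described above can be sketched as a small shell helper. This is only an illustration: the function name install_site_file is hypothetical, and the example paths are the ones quoted in this ticket (they assume the working copy lives under ~/roses on PUMA).

```shell
# Hypothetical helper: copy a site file into the site/ directory
# of a suite working copy, creating site/ if it does not exist yet.
install_site_file() {
    src=$1        # path to the site file to copy
    suite_dir=$2  # root of the suite working copy
    if [ ! -f "$src" ]; then
        echo "missing source file: $src" >&2
        return 1
    fi
    mkdir -p "$suite_dir/site" && cp "$src" "$suite_dir/site/"
}

# Example with the paths from this ticket (run on PUMA):
# install_site_file ~grenville/roses/u-ad575/site/archer.rc ~/roses/u-ae213
```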

Regards,
Ros.

comment:4 Changed 3 years ago by marcus

Hi Ros,

Thanks for your help on this. I have copied the file over and tried to commit the change but that failed (error message below). Should I not be doing this?

marcus@puma:/home/marcus/roses/u-ae213> fcm commit
[info] xterm -e vi: starting commit message editor...
Change summary:
--------------------------------------------------------------------------------
[Root   : https://code.metoffice.gov.uk/svn/roses-u]
[Project: a/e/2/1/3]
[Branch : trunk]
[Sub-dir: ]

M       app/fcm_make_um/rose-app.conf
M       rose-suite.conf
A       site/archer.rc
--------------------------------------------------------------------------------
Commit message is as follows:
--------------------------------------------------------------------------------
Added config file archer.rc from ~grenville/roses/u-ad575/site/archer.rc
(ticket #1896)
--------------------------------------------------------------------------------

*** WARNING: YOU ARE COMMITTING TO THE TRUNK.
*** Please ensure that your change conforms to your project's working practices.

Would you like to commit this change?
Enter "y" or "n" (or just press <return> for "n"): y
Sending        app/fcm_make_um/rose-app.conf
Sending        rose-suite.conf
Adding         site/archer.rc
Transmitting file data ...svn: E165001: Commit failed (details follow):
svn: E165001: Commit blocked by pre-commit hook (exit code 1) with output:
2016-06-20T10:38:11Z+ 14322-bae by MarcusKoehler
U   a/e/2/1/3/trunk/app/fcm_make_um/rose-app.conf
U   a/e/2/1/3/trunk/rose-suite.conf
A   a/e/2/1/3/trunk/site/archer.rc
[FAIL] PERMISSION DENIED: A   a/e/2/1/3/trunk/site/archer.rc
[FAIL] PERMISSION DENIED: U   a/e/2/1/3/trunk/app/fcm_make_um/rose-app.conf
[FAIL] PERMISSION DENIED: U   a/e/2/1/3/trunk/rose-suite.conf
[FAIL] svn commit -F /tmp/oTKWteqDPk # rc=1

I then tried to submit without prior committing and I got the following error:

RosePopenError: bash -ec H=$(rose\ host-select\ archer);\ echo\ $H # return-code=1, stderr=
[WARN] login5.archer.ac.uk: (ssh failed)
[WARN] login8.archer.ac.uk: (ssh failed)
[WARN] login3.archer.ac.uk: (ssh failed)
[WARN] login2.archer.ac.uk: (ssh failed)
[WARN] login.archer.ac.uk: (ssh failed)
[WARN] login6.archer.ac.uk: (ssh failed)
[WARN] login1.archer.ac.uk: (ssh failed)
[WARN] login7.archer.ac.uk: (ssh failed)
[WARN] login4.archer.ac.uk: (ssh failed)
[FAIL] No hosts selected

Many thanks,
Marcus

comment:5 Changed 3 years ago by ros

Hi Marcus,

It's not allowing you to commit your changes because somehow the suite was created with the author "MarcusKoehler". The pre-commit hook checks that it is the owner who is committing to the suite trunk, and your MOSRS username is "marcuskoehler", so it won't allow it. I don't think it is possible to fix this, so if you want to commit any changes to this suite you will need to start again and re-run rosie copy. You will still need to copy in the site/archer.rc file from Grenville's directory, as he currently can't check it in due to a problem with his MOSRS setup. :-(

You can submit the suite without checking in changes. The problem, I believe, is that your ssh-agent isn't set up correctly to allow you to log in to ARCHER without a prompt for a password or passphrase. You may just need to run ssh-add.
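
A quick way to see what state the agent is in before running ssh-add is sketched below. The function name agent_status is hypothetical. It relies only on the documented exit status of ssh-add -l (0 when keys are loaded, 1 when the agent holds no identities, 2 when no agent can be contacted).

```shell
# Hypothetical helper: report whether the running ssh-agent holds any keys.
agent_status() {
    rc=0
    ssh-add -l >/dev/null 2>&1 || rc=$?
    case $rc in
        0) echo "keys-loaded" ;;   # at least one identity is available
        1) echo "no-keys" ;;       # agent is running but empty: run ssh-add
        *) echo "no-agent" ;;      # agent not reachable (or ssh-add missing)
    esac
}

# Typical follow-up on the submitting host, then a non-interactive login test:
# ssh-add
# ssh -o BatchMode=yes login.archer.ac.uk true
```

If the BatchMode login still prompts or fails, rose host-select will most likely report "(ssh failed)" for the ARCHER login nodes again.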

Cheers,
Ros.

comment:6 Changed 3 years ago by marcus

Hi Ros,

I have copied the suite again from u-ad575 and my copy is now u-ae313. I changed the pre-completed owner name from MarcusKoehler to marcuskoehler but I still cannot commit the changes: it again assumes my username is MarcusKoehler. Where do I need to change this?

Best,
Marcus

comment:7 Changed 3 years ago by marcus

Hi Ros, an update: I found the faulty username entry in a file in ~/.subversion/ and corrected it. The commit works now. After setting up the ssh-agent once more I was also able to submit the job. All fine as far as I can see. Let's hope it compiles and runs.
Many thanks,
Marcus
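
For anyone hitting the same problem, a recursive search is one way to locate such an entry. This is a sketch only: the function name find_username_entries is hypothetical, and the ticket does not say which file under ~/.subversion/ held the entry, so the whole directory is searched.

```shell
# Hypothetical helper: list every line under a directory that contains
# a given username spelling (the match is case-sensitive).
find_username_entries() {
    name=$1                        # spelling to look for, e.g. MarcusKoehler
    dir=${2:-"$HOME/.subversion"}  # directory to search
    grep -rn -- "$name" "$dir"
}

# Example:
# find_username_entries MarcusKoehler
```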

comment:8 Changed 3 years ago by marcus

Hi Ros,

The suite compiled successfully but crashed at run time with the following error:

[WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control
[WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control_custom_files
[WARN] file:IDEALISE: skip missing optional source: namelist:idealise
[WARN] file:RECONA: skip missing optional source: namelist:trans(:)
/bin/sh: um-atmos: command not found
[FAIL] um-atmos # return-code=127
Received signal ERR
cylc (scheduler - 2016-06-21T09:06:45Z): CRITICAL Task job script received signal ERR at 2016-06-21T09:06:45Z
cylc (scheduler - 2016-06-21T09:06:45Z): CRITICAL atmos_main.19810901T0000Z failed at 2016-06-21T09:06:45Z

I'm not sure what to do about this.

Many thanks,
Marcus

comment:9 Changed 3 years ago by ros

Hi Marcus,

It doesn't look like you have compiled the model executable yet. There is no build directory or executable under /work/n02/n02/marcus/cylc-run/u-ae313/share.
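
A check along these lines can be run on ARCHER to confirm whether anything has been built. It is a minimal sketch with a hypothetical function name; it simply lists executable files below the suite's share directory and does not assume any particular build layout.

```shell
# Hypothetical helper: list executable regular files below a share directory.
list_built_executables() {
    share_dir=$1
    if [ ! -d "$share_dir" ]; then
        echo "no such directory: $share_dir" >&2
        return 1
    fi
    find "$share_dir" -type f -perm -u+x
}

# Example with the path from this ticket:
# list_built_executables /work/n02/n02/marcus/cylc-run/u-ae313/share
```

Empty output means no executables were found, which is consistent with a "command not found" failure at run time.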

The suite is currently set up to "Run the Model" only. (See Suite conf → Build and run switches)

Cheers,
Ros.

comment:10 Changed 3 years ago by marcus

Hi Ros,

Thank you, I have now enabled all build and run switches.

When editing the suite I get the following error message:

rose.metadata_check.MetadataChecker: issues: 1
    jinja2:suite.rc=HPC_QUEUE=None=None
        No metadata entry found

I've checked the file ~/roses/u-ae313/meta/rose-meta.conf but I cannot find this line or anything in there that would be obvious for me to correct.

Many thanks,
Marcus

Last edited 3 years ago by marcus

comment:11 Changed 3 years ago by ros

Hi Marcus,

You can ignore this. I think something changed in the metadata somewhere and it's caused a warning - I get it in a lot of my suites. I will investigate, however, HPC_QUEUE relates to the Met Office Cray and so it won't affect running on ARCHER.

Cheers,
Ros.

Last edited 3 years ago by ros

comment:12 Changed 3 years ago by marcus

Hi Ros,

Unfortunately the job did not compile. It seems that the netcdf module was not found.

Log output job.err is here:

[FAIL] ftn -oo/emiss_io_mod.o -c -I./include -s default64 -e m -J ./include -I/work/n02/n02/hum/gcom/cce8.4.1/gcom5.4/archer_xc30_cce_mpp/build/include -O2 -Ovector1 -hfp0 -hflex_mp=strict -h omp /work/n02/n02/marcus/cylc-run/u-ae313/share/fcm_make_um/preprocess-atmos/src/um/src/atmosphere/UKCA/emiss_io_mod.F90 # rc=1
[FAIL] 
[FAIL] 
[FAIL] ftn-855 crayftn: ERROR EMISS_IO_MOD, File = ../../../../cylc-run/u-ae313/share/fcm_make_um/preprocess-atmos/src/um/src/atmosphere/UKCA/emiss_io_mod.F90, Line = 35, Column = 8 
[FAIL]   The compiler has detected errors in module "EMISS_IO_MOD".  No module information file will be created for this module.
[FAIL] 
[FAIL] 
[FAIL] ftn-292 crayftn: ERROR EMISS_IO_MOD, File = ../../../../cylc-run/u-ae313/share/fcm_make_um/preprocess-atmos/src/um/src/atmosphere/UKCA/emiss_io_mod.F90, Line = 38, Column = 5 
[FAIL]   "NETCDF" is specified as the module name on a USE statement, but the compiler cannot find it.
[FAIL] 
[FAIL] 
[FAIL] ftn-232 crayftn: ERROR EM_FOPEN, File = ../../../../cylc-run/u-ae313/share/fcm_make_um/preprocess-atmos/src/um/src/atmosphere/UKCA/emiss_io_mod.F90, Line = 179, Column = 13 
[FAIL]   IMPLICIT NONE is specified in the local scope, therefore an explicit type must be specified for function "NF90_OPEN".

This strikes me as odd as the job obviously must have compiled for Grenville?

Regards,
Marcus

comment:13 Changed 3 years ago by ros

Hi Marcus,

Is this the exact configuration of the UM that you require or are you using this purely as a test to run on ARCHER? If the latter then we have other suites that have been fully ported. The reason I ask is that having spoken to Grenville, this is a suite he was testing out and so it isn't guaranteed to just run for everyone. I can take a look at it, but wanted to make sure this is the exact configuration you require first.

Regards,
Ros.

Last edited 3 years ago by ros

comment:14 Changed 3 years ago by marcus

Hi Ros,

I picked up this suite on advice from Luke Abraham. I was told it is the exact copy of the suite I am running on Monsoon.

On Monsoon I use a copy of u-ad048 which I am testing there and setting up for a longer production run. The run will later have to be carried out on Archer. So I need equivalent suites with the same set-up on Monsoon and Archer.

If there is another 10.4 GA7.0 AMIP UKCA StratTrop suite that is more suitable and which has the same settings as my Monsoon suite then I'd be happy to switch, of course. Can I also use u-ad048 on Archer?

Many thanks,
Marcus

comment:15 Changed 3 years ago by ros

Hi Marcus,

Ok. Leave it with me and I'll get it into a state where you can take it and run.

Cheers,
Ros.

comment:16 Changed 3 years ago by marcus

Thank you very much!
Marcus

comment:17 Changed 3 years ago by ros

Hi Marcus,

Please take a copy of my suite u-ae374, this is a copy of u-ad048 with the ARCHER options/settings added.

To get it compiling/running you should just need to change the following in:

suite conf → Machine Options - account code and queue

suite conf → Run initialisation & cycling - run length, cycling frequency & wallclock

Cheers,
Ros.

comment:18 Changed 3 years ago by marcus

Hi Ros,

Thank you for setting up the suite for me. I've applied the changes and it's now submitted in the queue.
Is 'standard' the only queue that is open to me or can I also run brief test runs on the 'short' queue? A brief 20 minute test didn't submit successfully on 'short' but it was fine once I switched to 'standard'. Is there any development queue at all that I could use?

Many thanks,
Marcus

comment:19 Changed 3 years ago by ros

Hi Marcus,

The "short" queue only allows jobs using 8 nodes or fewer for a maximum of 20 minutes. You will need to lower the processor count to fit within that limit. For testing, the easy way to do that for this job is to change the number of OpenMP threads for the atmosphere to 1. Once you're happy it's working you can then change back to 2 threads.
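
The effect of the thread count on the node count can be worked through with a little arithmetic. The sketch below uses a hypothetical function name and a hypothetical figure of 192 MPI tasks, since the ticket does not give the suite's actual decomposition. The 24 cores per node is the ARCHER compute node size and is not stated in this ticket. Any extra processes, such as I/O servers, are ignored.

```shell
# nodes = ceil(mpi_tasks * omp_threads / cores_per_node)
nodes_needed() {
    tasks=$1            # number of MPI tasks (hypothetical value below)
    threads=$2          # OpenMP threads per task
    cores_per_node=$3   # 24 assumed for an ARCHER compute node
    echo $(( (tasks * threads + cores_per_node - 1) / cores_per_node ))
}

nodes_needed 192 2 24   # prints 16: too large for the "short" queue
nodes_needed 192 1 24   # prints 8: fits within the 8-node limit
```

With those assumed numbers, dropping from 2 threads to 1 halves the request and brings it within the limit.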

Regards,
Ros.

comment:20 Changed 3 years ago by marcus

The suite is now running okay. Thanks, Ros, for all your help.
Best regards,
Marcus

comment:21 Changed 3 years ago by marcus

Hi Ros,

One last question (before closing the ticket): Where does the model output get written to?

In "rose config-edit > suite conf > Build and Run" I activated Post Processing = true (also for wallclock times and output logs).

As I usually do when running on MONSooN, I've been looking for it on MASS (on JASMIN) using "moo ls -l moose:/crum/u-ae386/*" but this directory does not seem to exist.

Many thanks,
Marcus

comment:22 Changed 3 years ago by ros

Hi Marcus,

You can only archive/write to MASS from MONSooN.

The output from this run is on ARCHER under /work/n02/n02/marcus/cylc-run/u-ae386/share/data/History_Data.

We are currently in the process of developing the post processing app to allow archiving to the RDF and then optional transfer to JASMIN.

Regards,
Ros.

comment:23 Changed 3 years ago by marcus

Many thanks, Ros. This is all clear now.
(ticket can be closed)

Best wishes,
Marcus

comment:24 Changed 3 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed