Opened 4 months ago

Closed 4 months ago

#3203 closed help (fixed)

FCM_MAKE_FILE: unbound variable

Reported by: ggxmy
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 10.7

Description

Continued from #3202, u-br927 is my attempt to port bj611 to Monsoon. When I tried to run it, fcm_make_um failed and I got a job.err like this:

/home/d03/myosh/.bash_profile: line 14: /home/d03/myosh/.ssh/ssh-setup: No such file or directory
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc1 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
[FAIL] file:fcm-make.cfg=source: FCM_MAKE_FILE: unbound variable
2020-02-24T17:03:12Z CRITICAL - failed/EXIT
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc1 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning

Could you please help me solve this issue?

Thanks,
Masaru

Attachments (1)

moose_archiving.jpeg (93.7 KB) - added by ggxmy 4 months ago.

Change History (24)

comment:1 Changed 4 months ago by ros

Hi Masaru,

This is all related to the setting up of the suite configuration files due to the change of platform so it's fine to leave in the same ticket so it's easier for us to see the history of what changes have been made.

I'll dig out the code to fix this from another suite. The problem is due to the different ways of running fcm_make depending on whether it's a two-step or one-step build. I'll also take a quick look to see if there is anything else obvious that will need changing.

Cheers,
Ros.

comment:2 Changed 4 months ago by ros

OK, it's easier than I thought: the logic is already there. In the suite.rc file, you just need to modify the line:

{% set SINGLE_FCMUM = ['meto_cray'] %}

to be

{% set SINGLE_FCMUM = ['meto_cray', 'monsoon'] %}
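If the same change has to be applied to several suite copies, it can be scripted. The following is only a minimal Python sketch: the helper name add_single_fcmum_site is hypothetical (it is not part of Rose or Cylc), and it assumes the list is written on a single line exactly in the form shown above.

```python
import re

# Matches the Jinja2 line: {% set SINGLE_FCMUM = [ ... ] %}
_SINGLE_FCMUM = re.compile(r"(\{%\s*set\s+SINGLE_FCMUM\s*=\s*\[)([^\]]*)(\]\s*%\})")


def add_single_fcmum_site(suite_rc_text, site):
    """Return suite_rc_text with `site` appended to the SINGLE_FCMUM list.

    Hypothetical helper for illustration only. Text without a matching
    line is returned unchanged.
    """

    def _append(match):
        entries = [e.strip() for e in match.group(2).split(",") if e.strip()]
        # Compare on the bare name so either quote style counts as present.
        names = [e.strip("'\"") for e in entries]
        if site not in names:
            entries.append("'%s'" % site)
        return match.group(1) + ", ".join(entries) + match.group(3)

    return _SINGLE_FCMUM.sub(_append, suite_rc_text, count=1)


if __name__ == "__main__":
    before = "{% set SINGLE_FCMUM = ['meto_cray'] %}"
    print(add_single_fcmum_site(before, "monsoon"))
```

Running the helper a second time leaves the line untouched, so it is safe to re-apply. For a single suite, editing suite.rc by hand is simpler.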

Cheers,
Ros

comment:3 Changed 4 months ago by ggxmy

Hi Ros,

That's great. The error has gone and the suite has now been running for a few minutes. I'll report how it goes.

Masaru

comment:4 Changed 4 months ago by ggxmy

Monsoon is back online now. Unsurprisingly, the run has failed.

coupled/NN/job.err has these lines:

_pmiu_daemon(SIGCHLD): [NID 03828] [c5-1c2s13n0] [Tue Feb 25 01:21:56 2020] PE RANK 463 exit signal Aborted
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
[NID 03828] 2020-02-25 01:21:56 Apid 98824552: initiated application termination
[FAIL] run_model # return-code=137
2020-02-25T01:22:01Z CRITICAL - failed/EXIT

This appears to be a similar problem to #3053, but I can't follow the details there much further. What can I do?

Thanks,
Masaru

comment:5 Changed 4 months ago by ggxmy

I found lines suggesting errors at the end of ocean.output (in /projects/ukca-leeds/myosh/cylc-run/u-br927/work/20150101T0000Z/coupled/ ); they are copied below.

The Antarctica iceshelf melting seems to have got invalid values. Has the zonal velocity (U) also got an unphysical value? Did this cause the failure, and how can it be avoided?

output.abort_00??.nc cannot be read with ncdump at the moment because postproc is down.

fld_read: var sofwfisf kt =        3 (   0.0781 days), Y/M/D = 2015/01/01, records b/a:   0001/  0001 (days -180.0000/ 180.0000)
 it_offset is :  0
 Greenland iceshelf melting climatology (kg/s) :  0.
 Greenland iceshelf melting adjusted value (kg/s) :  0.
 Antarctica iceshelf melting climatology (kg/s) :  -39542110.62454956
 Antarctica iceshelf melting adjusted value (kg/s) :  -31311500.000000004

 ===>>> : E R R O R
         ===========

  stpctl: the zonal velocity is larger than 20 m/s
  ======
 kt=     3 max abs(U):   5190.    , i j k:   171   82    1

           output of last fields in numwso

 ===>>> : E R R O R
         ===========

 step: indic < 0

 dia_wri_state : single instantaneous ocean state
 ~~~~~~~~~~~~~   and forcing fields file created
                 and named :output.abort                    .nc

 ===>>> : E R R O R
         ===========

 MPPSTOP
 NEMO abort from dia_wri_state
 E R R O R: Calling mppstop
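Since ocean.output can be long, the error banners can be pulled out with a short script. This is only a sketch in Python: the function name nemo_error_messages is hypothetical, and the banner format is assumed from the excerpt above rather than taken from NEMO documentation.

```python
SAMPLE = """\
 ===>>> : E R R O R
         ===========

  stpctl: the zonal velocity is larger than 20 m/s
  ======
 kt=     3 max abs(U):   5190.    , i j k:   171   82    1

 ===>>> : E R R O R
         ===========

 step: indic < 0

 ===>>> : E R R O R
         ===========

 MPPSTOP
 NEMO abort from dia_wri_state
 E R R O R: Calling mppstop
"""


def nemo_error_messages(text):
    """Return the first message line that follows each 'E R R O R' banner."""
    lines = text.splitlines()
    messages = []
    for i, line in enumerate(lines):
        if "===>>> : E R R O R" not in line:
            continue
        for follow in lines[i + 1:]:
            stripped = follow.strip()
            # Skip blank lines and the '=====' / '~~~~~' underlines.
            if stripped and not set(stripped) <= set("=~"):
                messages.append(stripped)
                break
    return messages


if __name__ == "__main__":
    for message in nemo_error_messages(SAMPLE):
        print(message)
```

On the excerpt above this lists the stpctl velocity check first, followed by the "step: indic < 0" and MPPSTOP messages, so the first entry is the one to investigate.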

Thanks,
M

comment:6 Changed 4 months ago by grenville

Masaru

Did this job work on the MO machine?

"the zonal velocity is larger than 20 m/s" is a NEMO catch-all error.

Grenville

comment:7 Changed 4 months ago by ggxmy

Hi Grenville,

I heard that Rob Chadwick had run this suite (bj611) on the Met Office Cray. I tried running the suite with reconfiguration turned off, as suggested by Martin Andrews at the Met Office, but the result was exactly the same.

Thanks,
Masaru

comment:8 Changed 4 months ago by ggxmy

Martin Andrews kindly checked the suite by copying it and porting it back to the Met Office Cray. He said it runs fine there. So there must be an issue with Monsoon, or with something in the changes made when porting to Monsoon.

The changeset between my suite, which doesn't run on Monsoon, and Martin's, which does run on the Met Office Cray, is here:

https://code.metoffice.gov.uk/trac/roses-u/changeset?reponame=&new=150357%40b%2Fs%2F1%2F6%2F0&old=150344%40b%2Fs%2F1%2F6%2F0

Could you please suggest what I can try?

Masaru

comment:9 Changed 4 months ago by ros

Hi Masaru,

I've copied Martin's suite u-bs160 over from the Met Office including the input files and it has run the first cycle 20150101 successfully on Monsoon.

The suite working copy is at: ~rhatcher/roses/u-bs160
And the input files I copied over are at: /projects/umadmin/rhatcher/u-bs160_test

Hopefully this will help you to work out what the difference is.

Cheers,
Ros.

comment:10 Changed 4 months ago by ggxmy

Thank you so much, Ros. And it runs on Monsoon? Unbelievable!!!

My suite and your working copy appear to be almost identical. The only differences were USE_DEFAULT_ACCOUNT and SUBPROJECT.

1. Changed USE_DEFAULT_ACCOUNT only:
USE_DEFAULT_ACCOUNT=false, ACCOUNT_USR='climate' ⇐ changed
SUBPROJECT=other, SUBPROJECT_OTHER=asci ⇐ unchanged

I got submission failures.

2. Changed SUBPROJECT only:
USE_DEFAULT_ACCOUNT=true ⇐ unchanged
SUBPROJECT=cmip6 ⇐ changed

I got these messages:

[FAIL] Could not import widget: Could not retrieve class stash.StashCodeChooserValueWidgetv1
[FAIL] Could not import widget: 'NoneType' object is not callable

but otherwise exactly the same failure as before.

3. Changed both:

I got submission failures.

So this is probably a problem specific to me? Can you see anything strange in my home directory ( /home/d03/myosh/ )?

Masaru

comment:11 Changed 4 months ago by ros

Hi Masaru,

The USE_DEFAULT_ACCOUNT, ACCOUNT_USR & SUBPROJECT settings are irrelevant as they are for HPC accounting purposes and don't affect how the model runs. I need to set which Monsoon project I run under, as I have access to several, hence those changes. I would expect setups (1) and (3) to fail for you.

Can you try running the suite using the input files that Martin used which are in my directory u-bs160_test?

Cheers,
Ros.

comment:12 Changed 4 months ago by ggxmy

I'm actually running the suite using those files now. I had thought you copied Martin's files, which are copies of my files. But it occurred to me that they may somehow be different.

And the suite has been running for 15 minutes now, which is longer than ever! This could be a workaround. Maybe I could continue to use those files. But do you have any idea what the problem was and what the solution might be?

Masaru

comment:13 Changed 4 months ago by ggxmy

The astart files may not be exactly the same.

$ ll /projects/ukca-leeds/myosh/dumps/bg466a.da20150101_00
-rw-r--r-- 1 myosh ukca-leeds 7012302848 Feb 24 18:45 /projects/ukca-leeds/myosh/dumps/bg466a.da20150101_00
$ ll /projects/umadmin/rhatcher/u-bs160_test/bg466a.da20150101_00
-rw-r--r-- 1 rhatcher umadmin 7012302848 Mar  3 09:13 /projects/umadmin/rhatcher/u-bs160_test/bg466a.da20150101_00
$ diff /projects/ukca-leeds/myosh/dumps/bg466a.da20150101_00 /projects/umadmin/rhatcher/u-bs160_test/bg466a.da20150101_00
Files /projects/ukca-leeds/myosh/dumps/bg466a.da20150101_00 and /projects/umadmin/rhatcher/u-bs160_test/bg466a.da20150101_00 differ

They have the same file size but diff says they differ. Other files seem to be identical.
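When two files have the same size but diff reports a difference, a checksum and the offset of the first differing byte help tell a damaged copy from a genuinely different file. The sketch below uses only the Python standard library. The function names are hypothetical, and it compares raw bytes only: it knows nothing about UM dump structure, so it complements rather than replaces a field-level comparison tool.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset where two files first differ, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One file ended first: they differ where the shorter one stops.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended with no difference found
            offset += len(a)
```

Both functions read in chunks, so a 7 GB dump does not have to fit in memory. Comparing each copy's checksum against a fresh retrieval from the archive would show which copy is the damaged one.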

comment:14 Changed 4 months ago by ros

Hi Masaru,

I suggest maybe you talk to Martin about exactly what he did regarding those files. I assumed too that they were just copies of your files, but everything you said was pointing to an input file issue so I wanted to double check.

Sounds like you may have found which file is the issue. I would suggest using cumf to compare the files.

Cheers,
Ros.

comment:15 Changed 4 months ago by ggxmy

Thanks, Ros,

So you copied Martin's files.

What is cumf? It doesn't seem to be available on either xcslc0 or exppostproc01 on Monsoon.

In any case, I could back up my astart and copy your astart to my directory and try again.

The run crashed. The 'coupled' task ran for about 30 minutes and then postproc failed. This is probably because my Moose account expired yesterday. I have requested its reactivation, so it should become accessible soon.

Anyway, finally some progress!

Masaru

comment:16 Changed 4 months ago by ggxmy

  • Resolution set to fixed
  • Status changed from new to closed

I copied Ros's astart file to my directory and changed all the settings that were pointing to her directory back to point to my directory. The suite ran until postproc crashed. So it looks like, even though both Martin and I downloaded the same file from the same place (MASS) and Ros copied Martin's file, my astart file was probably corrupted somehow.

The problem now is probably my Moose account.

Closing this ticket for now. Thank you very much for your help.

Masaru

comment:17 Changed 4 months ago by ggxmy

  • Resolution fixed deleted
  • Status changed from closed to reopened

I'm reopening this ticket.

Actually, even though I cannot access Moose from JASMIN, I found I can still access it from Monsoon. So I'm no longer sure that the postproc failure is due to my account issue.

I checked postproc_atmos/01/job.err and it says:

[WARN]  mkset: System error (Error=2)
mkset command-id=887586754 failed: (SSC_TASK_REJECTION) one or more tasks are rejected.
  moose:/crum/u-br927: (TSSC_PROJECT_NAME_REQUIRED) A project name must be specified.
mkset: failed (2)

         Unable to create set:moose:crum/u-br927
[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d03/myosh/cylc-run/u-br927/share/data/History_Data/br927a.pn20150111.pp moose:crum/u-br927/apn.pp
[SUBPROCESS]: Error = 2:
        put command-id=887586792 failed: (SSC_TASK_REJECTION) one or more tasks are rejected.
  /home/d03/myosh/cylc-run/u-br927/share/data/History_Data/br927a.pn20150111.pp -> moose:/crum/u-br927/apn.pp/br927a.pn20150111.pp: (TSSC_SET_DOES_NOT_EXIST) no such data set.
put: failed (2)
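The log reports more than one Moose failure, and the order matters: mkset is rejected first for lack of a project name, so the later put fails only because the set was never created. A small sketch for listing the codes in order of appearance follows. The function name is hypothetical, and the code pattern (SSC_ or TSSC_ in parentheses) is inferred only from the messages quoted here, not from Moose documentation.

```python
import re

# Parenthesised codes such as (SSC_TASK_REJECTION) or (TSSC_PROJECT_NAME_REQUIRED).
_MOOSE_CODE = re.compile(r"\((T?SSC_[A-Z_]+)\)")


def moose_error_codes(job_err_text):
    """Return the distinct Moose error codes in order of first appearance."""
    codes = []
    for code in _MOOSE_CODE.findall(job_err_text):
        if code not in codes:
            codes.append(code)
    return codes


if __name__ == "__main__":
    sample = (
        "mkset command-id=887586754 failed: (SSC_TASK_REJECTION) one or more tasks are rejected.\n"
        "  moose:/crum/u-br927: (TSSC_PROJECT_NAME_REQUIRED) A project name must be specified.\n"
        "put command-id=887586792 failed: (SSC_TASK_REJECTION) one or more tasks are rejected.\n"
        "  ... (TSSC_SET_DOES_NOT_EXIST) no such data set.\n"
    )
    print(moose_error_codes(sample))
```

The first TSSC_ code in the list is the one to fix. Here that is the missing project name.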

I set USE_DEFAULT_ACCOUNT=true, SUBPROJECT=other, SUBPROJECT_OTHER=asci, FUNDING=hccp. Aren't these OK?

For some reason the suite u-br927 never appears to be committed, but it actually is. The latest revision is 151034.

Masaru

comment:18 Changed 4 months ago by ros

Hi Masaru,

You need to specify the project in the mooproject box in the "postproc → Moose archiving" panel.

Cheers,
Ros.

comment:19 Changed 4 months ago by ggxmy

Hi Ros,

The attached image is from my vn11.1 UKESM1 suite. If I set it this way, will $MOOPROJECT get a proper value?

Thanks,
Masaru

Changed 4 months ago by ggxmy

comment:20 Changed 4 months ago by grenville

See /home/d00/hadlk/roses/u-br633 and copy Luke's setup.

comment:21 Changed 4 months ago by ggxmy

Thank you Grenville.

OK. It looks like I can add these lines

        [[[environment]]]
            MOOPROJECT = {{MONSOON_MOOSE_PROJECT}}

in [[POSTPROC_RESOURCE]] in site/monsoon.rc

MONSOON_MOOSE_PROJECT is given in rose-suite.conf in u-br633. But rose-suite.conf in u-br927 doesn't look very similar to that in u-br633. Can I simply add something like the following there?

MONSOON_ACCOUNT='other'
MONSOON_ACCOUNT_DEFAULT=false
MONSOON_ACCOUNT_OTHER='ukca-cam'
MONSOON_AINITIAL_DIR='/projects/ukca-cam/hadlk/restarts/AMIP'
MONSOON_ARCHIVE_DUPLEX=false
MONSOON_MOOSE_PROJECT='project-ukca'
MONSOON_PROJECT='ukca-cam'
MONSOON_QUEUE='normal'

Masaru

comment:22 Changed 4 months ago by ros

Hi Masaru,

u-br927 is set up differently from both u-br633 & your vn11.1 suite. If it were me, I'd try setting mooproject to project-ukca in the rose suite GUI to start with, and go for the more complicated setup only if that doesn't work.

Cheers,
Ros.

comment:23 Changed 4 months ago by ggxmy

  • Resolution set to fixed
  • Status changed from reopened to closed

Cool. I did that and rose-suite.conf now has

MONSOON_MOOSE_PROJECT='project-asci'

and postproc successfully submitted and executed!

Thank you very much for your help!
Masaru