Opened 2 months ago

Closed 4 weeks ago

#2880 closed help (fixed)

Running nesting suite from ECMWF driving data on ARCHER

Reported by: mbexgcd2
Owned by: um_support
Component: Nesting Suite
Keywords:
Cc: rosalyn.hatcher@…
Platform: ARCHER
UM Version: 11.1

Description

Hello,

I've set up suite u-bh889 on PUMA in order to run the nesting suite on ARCHER at vn11.1, using ECMWF analyses and forecast files as the driving model. However, upon submission of the job I'm noticing that a couple of dependencies seem to be missing on ARCHER, specifically:

1) It seems that the 'ec_um_recon' jobs fail due to a missing grib_api library:

? Error message: Attempting to read GRIB data, but GRIB_API ifdef not present
? in build. Please note the GRIB API is currently the only method
? for decoding GRIB

I've tried adding 'module load grib_api' at various points in the ARCHER suite-adds.rc file, but it doesn't seem to make a difference - you can see my changes at https://code.metoffice.gov.uk/trac/roses-u/changeset?reponame=&new=113612%40b%2Fh%2F8%2F8%2F9%2Ftrunk&old=113326%40b%2Fh%2F8%2F8%2F9%2Ftrunk

2) Also the 'surf_smc' task fails with the following:

PrgEnv-intel(3):ERROR:105: Unable to locate a modulefile for 'PrgEnv-intel/5.2.40'
ModuleCmd_Load.c(244):ERROR:105: Unable to locate a modulefile for 'cray-snplauncher'

I wonder if you can advise on a solution to these issues?

Many thanks,
Chris.

Change History (18)

comment:1 Changed 7 weeks ago by grenville

Chris

Please see #2739 comment 18 - doing this should fix the grib_api issue.

I'm looking at (2).

Grenville

comment:2 Changed 7 weeks ago by grenville

Chris

Not sure what's going on in here:

/home/n02/n02/mbexgcd2/cylc-run/u-bh889/share/fcm_make_surf/build/bin/surf-env-init.sh

but that's where PrgEnv-intel/5.2.40 appears to come from - there is no PrgEnv-intel/5.2.40 on ARCHER.

Where does this suite come from?

Grenville
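
One way to confirm which programming-environment versions a generated script asks for is to search it for the module name. The following is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep both do); the function name list_prgenv_refs is made up for illustration and is not part of the suite or of SURF.

```shell
#!/bin/sh
# list_prgenv_refs FILE
# Print each distinct PrgEnv-intel/<version> string found in FILE.
# (Hypothetical helper, for illustration only.)
list_prgenv_refs() {
    grep -o 'PrgEnv-intel/[0-9][0-9.]*' "$1" | sort -u
}
```

Run against the surf-env-init.sh path above, this should list the version the script requests, which can then be compared by eye with the output of `module avail PrgEnv-intel` on ARCHER.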

comment:3 Changed 7 weeks ago by mbexgcd2

Hi Grenville,

This is a copy of a suite that Stu Webster provided for me. It is the latest version of the nesting suite that works with ECMWF data as the driving model, and is the one that Stu recommended I use as the basis for my own simulations. Perhaps SURF hasn't been fully tested on ARCHER yet?

Chris.

comment:4 Changed 7 weeks ago by grenville

Hi Chris

Sounds right - I can see that other nesting suites refer to PrgEnv-intel/5.2.40, but it doesn't cause them a problem; I guess surf is not built.

Grenville

comment:5 Changed 7 weeks ago by mbexgcd2

Hi Grenville,

Can you advise on the best way forward from here? Is the build something that CMS can help with? I know that surf is working on MONSooN but we only have access to HPC time on ARCHER for this particular project.

Thanks,
Chris.

comment:6 Changed 7 weeks ago by mbexgcd2

Hi Grenville,

Thinking about this again, perhaps I will try the grib_api fix first and see if this is enough to get it through the reconfiguration. I'll report back here once I've done the test.

Thanks for the help,
Chris.

comment:7 Changed 7 weeks ago by grenville

Hi Chris

As a very poor guess, I'd change PrgEnv-intel/5.2.40 to one that ARCHER does have (the current default is PrgEnv-intel/5.2.82)

Grenville
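
A rough sketch of that substitution, written so that it prints a modified copy rather than editing the file in place. It assumes the version string appears literally in the script; the function name swap_prgenv_version is hypothetical, and files under a build directory may be regenerated by a rebuild, so this is only suitable for a quick test.

```shell
#!/bin/sh
# swap_prgenv_version FILE OLD NEW
# Print FILE to stdout with every PrgEnv-intel/OLD replaced by PrgEnv-intel/NEW.
# (Hypothetical helper, for illustration only.)
swap_prgenv_version() {
    # Escape the dots in OLD so sed matches them literally.
    old=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed "s#PrgEnv-intel/${old}#PrgEnv-intel/$3#g" "$1"
}
```

For example, `swap_prgenv_version surf-env-init.sh 5.2.40 5.2.82 > surf-env-init.sh.new` leaves the original untouched so the two files can be compared with diff before anything is replaced.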

comment:8 Changed 7 weeks ago by mbexgcd2

Hi Grenville,

I tried the grib_api fix in suite u-bh889, by following the instructions in comment 18 of #2739 as suggested. However, I get an error during fcm-make:

[FAIL] ftn -obin/um-atmos.exe o/um_main.o -Llib -lum-atmos -h omp -L/work/y07/y07/umshared/gcom/cce8.5.8/gcom6.6/archer_xc30_cce_mpp/build/lib -lgcom -h omp -L/work/y07/y07/umshared/lib/cce-8.5.8/grib_api/1.28.0/lib -lgrib_api_f90 -lgrib_api -L/work/y07/y07/umshared/shumlib/shumlib-2018.06.1/ncas-xc30-crayftn-8.5.8-craycc-8.5.8/openmp/lib -lshum_wgdos_packing -lshum_string_conv -lshum_latlon_eq_grids -lshum_horizontal_field_interp -lshum_spiral_search -lshum_constants # rc=1
[FAIL] /opt/cray/cce/8.5.8/cray-binutils/x86_64-pc-linux-gnu/bin/ld: cannot find -ljasper
[FAIL] /opt/cray/cce/8.5.8/cray-binutils/x86_64-pc-linux-gnu/bin/ld: cannot find -ljasper
[FAIL] /opt/cray/hdf5/1.10.0.1/CRAY/8.3/lib/libhdf5.a(H5PL.o): In function `H5PLopen$$CFE_id_56395c9c_a2f1556b':
[FAIL] /b/ulib/hdf5-support/rpm/BUILD/cray-hdf5-1.10.0.1-201612052137.d5c01d2b84e7c-cce1-serial/hdf5-1.10.0-patch1/src/H5PL.c:614: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
[FAIL] link 9.8 ! um-atmos.exe ← um/src/control/top_level/um_main.F90
[FAIL] ! um-atmos.exe : update task failed

Do you have any ideas on how to fix this?

Thanks,
Chris.

comment:9 Changed 7 weeks ago by mbexgcd2

Hi Grenville, please ignore my last message - I've resolved the fcm_make problem now. It was a mistake in the way I had added the GRIB_API branch to um_sources in app/fcm_make/rose-app.conf. The executable has now compiled; the reconfiguration app is in the queue so I'll let you know what happens when it starts to run.

Cheers,
Chris.

comment:10 Changed 7 weeks ago by mbexgcd2

Hi Grenville,

So the grib_api fix works and the reconfiguration tasks show as 'succeeded' in the gcylc window.

I'm still having a problem with the surf_smc task. I've done a bit more investigating and it looks like the module load errors might be a red herring, since I've noticed a different error in the job.out file:

%Script SurfScr_SoilMoistureAnalysis starting at Wed May 1 11:51:37 UTC 2019 - (20190501115100)
XALT Error: unable to find aprun
SurfProg_SMC.exe FAILED

From what I can gather through looking at past tickets, SURF has been built on ARCHER and exists at /work/n02/n02/ros/SURF/SURF31.2.0. A very similar error to mine was reported in #1961 http://cms.ncas.ac.uk/ticket/1961, although the issue then was related to the surf_ostia task, not surf_smc. In my case, the surf_ostia task runs OK, it's just the surf_smc task that reports the issue. I wonder if Ros might know more on how to fix this?

Thanks,
Chris.

comment:11 Changed 5 weeks ago by grenville

Chris

Is this still a problem? If so, please point directly to the appropriate job.out and job.err files.

Grenville

comment:12 Changed 4 weeks ago by mbexgcd2

Hi Grenville,

Yes, I'm still having problems getting the surf_smc task to run. The job.out and job.err files can be found at:

/work/n02/n02/mbexgcd2/cylc-run/u-bi997/log/job/20151121T0000Z/SUMATRA_km2p2_RA1T_um_surf_smc/05

(the aprun error is located in the job.out file)

Thanks,
Chris.

comment:13 Changed 4 weeks ago by grenville

Chris

The suite is trying to launch a parallel job in the serial queue - the pp nodes don't support aprun. I'm not sure why your ncas config file refers to the cray-snplauncher - take a look at u-ba621, which is set up to send CAP jobs (for example) to the compute nodes - which sounds like what you want to do with surf_smc

Grenville
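
A quick way to see whether a given node can launch parallel jobs at all is to check whether aprun resolves there. This is a minimal sketch using the standard `command -v` builtin; the function name has_launcher is made up for illustration.

```shell
#!/bin/sh
# has_launcher NAME
# Succeed if NAME resolves to a command in the current environment.
# (Hypothetical helper, for illustration only.)
has_launcher() {
    command -v "$1" >/dev/null 2>&1
}

if has_launcher aprun; then
    echo "aprun is available on this node"
else
    echo "aprun not found: a parallel launch will fail here"
fi
```

Placed near the top of a job script, a check like this reports the problem before the task reaches the step that calls the launcher.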

comment:14 Changed 4 weeks ago by mbexgcd2

Hi Grenville,

As far as I can tell, I think the SURF program is meant to run on the serial nodes on ARCHER - see http://cms.ncas.ac.uk/ticket/1961#comment:3. I've not been able to isolate the offending aprun command in the surf_smc scripts though.

For info, my suite points to Ros' SURF installation at /work/n02/n02/ros/SURF/SURF31.2.0/share/fcm_make_surf_xc30_x86_64_ifort_opt, which provides the surf_ostia program as well as surf_smc. The surf_ostia app runs fine for me on the serial nodes (see http://cms.ncas.ac.uk/ticket/1961#comment:4), but the aprun issue for surf_smc still remains. I wondered if Ros might know how to fix this, given that she resolved the same issue with surf_ostia as part of #1961?

Chris.

comment:15 Changed 4 weeks ago by grenville

Chris

The problem is in

/work/n02/n02/mbexgcd2/cylc-run/u-bi997/share/fcm_make_surf/build/bin/SurfScr_SoilMoistureAnalysis

where it says:

${SURF_LAUNCHER:-rose mpi-launch} $SMC_PRG_EXEC

see SurfScr_OSTIA2NWP (same directory) for the fix.

Grenville
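
For reference, the `${VAR:-default}` form in that line substitutes the default whenever the variable is unset or empty, so the script falls back to `rose mpi-launch` unless SURF_LAUNCHER holds a non-empty value. The sketch below only demonstrates that shell behaviour; the function name resolve_launcher and the value "env" are assumptions chosen for illustration, and this is not necessarily the fix used in SurfScr_OSTIA2NWP.

```shell
#!/bin/sh
# resolve_launcher
# Print what ${SURF_LAUNCHER:-rose mpi-launch} expands to in the current environment.
# (Hypothetical helper, for illustration only.)
resolve_launcher() {
    printf '%s\n' "${SURF_LAUNCHER:-rose mpi-launch}"
}

unset SURF_LAUNCHER
resolve_launcher    # unset: falls back to the default

SURF_LAUNCHER=""
resolve_launcher    # empty: ':-' still falls back to the default

SURF_LAUNCHER="env"
resolve_launcher    # set and non-empty: the value is used as given
```

Note that setting SURF_LAUNCHER to an empty string does not disable the launcher with this form of expansion; a non-empty value is needed to override the default.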

comment:16 Changed 4 weeks ago by mbexgcd2

Thanks Grenville. I think the SurfScr_SoilMoistureAnalysis file gets copied across from the central version at /work/n02/n02/ros/SURF/SURF31.2.0/share/fcm_make_surf_xc30_x86_64_ifort_opt/build/bin, so if it's possible to implement the fix in Ros' version that would be great.

As a temporary workaround, I've created my own copy of the SURF installation at /work/n02/n02/mbexgcd2/SURF/SURF31.2.0 and I've updated my suite to point to this instead, so that I can implement the fix locally.

Chris.

comment:17 Changed 4 weeks ago by grenville

Chris

Fixed in Ros' version.

Grenville

comment:18 Changed 4 weeks ago by mbexgcd2

  • Resolution set to fixed
  • Status changed from new to closed

Brilliant, thanks Grenville! I can confirm that the surf_smc task is working now on ARCHER, so I'll go ahead and close the ticket as resolved.
