Opened 5 months ago

Last modified 5 months ago

#3540 assigned error

JULES compilation error on jasmin

Reported by: jmackay Owned by: jules_support
Component: JULES Keywords: jules, fcm
Cc: Platform: JASMIN
UM Version:

Description

Hello all,

When I run my rose suite (u-ce887) on JASMIN, it fails during the fcm step (compiling JULES v6.0). I'm using the "jasmin-lotus-intel" platform. My suspicion is that I don't have my suite.rc file set up correctly, but I'm not certain about this. I've attached the job.err file from fcm.

Any ideas what's going on?

Thanks all!

Attachments (2)

job.err (17.2 KB) - added by jmackay 5 months ago.
fcm error log file
suite.rc (2.3 KB) - added by jmackay 5 months ago.
suite.rc file


Change History (8)

Changed 5 months ago by jmackay

fcm error log file

comment:1 Changed 5 months ago by jmackay

  • Owner set to <DEFAULT>
  • Status changed from new to assigned

comment:2 Changed 5 months ago by jmackay

  • Owner changed from <DEFAULT> to jules_support

comment:3 Changed 5 months ago by dcase

Jonathan,

You're using quite old libraries (Intel 14), and there are newer ones installed in the group workspace.

If you put something like this in your suite.rc, you should be able to use them.

        env-script = """
            eval $(rose task-env)
            export PATH=/apps/jasmin/metomi/bin:$PATH
            module load intel/19.0.0
            module load contrib/gnu/gcc/7.3.0
            module load eb/OpenMPI/intel/3.1.1
            module list 2>&1
            export NETCDF_FORTRAN_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
            export NETCDF_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
            export HDF5_LIBDIR=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/lib
            export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_LIBDIR
            env | grep LD_LIBRARY_PATH
            """

        [[[job]]]
            batch system = slurm
        [[[directives]]]
            --partition = par-multi
            --constraint="ivybridge128G|skylake348G|broadwell256G"
        [[[environment]]]
            ROSE_LAUNCHER = mpirun
            NETCDF_FORTRAN_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
            NETCDF_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/

Note: you should also set this environment up for the runtime tasks (as well as for the build), since the executable still needs to link to the libraries when it runs.
The constraint is there because JASMIN has a few AMD nodes, and these don't like the executable that you get by building on Intel.

Hope this is useful,

Dave

comment:4 Changed 5 months ago by jmackay

Thanks Dave,

I'm now able to compile JULES, thanks!

However, it now fails when submitting the JULES job. In the cylc gui it says "submit-failed". The only log file I can find is the job-activity.log and it contains the following:

[jobs-submit cmd] cylc jobs-submit --utc-mode -- /home/users/jmac87/cylc-run/u-ce887/log/job 1/jules/01
[jobs-submit ret_code] 255
[jobs-submit out] 2021-05-20T10:58:20Z|1/jules/01|255|None
2021-05-20T10:58:20Z [STDERR] sbatch: invalid option -- '='
2021-05-20T10:58:20Z [STDERR] Try "sbatch --help" for more information
[(('event-mail', 'submission failed'), 1) ret_code] 0

So it looks like the actual SLURM submission is failing. I guess this boils down to the suite.rc file again, so I've attached the latest version.

Many thanks for your help with this!

Jon

Changed 5 months ago by jmackay

suite.rc file

comment:5 Changed 5 months ago by dcase

When JASMIN switched from LSF to SLURM, they put some documentation online to help people convert:

https://help.jasmin.ac.uk/article/4891-lsf-to-slurm-quick-reference
To start off, I'd put something like this:

        [[[job]]]
            batch system = slurm
        [[[directives]]]
            --partition = par-multi
            --constraint="ivybridge128G|skylake348G|broadwell256G"
            --ntasks = {{ some no of tasks }}

and get extras from the documentation. Again, the constraint is because of the heterogeneous architecture on JASMIN, and I'd keep the environment that you built with so you can link to the NETCDF at runtime where needed.

comment:6 Changed 5 months ago by jmackay

Yes, you hit the nail on the head, thanks for this. It appears I had some old LSF options in my suite.rc file (e.g. -W when I should have been using --time). I've updated these and can confirm that the full rose suite (including JULES) now runs successfully.

Many thanks

Jon
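
For reference, the kind of change described in comment:6 might look roughly like the sketch below. It assumes the leftover LSF directive was a one-hour wall-clock limit; that value and the exact layout are hypothetical, since the contents of the attached suite.rc are not reproduced in this ticket. The partition and constraint lines are copied from comment:5.

```
        [[[job]]]
            batch system = slurm
        [[[directives]]]
            --partition = par-multi
            --constraint="ivybridge128G|skylake348G|broadwell256G"
            # Old LSF-style wall-clock limit (hypothetical value), removed:
            #   -W = 01:00
            # SLURM equivalent of one hour (hypothetical value):
            --time = 01:00:00
```

The value needs converting as well as the option name: LSF's -W takes [hours:]minutes, whereas a bare 01:00 passed to SLURM's --time would be read as minutes:seconds, i.e. one minute.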
