Opened 6 months ago

Closed 3 months ago

#3376 closed help (fixed)

building JULES

Reported by: yaogao
Owned by: jules_support
Component: JULES
Keywords: JULES, JASMIN
Cc:
Platform: JASMIN
UM Version:

Description

Hi Patrick,

Sorry to disturb you.

I am a JULES user based in Finland, and I run JULES on JASMIN. I can run trunk versions, e.g. vn5.8, but the build step fails when I run a branch version, e.g. vn5.4_permafrost or vn5.8_permafrost. Eight months ago, vn5.4_permafrost worked, so could this be related to a JASMIN setting? The job.err content and my rose-suite.conf file are below. I would really appreciate your help. By the way, is there a particular person or place I should go to for help?

Thank you very much,

Yao

I log in to JASMIN with:
ssh -A -X ygao003@…
ssh -Y jasmin-cylc.ceda.ac.uk

Error message:

ssh -XA USERNAME@… :
Environment variables set for netCDF Fortran bindings in

/apps/libs/netCDF/intel14/fortran/4.2/

You will also need to link your code to a compatible netCDF C library in

/apps/libs/netCDF/intel14/4.3.2/

[FAIL] mpif90 -obin/jules.exe o/jules.o -L/tmp/HsiANiI7V7 -ljules -L/apps/libs/netCDF/intel14/4.3.2/lib -L/apps/libs/netCDF/intel14/fortran/4.2/lib -L/group_workspaces/jasmin2/jules/admin/curl/curl-lotus-parallel-intel/lib/ -L/apps/libs/PHDF5/intel14/1.8.12/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lcurl -heap-arrays -fp-model precise -traceback # rc=1
[FAIL] ifort: command line warning #10212: -fp-model precise evaluates in source precision with Fortran.
[FAIL] ld: cannot find -lcurl
[FAIL] link 0.9 ! jules.exe ← jules/src/control/standalone/jules.F90
[FAIL] ! jules.exe : update task failed

[FAIL] fcm make -f /work/scratch-pw/ygao003/cylc-run/u-an231/work/1/fcm_make/fcm-make.cfg -C /home/users/ygao003/cylc-run/u-an231/share/fcm_make -j 4 # return-code=255
2020-09-17T20:19:53Z CRITICAL - failed/EXIT

rose-suite.conf

[jinja2:suite.rc]
ANCIL_PATH_JAS='/home/users/ygao003/roses/u-an231/ancillaries_v2'
ANCIL_PATH_METO='/data/cr1/hadea/PAGE21_sites/ancillaries/'
BUILD=true
DRIVE_PATH_JAS='/home/users/ygao003/driving_v2'
DRIVE_PATH_METO='/data/cr1/hadea/PAGE21_sites/final/'
FIRST_RUN=true
HOST='spice'
INCLUDE_C20C=true
INCLUDE_RCP4P5=false
INCLUDE_RCP8P5=false
INCLUDE_SPINUP=true
JULES_FCM='fcm:jules.x_br/dev/eleanorburke/vn5.4_permafrost'
JULES_REVISION='14509'
LOCATION='jasmin'
NSPINUPS=25
OUTPUT_PATH_JAS='/home/users/ygao003/OUTPUT/'
OUTPUT_PATH_METO='/home/users/ygao003/OUTPUT/'
PARAM_FILE='sites.dat.all'
PRECIP_DRIVING=false
PRESCRIBED_ISUNFROZEN=false

Change History (25)

comment:1 Changed 6 months ago by pmcguire

Hi Yao:
You're using the wrong machine and also the wrong group workspace. Those have both been updated from jasmin-cylc to cylc1.jasmin and from /group_workspaces/jasmin2/jules to /gws/nopw/j04/jules .
Have you updated your setup according to the updated tutorial page I sent a couple of days ago?

Can I create an NCAS CMS helpdesk ticket for you about this topic? You can either ask questions at the helpdesk (which is publicly searchable) or at the jules-users email list.
Do you already have an NCAS CMS helpdesk account? If so, what is your username?
Patrick

comment:2 Changed 6 months ago by pmcguire

Hi Patrick,

OK, so you think I need to log in to cylc1.jasmin. I did not use the group workspace; I used my home directory for the driving and ancillary files for vn5.4_permafrost.
I have not done the setup you sent yet. However, I just found that vn5.8_permafrost works.

I do not have an NCAS CMS helpdesk account. Can I apply for one?

Many thanks,
Yao

comment:3 Changed 6 months ago by pmcguire

Hi Yao:
Yes, I can ask for an NCAS CMS helpdesk account for you.

You can see in the error message that you gave me that you're using the old jules group workspace since you're using an old version of JULES.
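
One quick way to confirm this from the failed link line is to list every -L directory on it that does not exist on the machine you are logged in to. The sketch below is only an illustration: missing_lib_dirs is a hypothetical helper name, not part of FCM, Rose or JULES, and it assumes the link line is passed as a single string.

```shell
# Hypothetical helper (not part of FCM or JULES): print every -L<dir> on a
# linker command line whose directory does not exist on this machine.
missing_lib_dirs() (
    set -f    # no filename globbing while the line is split into words
    for word in $1; do
        case "$word" in
            -L?*)
                dir=${word#-L}
                [ -d "$dir" ] || printf '%s\n' "$dir"
                ;;
        esac
    done
)
```

Paste the mpif90 line from job.err as the single quoted argument. If the old /group_workspaces/jasmin2/jules path is no longer mounted, its curl directory should appear in the output, which would be consistent with the "cannot find -lcurl" message.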
Patrick

comment:4 Changed 6 months ago by pmcguire

Hi Patrick,

It would be great if you could ask for an NCAS CMS account for me.

Now I see the error with the group workspace, but where do I set it? Is this the tutorial page you mentioned: https://research.reading.ac.uk/landsurfaceprocesses/software-examples/tutorial-rose-cylc-jules-on-jasmin/ ?
Can I find the answer there?

Best regards,
Yao

comment:5 Changed 6 months ago by pmcguire

Hi Gao:
Yes that's the tutorial page.
Patrick

comment:6 Changed 6 months ago by pmcguire

Hi Yao:
The group workspace path is hardwired in the JULES FORTRAN code.
Do you have that FORTRAN code downloaded and you use that version?
Or do you download it on the fly each time you run JULES?
Patrick

comment:7 Changed 6 months ago by pmcguire

I see. I have the JULES code for vn5.4_permafrost downloaded in my home directory, but I am linking to the branch, as you can see in rose-suite.conf. Does this mean I am downloading it on the fly each time I run JULES? I could set JULES_FCM to point to the code in my home directory.

Best regards,
Yao

comment:8 Changed 6 months ago by pmcguire

You'll have to make modifications to the JULES_FCM branch that you're using:
JULES_FCM='fcm:jules.x_br/dev/eleanorburke/vn5.4_permafrost'
For example, you can check out a copy of that branch and modify it to have the right jules group workspace for JASMIN, and then check that copy of the branch back in, and then use the new branch. If you use grep, you can find out where the group workspace is hardwired in the branch.
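
A minimal sketch of the grep step, assuming you already have a checked-out working copy of the branch on disk. The two paths are the old and new workspace locations mentioned earlier in this ticket; the function names are hypothetical, and `sed -i` without a suffix assumes GNU sed (as on Linux).

```shell
OLD_GWS='/group_workspaces/jasmin2/jules'
NEW_GWS='/gws/nopw/j04/jules'

# List the files in a working copy that still mention the old workspace path.
find_old_gws() {
    grep -rlF --exclude-dir=.svn -- "$OLD_GWS" "$1"
}

# Rewrite the old path to the new one, in place, in each of those files.
replace_old_gws() {
    find_old_gws "$1" | while IFS= read -r f; do
        sed -i "s|$OLD_GWS|$NEW_GWS|g" "$f"
    done
}
```

Run find_old_gws on the top directory of the working copy first and review the list before running replace_old_gws; the changes still need to be committed to a branch before a suite can pick them up through JULES_FCM.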

Patrick

comment:9 Changed 6 months ago by pmcguire

So pointing to the code in my home directory would not work? I have tried modifying the code in my home directory and pointing rose-meta.conf there, but it showed the same problem.

Moreover, I tried logging in to cylc1.jasmin and followed the setup here: https://code.metoffice.gov.uk/trac/jules/wiki/RoseJULESonJASMIN . However, what worked on jasmin-cylc does not work there, because it cannot connect to the Met Office repository.

[FAIL] config-file=/work/scratch-pw/ygao003/cylc-run/u-bx728/work/1/fcm_make/fcm-make.cfg:2
[FAIL] config-file= - https://code.metoffice.gov.uk/svn/jules/main/branches/dev/eleanorburke/vn5.8_permafrost/etc/fcm-make/make.cfg@18072
[FAIL] https://code.metoffice.gov.uk/svn/jules/main/branches/dev/eleanorburke/vn5.8_permafrost/etc/fcm-make/make.cfg@18072: cannot load config file
[FAIL] https://code.metoffice.gov.uk/svn/jules/main/branches/dev/eleanorburke/vn5.8_permafrost/etc/fcm-make/make.cfg@18072: not found
[FAIL] svn: E170013: Unable to connect to a repository at URL 'https://code.metoffice.gov.uk/svn/jules/main/branches/dev/eleanorburke/vn5.8_permafrost/etc/fcm-make/make.cfg'
[FAIL] svn: E215004: No more credentials or we tried too many times.
[FAIL] Authentication failed

[FAIL] fcm make -f /work/scratch-pw/ygao003/cylc-run/u-bx728/work/1/fcm_make/fcm-make.cfg -C /home/users/ygao003/cylc-run/u-bx728/share/fcm_make -j 4 # return-code=1
2020-09-18T12:14:13Z CRITICAL - failed/EXIT

Best regards,
Yao

comment:10 Changed 6 months ago by pmcguire

Hi Yao
Did you follow step 4 of the tutorial? The Met Office link in that step explains how to set up your MOSRS password caching with the new hostnames.
Patrick

comment:11 Changed 6 months ago by pmcguire

Oops, sorry, I was not careful enough when reading your tutorial. After doing step 4, vn5.8_permafrost, which worked on jasmin-cylc, now works on cylc1.jasmin.

vn5.4_permafrost still does not work, even though I pointed FCM to the vn5.4_permafrost copy in my home directory and changed group_workspaces to gws…; it now fails with the error below.

Maybe I should ask the NCAS helpdesk rather than disturb you, now that I have an account.

Thank you for your tremendous help today! It makes me feel much better to be out of this mess!

Cheers,
Yao

[FAIL] /home/users/ygao003/cylc-run/u-an231/share/fcm_make/fcm-make.lock: lock exists at the destination

[FAIL] fcm make -f /work/scratch-pw/ygao003/cylc-run/u-an231/work/1/fcm_make/fcm-make.cfg -C /home/users/ygao003/cylc-run/u-an231/share/fcm_make -j 4 # return-code=255
2020-09-18T14:19:24Z CRITICAL - failed/EXIT

comment:12 Changed 6 months ago by pmcguire

Hi Yao:
I am glad vn5.8_permafrost works now.

Based upon that error message "[FAIL] /home/users/ygao003/cylc-run/u-an231/share/fcm_make/fcm-make.lock: lock exists at the destination", you might just try to do a rose suite-run --new for the v5.4 local copy.
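
A lock like this is typically left behind when an earlier fcm make on the same destination was interrupted or is still running. As a small sketch, the check below reports whether a destination directory still holds the lock before you decide to start from scratch; fcm_lock_present is a hypothetical helper name, and the only assumption about FCM is the lock name shown in the error message.

```shell
# Hypothetical helper: report whether an fcm make destination still holds a
# lock. Prints the lock path and returns 0 if it exists, returns 1 otherwise.
fcm_lock_present() {
    if [ -e "$1/fcm-make.lock" ]; then
        printf 'lock present: %s\n' "$1/fcm-make.lock"
        return 0
    fi
    return 1
}
```

For this suite the argument would be ~/cylc-run/u-an231/share/fcm_make. If the lock is reported and no build is running, rose suite-run --new recreates the run directory and so clears it.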

I can put this all on the helpdesk. Don't worry.
Patrick

comment:13 Changed 6 months ago by pmcguire

After using rose suite-run --new, I get the errors below. Could you speculate on what they mean? Is something wrong with the link to the netCDF library?

[FAIL] mpif90 -oo/water_constants_mod.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/apps/libs/netCDF/intel14/fortran/4.2/include -heap-arrays -fp-model precise -traceback /home/users/ygao003/cylc-run/u-an231/share/fcm_make/preprocess/src/jules/src/params/standalone/water_constants_mod_jls.F90: command not found
[FAIL] compile 0.0 ! water_constants_mod.o ← jules/src/params/standalone/water_constants_mod_jls.F90
[FAIL] mpif90 -oo/veg_param.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/apps/libs/netCDF/intel14/fortran/4.2/include -heap-arrays -fp-model precise -traceback /home/users/ygao003/cylc-run/u-an231/share/fcm_make/preprocess/src/jules/src/science/params/veg_param_mod.F90: command not found
[FAIL] compile 0.0 ! veg_param.o ← jules/src/science/params/veg_param_mod.F90
[FAIL] mpif90 -oo/u_v_grid.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/apps/libs/netCDF/intel14/fortran/4.2/include -heap-arrays -fp-model precise -traceback /home/users/ygao003/cylc-run/u-an231/share/fcm_make/preprocess/src/jules/src/control/standalone/var/u_v_grid.F90: command not found
[FAIL] compile 0.0 ! u_v_grid.o ← jules/src/control/standalone/var/u_v_grid.F90
[FAIL] mpif90 -oo/trif_vars_mod.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/apps/libs/netCDF/intel14/fortran/4.2/include -heap-arrays -fp-model precise -traceback /home/users/ygao003/cylc-run/u-an231/share/fcm_make/preprocess/src/jules/src/control/shared/trif_vars_mod.F90: command not found
[FAIL] compile 0.0 ! trif_vars_mod.o ← jules/src/control/shared/trif_vars_mod.F90
[FAIL] mpif90 -oo/trif.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/apps/libs/netCDF/intel14/fortran/4.2/include -heap-arrays -fp-model precise -traceback /home/users/ygao003/cylc-run/u-an231/share/fcm_make/preprocess/src/jules/src/science/params/trif_mod.F90: command not found
[FAIL] compile 0.0 ! trif.o ← jules/src/science/params/trif_mod.F90
[FAIL] mpif90 -oo/trifctl.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/apps/libs/netCDF/intel14/fortran/4.2/include -heap-arrays -fp-model precise -traceback /home/users/ygao003/cylc-run/u-an231/share/fcm_make/preprocess/src/jules/src/control/shared/trifctl.F90: command not found
[FAIL] compile 0.0 ! trifctl.o ← jules/src/control/shared/trifctl.F90

comment:14 Changed 6 months ago by pmcguire

Hi Yao:
Not sure exactly. But I suspect that it does have to do with the MPI libraries, especially since maybe the suite is supposed to run on the SLURM batch nodes now instead of the LSF batch nodes, due to the upgrade to JASMIN this week.
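
The "command not found" at the end of each [FAIL] line suggests that the shell could not locate mpif90 itself, which would point at the environment rather than at netCDF. A minimal sketch for checking this, assuming it is run in the same environment as the build task; missing_commands is a hypothetical helper name.

```shell
# Hypothetical helper: print each named command that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

For example, `missing_commands mpif90 ifort` prints nothing when both compilers are available and prints the name of any that the current environment does not provide.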
Patrick

comment:15 Changed 6 months ago by pmcguire

Hi Patrick,

Many thanks again! I have decided to switch to vn5.8_permafrost for now…

Have a nice weekend!
Yao

comment:16 Changed 6 months ago by yaogao

Hi Patrick,

After our discussion last Friday, I tried to run vn5.8_permafrost and vn5.8 on the cylc1 machine, but I got MPI errors like the ones I showed at the end last Friday. However, vn5.8 and vn5.8_permafrost can be run on the jasmin-cylc machine… So is something hard-coded in the JULES setup not right?

Best regards,
Yao

comment:17 Changed 6 months ago by pmcguire

Hi Yao:
Do you use Rose/Cylc suites? What suite are you working with?

As I described in CMS Helpdesk ticket #3377, I was able to get JULES built and running with MPI turned on and with updated NETCDF libraries, with the GL7 gridded suite for SLURM. I updated that suite. (This SLURM version of the u-bb316 GL7 suite was developed by a couple of other people; I had to modify it further with input from one of those people). The SLURM version is u-bx723. This suite uses new MPI libraries and new NETCDF libraries (both for SLURM). Does that help?

If it doesn't help and you can't see the changes in that suite that you need in order to get SLURM working with MPI with your suite, let me know, and I will try to help further.
Patrick

comment:18 Changed 5 months ago by yaogao

Hi Patrick,

Sorry, I still need to ask you about this issue. I am still getting this MPI error in the build step. Strangely, last night the build step was OK, but the JULES spinup/run step failed at job submission with the error shown at the bottom. Before this, I got outputs from JULES on the jasmin-cylc machine last week (before Sept. 25th).

I had a look at the u-bx723 suite. I am not sure: would it solve the problem if I added all the MPI-related parts of suite.rc and rose-suite.conf from u-bx723? I do not understand all the code in suite.rc yet. My Rose/Cylc suite is u-bx728. I think you have permission to read it. Would you have time to take a look at my suite?

Thank you very much!
Yao

[jobs-submit cmd] cylc jobs-submit --utc-mode -- /home/users/ygao003/cylc-run/u-bx728/log/job 1/auchencorth_main/01 1/ca_wp1_main/01 1/degero_main/01 1/kopytkowo_main/01 1/lompolojankka_main/01 1/merbleue_main/01 1/siikaneva_main/01
[jobs-submit ret_code] 1
[jobs-submit out] 2020-10-01T21:15:07Z|1/kopytkowo_main/01|1|None
2020-10-01T21:15:07Z [STDERR] [FAIL] 'CYLC_TASK_ID'
2020-10-01T21:15:07Z [STDOUT] There was an error running the Slurm sbatch command.
2020-10-01T21:15:07Z [STDOUT] The command was:
2020-10-01T21:15:07Z [STDOUT] '/usr/bin/sbatch --wrap="#!/bin/bash -l
2020-10-01T21:15:07Z [STDOUT] #
2020-10-01T21:15:07Z [STDOUT] # ++++ THIS IS A CYLC TASK JOB SCRIPT ++++
2020-10-01T21:15:07Z [STDOUT] # Suite: u-bx728
2020-10-01T21:15:07Z [STDOUT] # Task: kopytkowo_main.1
2020-10-01T21:15:07Z [STDOUT] # Job log directory: 1/kopytkowo_main/01
2020-10-01T21:15:07Z [STDOUT] # Job submit method: lsf
2020-10-01T21:15:07Z [STDOUT]
2020-10-01T21:15:07Z [STDOUT] # DIRECTIVES:
2020-10-01T21:15:07Z [STDOUT] #BSUB -J u-bx728.kopytkowo_main.1
2020-10-01T21:15:07Z [STDOUT] #BSUB -o /home/users/ygao003/cylc-run/u-bx728/log/job/1/kopytkowo_main/01/job.out
2020-10-01T21:15:07Z [STDOUT] #BSUB -e /home/users/ygao003/cylc-run/u-bx728/log/job/1/kopytkowo_main/01/job.err
2020-10-01T21:15:07Z [STDOUT] #BSUB -q short-serial
2020-10-01T21:15:07Z [STDOUT] #BSUB -W 23:30
2020-10-01T21:15:07Z [STDOUT] #BSUB -n 1
2020-10-01T21:15:07Z [STDOUT] export CYLC_DIR='/apps/contrib/metomi/cylc-7.8.1'
2020-10-01T21:15:07Z [STDOUT] export CYLC_VERSION='7.8.1'
2020-10-01T21:15:07Z [STDOUT] export ROSE_VERSION='2019.01.0'
2020-10-01T21:15:07Z [STDOUT] CYLC_FAIL_SIGNALS='EXIT ERR XCPU TERM INT SIGUSR2'
2020-10-01T21:15:07Z [STDOUT]
2020-10-01T21:15:07Z [STDOUT] cylcjobinstcylc_env() {
2020-10-01T21:15:07Z [STDOUT] # CYLC SUITE ENVIRONMENT:
2020-10-01T21:15:07Z [STDOUT] export CYLC_CYCLING_MODE="integer"
2020-10-01T21:15:07Z [STDOUT] export CYLC_SUITE_FINAL_CYCLE_POINT="1"
2020-10-01T21:15:07Z [STDOUT] export CYLC_SUITE_INITIAL_CYCLE_POINT="1"
2020-10-01T21:15:07Z [STDOUT] export CYLC_SUITE_NAME="u-bx728"
2020-10-01T21:15:07Z [STDOUT] export CYLC_UTC="True"
2020-10-01T21:15:07Z [STDOUT] export CYLC_VERBOSE="false"
2020-10-01T21:15:07Z [STDOUT] export TZ="UTC"

comment:19 Changed 5 months ago by pmcguire

Hi Yao:
I can't access your ~ygao003/roses or ~ygao003/cylc-run directory on JASMIN. Maybe you can give me access to those directories?

I looked at Sarah Chadburn's suite that you pointed me to (u-bx728) in MOSRS. Have you checked in any changes to that suite for working with SLURM instead of LSF? Maybe you have another suite number or something. The code in u-bx728/include/jasmin/suite.rc looks like it needs updating, for example.

Patrick

comment:20 Changed 5 months ago by yaogao

Hi Patrick,

I do not understand why 'chmod o+r' does not work… Is there another command for giving you access?

I did not make any changes to u-bx728 for working with SLURM… Is there any guideline on how to update it for SLURM?

B.R
Yao

comment:21 Changed 5 months ago by pmcguire

Hi Yao:
If you look at this changeset that was done from u-bx722 to u-bx723 for Andy Wiltshire's and Carolina Duran-Rojas's global offline JULES GL7 suite, you can see the changes necessary on JASMIN to switch from LSF to SLURM batch processing:
https://code.metoffice.gov.uk/trac/roses-u/changeset?reponame=&new=171494%40b%2Fx%2F7%2F2%2F3%2Ftrunk&old=171386%40b%2Fx%2F7%2F2%2F2%2Ftrunk
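
To give a feel for the kind of change involved, the sketch below rewrites the #BSUB directives from the job script quoted in comment:18 as rough #SBATCH equivalents. It is illustrative only and is not taken from the changeset: bsub_to_sbatch is a hypothetical helper, it covers just the six options seen in that job script, and it assumes the Slurm partition keeps the LSF queue's name. Note that LSF's -W reads 23:30 as hours:minutes, whereas Slurm reads a bare 23:30 as minutes:seconds, so ":00" is appended.

```shell
# Illustrative only: translate a few #BSUB directives on stdin into rough
# #SBATCH equivalents on stdout. All other lines pass through unchanged.
bsub_to_sbatch() {
    sed -E \
        -e 's/^#BSUB -J (.*)$/#SBATCH --job-name=\1/' \
        -e 's/^#BSUB -o (.*)$/#SBATCH --output=\1/' \
        -e 's/^#BSUB -e (.*)$/#SBATCH --error=\1/' \
        -e 's/^#BSUB -q (.*)$/#SBATCH --partition=\1/' \
        -e 's/^#BSUB -W ([0-9]+:[0-9]+)$/#SBATCH --time=\1:00/' \
        -e 's/^#BSUB -n ([0-9]+)$/#SBATCH --ntasks=\1/'
}
```

In a Cylc suite the directives are generated from suite.rc, so the real edit belongs there (the batch system setting and the directives for each task family) rather than in the generated job scripts; the changeset linked above is the authoritative example.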

If you want to give other people read access to your ~ygao003/roses and ~ygao003/cylc-run, one way to do that is to make sure that they also have read access to your ~ygao003 directory. But if you do that, you might want to make sure that you don't accidentally enable read access to any confidential sub-directories.
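
The reason 'chmod o+r' on a sub-directory is not enough by itself is that reaching any directory requires search (x) permission on every directory above it. The sketch below walks up from a directory and prints each level that other users cannot pass through. untraversable_ancestors is a hypothetical helper; it looks only at the basic permission bits shown by ls -ld and ignores ACLs and group membership.

```shell
# Hypothetical helper: walk from a directory up to / and print every level
# that "other" users cannot traverse (no x bit for others).
untraversable_ancestors() (
    dir=$(cd "$1" && pwd) || exit 1
    while :; do
        # Character 10 of the ls -ld mode string is the x bit for others.
        mode=$(ls -ld "$dir" | cut -c10)
        case "$mode" in
            x|t) ;;                          # traversable (t = sticky plus x)
            *) printf '%s\n' "$dir" ;;
        esac
        if [ "$dir" = "/" ]; then break; fi
        dir=$(dirname "$dir")
    done
)
```

Running it on ~/roses or ~/cylc-run lists the directories that would need `chmod o+x` (and `o+r` if others should also be able to list them). Granting x without r on the home directory lets people reach a named sub-directory without being able to list everything else in it.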
Patrick

comment:22 Changed 5 months ago by pmcguire

Hi Yao:
Did this help?
Patrick

comment:23 Changed 5 months ago by yaogao

Hi Patrick,

This helped! After modifying the files according to the link, JULES vn5.8 now works for me on the correct machine with SLURM batch.

Thank you very much for your help!
Yao

comment:24 Changed 5 months ago by yaogao

I was also able to change the permissions of the files. -Yao

comment:25 Changed 3 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed