Opened 4 weeks ago

Closed 8 days ago

#3500 closed help (fixed)

Problems with JASMIN tutorial/SLURM

Reported by: scollins
Owned by: um_support
Component: JASMIN
Keywords: JASMIN, SLURM
Cc:
Platform: JASMIN
UM Version:

Description

Hi Patrick

I'm trying to run JULES on JASMIN for the first time in a year, and I'm struggling to get it running with the new setup. My goal is to get a version of JULES vn5.2 that I have modified running. The code appeared to compile OK, but then I get the attached error (job_collins.err, job_collins.out) when running JULES. After reading several tickets in the NCAS helpdesk, I decided to try to walk before I run, so I tried to get the Reading tutorial working. However, this also appears to compile OK, and then all of the JULES runs fail with the attached error (job_752.err, job_752.out).

Any advice you could give me would be much appreciated.

Many thanks

Sarah

Attachments (4)

job_752.err (2.5 KB) - added by scollins 4 weeks ago.
al-752
job_752.out (6.3 KB) - added by scollins 4 weeks ago.
al-752
job_collins.err (1.6 KB) - added by scollins 4 weeks ago.
Error with a version of vn5.2
job_collins.out (5.7 KB) - added by scollins 4 weeks ago.
Error with a version of vn5.2


Change History (20)

comment:1 Changed 4 weeks ago by pmcguire

Hi Sarah:
I looked at your job_752.err file. It looks like there is an error in the namelists.

The error suggests that your ~/roses/u-al752/ settings have created a file

/work/scratch-pw/sarcol/cylc-run/u-al752/work/1/jules_at_neu_presc0/./model_environment.nml

that has problems with the lsm_id setting.
I don't know why this is in your model_environment.nml file. This should be in your jules_lsm_switch.nml file.
So maybe you have lsm_id misconfigured in your ~/roses/app/jules/rose-app.conf file?

I can't see your ~/roses/u-al752 directory, but if you can change the permissions on your home directory so that I can see the files, that would help.
Patrick
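
One way to confirm which generated namelist file actually sets a variable such as lsm_id is to scan the task's work directory. The following is only a minimal sketch: the function name, the namelist group names and the file contents are hypothetical stand-ins written to a temporary directory, not the real contents of the cylc-run work directory.

```python
import re
import tempfile
from pathlib import Path


def find_namelist_variable(directory, variable):
    """Return (file name, namelist group, line number) for each assignment of `variable`."""
    hits = []
    # Match "variable =" at the start of a line, ignoring case and leading spaces.
    assign = re.compile(rf"^\s*{re.escape(variable)}\s*=", re.IGNORECASE)
    for path in sorted(Path(directory).glob("*.nml")):
        group = None
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            stripped = line.strip()
            if stripped.startswith("&"):
                # "&group_name" opens a namelist group.
                group = (stripped[1:].split() or [""])[0].lower()
            elif stripped == "/":
                # A lone "/" closes the current group.
                group = None
            elif assign.match(line):
                hits.append((path.name, group, lineno))
    return hits


# Hypothetical stand-ins for the files in the task's work directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "model_environment.nml").write_text(
        "&jules_model_environment\nlsm_id=1,\n/\n"
    )
    Path(tmp, "jules_lsm_switch.nml").write_text("&jules_lsm_switch\n/\n")
    demo_hits = find_namelist_variable(tmp, "lsm_id")

print(demo_hits)
```

Pointing the function at the real work directory would show whether lsm_id is being written to model_environment.nml, to jules_lsm_switch.nml, or to both.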

comment:2 Changed 3 weeks ago by scollins

Hi Patrick

Thanks for your help. I'm not sure why lsm_id was in model_environment.nml, but I've fixed that now. There were a number of variables in the namelists that aren't in JULES vn5.8 (in JULES_SURFACE, JULES_RADIATION), which I have now removed. However, I'm not sure what its issue is with JULES_LATLON - can you tell?

I think I've given you read permission to roses and cylc-run directories and their subdirectories (chmod -R +r roses), but let me know if it didn't work.

[FATAL ERROR] init_latlon: Error reading namelist JULES_LATLON (IOSTAT=19 IOMSG=invalid reference to variable in NAMELIST input, unit 1, file /work/scratch-pw/sarcol/cylc-run/u-al752/work/1/jules_cn_sw2_presc0/./model_grid.nml, line 8, position 10)

Thanks again!

Sarah

comment:3 Changed 3 weeks ago by pmcguire

Hi Sarah:
Line 8, position 10 of the file:

/work/scratch-pw/sarcol/cylc-run/u-al752/work/1/jules_cn_sw2_presc0/./model_grid.nml

corresponds to the = sign after const_val in the namelist below:

&jules_latlon
const_val=41.7902,111.8971,
constant_val='see conf','see conf',
/
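
The line and position in the IOMSG can be mapped back to a character with a few lines of Python. This is a sketch that assumes the reported line and position are both 1-based and that the line has no leading whitespace; only the offending line is known from the ticket, so the seven lines before it are empty placeholders.

```python
def char_at(text, line, position):
    """Return the character at a 1-based (line, position) pair."""
    return text.splitlines()[line - 1][position - 1]


# Placeholder file: seven empty lines, then the line quoted in the ticket.
sample = "\n" * 7 + "const_val=41.7902,111.8971,\n"

print(char_at(sample, 8, 10))  # prints "="
```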

In another run of u-al752, the same namelist has the following contents, in this file:

~pmcguire/cylc-run/u-al752shyland/work/1/jules_cn_sw2_presc0/model_grid.nml
&jules_latlon
latitude=41.7902,
longitude=111.8971,
/

Maybe your ~/roses/u-al752/app/jules/rose-app.conf file differs in the jules_latlon settings
from ~pmcguire/roses/u-al752/app/jules/rose-app.conf, where the settings are:

[namelist:jules_latlon]
latitude='see conf'
longitude='see conf'

I can't read your roses or cylc-run directories. Maybe you need to set up the permissions for your home directory too?
Patrick
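
The comparison of the two rose-app.conf files suggested above can be sketched with Python's configparser. This is an approximation only: Rose configuration files are INI-like but have their own syntax, which configparser does not fully interpret. The reference settings are the ones quoted in the comment; the "my_conf" text is a hypothetical user copy made up for illustration, and the function names are not part of Rose.

```python
import configparser


def section_settings(conf_text, section):
    """Parse INI-like text and return one section as a plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    return dict(parser[section]) if parser.has_section(section) else {}


def diff_section(mine, reference, section):
    """Report options that differ between two copies of one section."""
    a = section_settings(mine, section)
    b = section_settings(reference, section)
    return {
        "only_mine": sorted(set(a) - set(b)),
        "only_reference": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


# Settings quoted from ~pmcguire/roses/u-al752/app/jules/rose-app.conf.
reference_conf = """[namelist:jules_latlon]
latitude='see conf'
longitude='see conf'
"""

# Hypothetical user copy with an extra option, for illustration only.
my_conf = """[namelist:jules_latlon]
const_val=41.7902,111.8971
latitude='see conf'
longitude='see conf'
"""

result = diff_section(my_conf, reference_conf, "namelist:jules_latlon")
print(result)
```

With these sample inputs, the only difference reported is the extra const_val option in the hypothetical user copy.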

comment:4 Changed 3 weeks ago by scollins

Hi Patrick

Ah, sorry, yes that was me messing about with JULES_LATLON to try to get it working. It always seems to add "const_val=41.7902,111.8971," to the namelist. I've tried with your setup and this gives me:

&jules_latlon
const_val=41.7902,111.8971,
latitude='see conf',
longitude='see conf',

You should be able to read my home directory now, but let me know if it doesn't work.

Thank you for your help!

Sarah

comment:5 Changed 3 weeks ago by pmcguire

Hi Sarah:
I still can't read your home directory. Can you do a chmod -R g+rX on it?

Also, I guess you aren't using the latest version of u-al752 from MOSRS or you have modified it significantly.
Can you get things working by using the latest version of u-al752 from MOSRS?

Patrick

comment:6 Changed 3 weeks ago by scollins

Hi Patrick

I've deleted my copy of u-al752 and checked out a new copy. Unfortunately, I'm getting the same errors as before with MODEL ENVIRONMENT, etc. Can you see my files now? I did chmod -R g+rx on my home directory.

Thanks!

Sarah

comment:7 Changed 3 weeks ago by pmcguire

Hi Sarah:
I still can't see your home directory or its subdirectories.
When I do a ls -dl ~sarcol/, this is what I get:

drwx------ 1 sarcol users 0 Mar 25 14:54 /home/users/sarcol/

Those permissions only allow you to do things with your home directory.
Patrick

comment:8 Changed 3 weeks ago by scollins

Hi Patrick

I've tried chmod -R 777 *, which will presumably give read, write and execute permission. Does that work?

Sarah

comment:9 Changed 3 weeks ago by pmcguire

Hi Sarah:
No, it still doesn't work.
When I do ls -ld ~sarcol, I still get that your permissions are as follows:

drwx------ 1 sarcol users 0 Mar 25 17:55 /home/users/sarcol/

Maybe you can do chmod -R g+rX ~sarcol/?
That way, at least it gets the whole home directory and not just its contents.
You can check after you do it that it works, by doing ls -dl ~sarcol.
Patrick
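
The effect of g+rX on a single entry can be sketched in Python with the standard stat module. This is an illustration of the symbolic mode, not the chmod implementation: it adds group read everywhere, and adds group execute only for directories or for files that already have some execute bit. Note also that in chmod -R 777 * the shell expands * to the non-hidden entries inside the current directory, so the directory itself is left unchanged. The function name apply_g_rX is hypothetical.

```python
import stat


def apply_g_rX(mode):
    """Return `mode` with group read added, and group execute added
    only for directories or entries that are already executable."""
    new_mode = mode | stat.S_IRGRP
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if stat.S_ISDIR(mode) or (mode & any_exec):
        new_mode |= stat.S_IXGRP
    return new_mode


home_dir = stat.S_IFDIR | 0o700    # drwx------, as in the ls -ld output above
plain_file = stat.S_IFREG | 0o600  # a private, non-executable file

print(stat.filemode(home_dir), "->", stat.filemode(apply_g_rX(home_dir)))
print(stat.filemode(plain_file), "->", stat.filemode(apply_g_rX(plain_file)))
```

For the home directory this turns drwx------ into drwxr-x---, which lets members of the group list and enter the directory without making ordinary data files executable.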

comment:10 Changed 3 weeks ago by scollins

Hi Patrick

This is now:

drwxr-x--- 1 sarcol users 0 Mar 25 17:55 /home/users/sarcol

Is that correct?

Thanks

Sarah

comment:11 Changed 3 weeks ago by scollins

Hi Patrick

I didn't manage to get u-al752 running, but I've managed to get my own code and rose suite running now, so I'm going to give up with u-al752.

Thanks for your help!

Sarah

comment:12 Changed 3 weeks ago by pmcguire

Hi Sarah:
Thank you for reporting this. I can see your home directory now. Thank you.

By looking at your ~sarcol/roses/u-al752 directory, I realized that the tutorial needed updating from the January version, since the suite had been modified by others since then. I have updated the tutorial.

The key thing that needed updating was JULES_REVISION: the value '18213' was being used until sometime after January, whereas the suite is now at a later version.

I have updated the tutorial at https://research.reading.ac.uk/landsurfaceprocesses/software-examples/tutorial-rose-cylc-jules-on-jasmin/ to match the new JULES FLUXNET suite update from JULES 5.8 to JULES 6.0. The key steps in the tutorial that have been updated are steps 8 and 9. I have also added a step 11 about SLURM, if you're interested.

If you have some time, can you try running u-al752 again, following the revised tutorial? I have also updated the MOSRS version of the suite to match the SLURM configuration on JASMIN better, so it might be worth checking out or updating your copy of the suite.

Thanks
Patrick

comment:13 Changed 3 weeks ago by scollins

Hi Patrick

Yes, that's working now, thanks!

Sarah

comment:14 Changed 3 weeks ago by pmcguire

Hi Sarah:
It's all working? (the u-al752 suite, that is) Excellent! Great!
If you see anything else that needs fixing, let me know.
I will keep this ticket open for at least a few days, in case you see anything else.
After that, I will close the ticket.
Patrick

comment:15 Changed 3 weeks ago by scollins

Yes, all working, thanks!

Sarah

comment:16 Changed 8 days ago by pmcguire

  • Resolution set to fixed
  • Status changed from new to closed