Opened 11 months ago

Last modified 2 weeks ago

#2527 accepted task

audit/profiling of JULES on CEDA JASMIN

Reported by: pmcguire Owned by: pmcguire
Component: JULES Keywords: JULES, audit, profiling, CEDA JASMIN
Cc: jgimenez Platform: Other
UM Version:

Description

Can you work with the POP service people to do an audit and profiling of JULES on CEDA JASMIN?

Change History (23)

comment:1 Changed 11 months ago by pmcguire

The POP service person has been working on this. He can run JULES on CEDA JASMIN with their profiling tools. Last I heard from him was in March. I will contact the CEDA JASMIN people to see if they can investigate the delay.

comment:2 Changed 11 months ago by pmcguire

  • Status changed from new to accepted

comment:3 Changed 11 months ago by pmcguire

I contacted Fatima Chami at CEDA. She contacted POP. And then Judit Gimenez from POP wrote back and said that they can't reach Martin either. So Judit proposed that somebody else from their consortium start this JULES on JASMIN study again from scratch in 2-3 weeks. I accepted their proposal.

comment:4 Changed 7 months ago by pmcguire

Judit Gimenez at POP is continuing to work on the study of the JULES global suite on CEDA JASMIN.

comment:5 Changed 6 months ago by pmcguire

Judit Gimenez at POP was successful in getting a POP-instrumented gridded JULES run working on CEDA JASMIN.

comment:6 Changed 5 months ago by pmcguire

Judit Gimenez at POP has delivered a preliminary profiling analysis of JULES on CEDA JASMIN. We plan to discuss this by phone soon.

comment:7 Changed 3 months ago by pmcguire

I discussed the situation with Judit Gimenez from POP several weeks ago. We decided that it would be best to repeat the analysis with a run over the whole globe instead of only over the UK.

comment:8 Changed 3 months ago by pmcguire

Hello Patrick,

After some unsuccessful tests, on Monday I submitted a job that does not
instrument the application, just to check whether I also get the core dumps
that way. In fact, I do not get the core dumps, but it seems the job does not
work either.

I had a look at the logs, but I am not sure what the problem is.

I saw in the log that it fails:

[judit@jasmin-cylc 01]$ pwd
/home/users/judit/cylc-run/u-ar790-16/log/job/1/jules/01

[judit@jasmin-cylc 01]$ more job-activity.log
[jobs-submit ret_code] 0
[jobs-submit out] 2019-03-04T13:14:56Z|1/jules/01|0|9098241
2019-03-04T13:14:56Z [STDOUT] Job <9098241> is submitted to queue
<par-multi>.
[(('event-mail', 'failed'), 1) ret_code] 0

But the error does not give me any clue... maybe I am looking in the wrong file?

[judit@jasmin-cylc suite]$ pwd
/home/users/judit/cylc-run/u-ar790-16/log/suite
[judit@jasmin-cylc suite]$ tail log.20190304T130657Z
2019-03-04T13:14:02Z INFO - [fcm_make.1] -(current:running)(polled)
started at 2019-03-04T13:06:59Z
2019-03-04T13:14:55Z INFO - [fcm_make.1] -(current:running)> succeeded
at 2019-03-04T13:14:55Z
2019-03-04T13:14:56Z INFO - [jules.1] -submit-num=1, owner@host=localhost
2019-03-04T13:14:57Z INFO - [jules.1] -(current:ready) submitted at
2019-03-04T13:14:56Z
2019-03-04T13:14:57Z INFO - [jules.1] -health check settings: submission
timeout=None, polling intervals=PT1M,…
2019-03-04T13:14:59Z INFO - [jules.1] -(current:submitted)> started at
2019-03-04T13:14:57Z
2019-03-04T13:14:59Z INFO - [jules.1] -health check settings: execution
timeout=None, polling intervals=PT1M,…
2019-03-04T13:15:05Z CRITICAL - [jules.1] -(current:running)>
failed/EXIT at 2019-03-04T13:15:04Z
2019-03-04T13:15:05Z CRITICAL - [jules.1] -job(01) failed
2019-03-04T13:15:06Z WARNING - suite stalled
[judit@jasmin-cylc suite]$

Thanks!

judit

comment:9 Changed 3 months ago by pmcguire

Hi Judit
For this rose/cylc suite, we have redirected the main error logs to the /work/scratch/judit/logs directory. One of your log files ( /work/scratch/judit/logs/err_16.txt ) says that you have an "Error reading namelist JULES_OUTPUT". I suspect this is because you tried to disable the NETCDF output, which I think is what you did. Maybe it's not possible to disable the output completely. Perhaps you can instead disable only certain variables in the output, if that would help, or maybe there is another way to modify the code so that the NETCDF output can be disabled completely.
Patrick McGuire

Last edited 3 months ago by pmcguire

comment:10 Changed 2 months ago by pmcguire

Hello Patrick,

I have tried to set some output so that it does not fail, but I may be doing something wrong, because the output file is empty and I get the same error.

What I did was to uncomment one of the profiles I had (the last one, which seemed short). The related lines are now:

[namelist:jules_output]
#nprofiles=3
nprofiles=1
output_dir='/work/scratch/judit/config/outputs'
run_id='Euro44_bvv_nancil_CTL-BCJ-NH_jules-vn4.9_u-ar790globeA'
(…)

[namelist:jules_output_profile(1)]
file_period=-2
nvars=3
output_end='2012-12-31 18:00:00'
output_main_run=.true.
output_period=10800
output_spinup=.false.
output_start='2000-01-01 00:00:00'
output_type='M'
profile_name='3-hourly-CiDiag'
var='ciDiag','gpp','npp'
var_name=

Do you have any clue? The configuration file is in /home/users/judit/roses/u-ar790-16/app/jules/rose-app.conf, and the logs and errors are under /work/scratch/judit/logs

Thanks!!
Judit

comment:11 Changed 2 months ago by pmcguire

Hi Judit
I did a diff (see below) between my u-ar790 project and your u-ar790-16 project.
I never attempted to get the namelist #[namelist:jules_output_profile(3)] working on CEDA JASMIN; that is one reason it was commented out.
You're now using that jules_output_profile(3) instead of the original jules_output_profile(1), presumably to make the runs less I/O-intensive.

I do see one problem with that namelist: the output_type should be output_type=3*'M' instead of output_type='M'. Also, the output start is output_start='2000-01-01 00:00:00', the output end is output_end='2012-12-31 18:00:00', and your main run end is main_run_end='1989-12-31 21:00:00' (unlike the original suite, which had main_run_end='2012-12-31 21:00:00'). So this will likely produce no output, since the run ends before the output window even begins.

I think that if you want reduced output, you might want to change output_period=10800 to output_period=21600 or longer, and profile_name='3-hourly-CiDiag' to profile_name='6-hourly-CiDiag'.

I don't think that ciDiag exists as a variable in JULES. This variable ciDiag was added in a special branch of JULES that we aren't using here. So you can get rid of that variable and adjust nvars and the repeat count in output_type accordingly.

An alternative is to revert back to the original #[namelist:jules_output_profile(1)],
reduce nvars,
reduce the repeat count in output_type,
get rid of most of the variables in var=,
and change the output_end date to be before the end of your run.
(A sketch of what this might look like is included below, after the diff.)

Patrick


`diff -r /home/users/pmcguire/roses/u-ar790 /home/users/judit/roses/u-ar790-16/
diff /home/users/pmcguire/roses/u-ar790/app/jules/rose-app.conf /home/users/judit/roses/u-ar790-16/app/jules/rose-app.conf
#some of the changes have been omitted here, for brevity's sake
364,378c364,378
< [namelist:jules_output_profile(1)]
< file_period=-2
< nvars=19
< output_end='2012-12-31 18:00:00'
< output_main_run=.true.
< output_period=21600
< output_spinup=.true.
< output_start='1979-01-01 00:00:00'
< output_type=19*'M'
< profile_name='6-hourly'
< var='esoil_gb','npp_gb','gpp_gb','smc_avail_tot','t_soil',
< ='ftl_gb','fsmc_gb','tstar_gb','latent_heat','ext',
< ='surf_roff','drain','pstar','rad_net','precip','et_stom_gb',
< ='sw_net','lw_net','smc_avail_top'
< var_name=
---

#[namelist:jules_output_profile(1)]
#file_period=-2
#nvars=19
#output_end='2012-12-31 18:00:00'
#output_main_run=.true.
#output_period=21600
#output_spinup=.true.
#output_start='1979-01-01 00:00:00'
#output_type=19*'M'
#profile_name='6-hourly'
#var='esoil_gb','npp_gb','gpp_gb','smc_avail_tot','t_soil',
# ='ftl_gb','fsmc_gb','tstar_gb','latent_heat','ext',
# ='surf_roff','drain','pstar','rad_net','precip','et_stom_gb',
# ='sw_net','lw_net','smc_avail_top'
#var_name=

401,412c401,412
< #[namelist:jules_output_profile(3)]
< #file_period=-2
< #nvars=3
< #output_end='2012-12-31 18:00:00'
< #output_main_run=.true.
< #output_period=10800
< #output_spinup=.false.
< #output_start='2000-01-01 00:00:00'
< #output_type='M'
< #profile_name='3-hourly-CiDiag'
< #var='ciDiag','gpp','npp'
< #var_name=
---

[namelist:jules_output_profile(1)]
file_period=-2
nvars=3
output_end='2012-12-31 18:00:00'
output_main_run=.true.
output_period=10800
output_spinup=.false.
output_start='2000-01-01 00:00:00'
output_type='M'
profile_name='3-hourly-CiDiag'
var='ciDiag','gpp','npp'
var_name=`
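
Putting these suggestions together, a reduced-output profile for your suite might look something like the following. This is only a sketch (untested, and the profile name and dates are illustrative): it keeps your gpp and npp variables, drops ciDiag, matches nvars and the output_type repeat count to the remaining two variables, switches to 6-hourly output, and moves output_end to before your main_run_end of '1989-12-31 21:00:00'.

[namelist:jules_output_profile(1)]
file_period=-2
nvars=2
output_end='1989-12-30 18:00:00'
output_main_run=.true.
output_period=21600
output_spinup=.false.
output_start='1979-01-01 00:00:00'
output_type=2*'M'
profile_name='6-hourly-reduced'
var='gpp','npp'
var_name=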

comment:12 Changed 7 weeks ago by pmcguire

Hello Patrick,

[redacted] I see the process runs much longer, and I think it is being killed
by the CPU time limit:

MPI Application rank 0 killed before MPI_Finalize() with signal 12
User defined signal 2
mpirun: propagating signal 12
2019-03-30T11:54:08Z CRITICAL - failed/SIGUSR2

Before resubmitting the jobs to the queue with a larger wall-clock limit, I would
like to eliminate some messages that each MPI task prints at each timestep. For
example:

{MPI Task 5} [INFO] next_time: Timestep: 55395; Started at: 1979-03-01 10:00:00
{MPI Task 5} [INFO] next_time: Spinup cycle: 4

Do you know how I can remove them?

Thanks!!

judit

comment:13 Changed 7 weeks ago by pmcguire

Hi Judit
To answer your first question:
Yes, your run for the whole globe is set for the standard number of spinup years (35), and it made it through 25 of them before the maximum wall-clock (48 hours) expired. The computer performance during spinup is probably different from that during the main run, which comes after the spinup. Maybe for testing purposes you can reduce the number of spinups requested in the namelists from 35 to 12? That might be a good place to start. The resulting model outputs won't be the same as for a 35-year spinup of the whole globe, but that probably is not much of a concern right now. The run could then maybe get through most or all of the ~10 years of the main run before the 48-hour wall-clock expires.
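
For reference, a minimal sketch of that change in app/jules/rose-app.conf (I'm assuming here that the spinup count is the max_spinup_cycles setting, in what I believe is the [namelist:jules_spinup] section; please double-check the section name in your copy of the suite):

[namelist:jules_spinup]
# reduce from the standard 35 spinup cycles to 12 for this timing test
max_spinup_cycles=12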


To answer your second question, I did the following greps:
grep -r 'next_time' /group_workspaces/jasmin2/jules/pmcguire/jules_build/jules-vn4.9_positiverain/*

grep -r 'next_time(' /group_workspaces/jasmin2/jules/pmcguire/jules_build/jules-vn4.9_positiverain/*

This led me to look in the file next_time.F90:
vi /group_workspaces/jasmin2/jules/pmcguire/jules_build/jules-vn4.9_positiverain/src/control/standalone/next_time.F90

where I saw that the boolean controlling the printing was print_step, which led me to do this grep:
grep -r 'print_step' /group_workspaces/jasmin2/jules/pmcguire/jules_build/jules-vn4.9_positiverain/*

where it became apparent that print_step is defined in the JULES namelist: [namelist:jules_time=print_step].

You're using JULES 4.9 (positive_rain branch from the trunk), so the documentation for the namelist is:
http://jules-lsm.github.io/vn4.9/namelists/timesteps.nml.html

You can define print_step to be whatever you want it to be in your app/jules/rose-app.conf.
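For example, something like the following should cut the per-timestep messages way down (a sketch only; I haven't tested this exact value, and the documentation page above describes the precise semantics of print_step):

[namelist:jules_time]
# print the timestep/spinup-cycle info only every 10000 timesteps
print_step=10000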
Does this work for you?
Patrick

comment:14 Changed 7 weeks ago by pmcguire

  • Cc jgimenez added

comment:15 Changed 7 weeks ago by pmcguire

Thanks, Patrick, for the detailed description!! I have made some changes and resubmitted the job. I will keep you informed.

Thanks!

judit

comment:16 Changed 6 weeks ago by pmcguire

Hello Patrick,

I am sorry but I cannot find how to add a comment to the ticket.

I reduced the number of spinups:

[judit@jasmin-cylc jules]$ grep -i spinup_cycles rose-app.conf
max_spinup_cycles=12
#max_spinup_cycles=0
[judit@jasmin-cylc jules]$

But the job has been killed again:

MPI Application rank 2 killed before MPI_Finalize() with signal 12
User defined signal 2

Should I modify other fields?

Thanks!
Judit

comment:17 Changed 6 weeks ago by pmcguire

Hi Judit:
According to your error log file /work/scratch/judit/logs/err_16.txt, your job failed with this user-defined signal at 9:28 AM European time on April 7:

MPI Application rank 2 killed before MPI_Finalize() with signal 12
User defined signal 2
mpirun: propagating signal 12
2019-04-07T08:28:13Z CRITICAL - failed/SIGUSR2

This is exactly 48 hours after the job was started. Your log file /work/scratch/judit/logs/log_16.txt says that it was started at 9:28 AM on April 5:

Sender: LSF System <lsfadmin@host243.jc.rl.ac.uk>
Subject: Job 1952879: <u-ar790-16.jules.1> in cluster <lotus> Exited

Job <u-ar790-16.jules.1> was submitted from host <jasmin-cylc.ceda.ac.uk> by user <judit> in cluster <lotus> at Fri Apr  5 09:28:08 2019.
Job was executed on host(s) <5*host243.jc.rl.ac.uk>, in queue <par-multi>, as user <judit> in cluster <lotus> at Fri Apr  5 09:28:09 2019.
                            <1*host066.jc.rl.ac.uk>
                            <2*host093.jc.rl.ac.uk>
                            <3*host088.jc.rl.ac.uk>
                            <5*host219.jc.rl.ac.uk>
</home/users/judit> was used as the home directory.
</> was used as the working directory.
Started at Fri Apr  5 09:28:09 2019.
Terminated at Sun Apr  7 09:28:14 2019.
Results reported at Sun Apr  7 09:28:14 2019.

I do note that JASMIN was having some group workspace problems (for /gws/nopw/j04/) on April 6, 7, and 8, but I don't know if these problems affected the scratch drive or the LOTUS batch processing. Maybe they slowed things down a bit.

Anyway, your job was started on April 5, and it should have started producing output files 1-2 hours after it started. If I do an ls -ltr /work/scratch/judit/config/outputs/Euro44_bvv_nancil_CTL-BCJ-NH_jules-vn4.9_u-ar790globeA.*, I see that you produced spinup files up to the 10th spinup year before the job ran out of time, which is consistent with the 48-hour wall-clock limit for the par-multi queue on JASMIN.

Maybe you can reduce the number of spinups to 2 or 3 or so, and then you will be guaranteed to get through some of the main simulation run before the wall clock expires?
I usually have to do continuation runs if it runs out of wallclock before the run is over.

You can look at the header info for the output:

[pmcguire@jasmin-sci2 jules]$ ncinfo  /work/scratch/judit/config/outputs/Euro44_bvv_nancil_CTL-BCJ-NH_jules-vn4.9_u-ar790globeA.12-hourly.spin8.1979.nc
<type 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): x(720), y(360), time(729), nt(2)
    variables(dimensions): float32 time_bounds(time,nt), float32 time(time), float32 latitude(y,x), float32 longitude(y,x), float32 latent_heat(time,y,x)

And you can see that it only has the latent_heat variable, as you desired. You can also view the output on jasmin-sci* (not jasmin-cylc) with ncview instead of ncinfo.

comment:18 Changed 2 weeks ago by jgimenez

Hello Patrick,

Apologies for the delay, but I have been out of the office most of the time last month, so I was not able to log in to JASMIN.

I just checked the output of my last job after reducing the number of spinups. How can I check whether it now covers enough simulation time before the job is killed?
Thanks!!
judit

comment:19 Changed 2 weeks ago by pmcguire

Hi Judit
You can do an ls -ltr /work/scratch/judit/config/outputs/.
This will show you which output files were created, when they were created, and their sizes.
The 12-hourly files are the output files, and the year for each file is in the file name.
The spin files were produced during the model spin-up, prior to the main run.
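For example, to list only the main-run output files (a sketch; it simply relies on the spin-up files having 'spin' in their names, as your outputs so far do):

ls -ltr /work/scratch/judit/config/outputs/ | grep -v spin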
Patrick

comment:20 Changed 2 weeks ago by jgimenez

Thanks, Patrick! Checking the files, I saw that it ran 4 years (from 1979 to 1982). Is that a long enough period for the analysis?

Thanks again!

judit

comment:21 Changed 2 weeks ago by pmcguire

Hi Judit
That should probably be enough to see if there is any difference in performance between 'before the main run started' and 'after the main run started'. It looks like the main run started at Apr 24 00:56.

You can tell that 1982 was not completed by the time your run finished, since its file size is different from that of the previous years.

If you want to see the output (for one year), you can:
ncview /work/scratch/judit/config/outputs/Euro44_bvv_nancil_CTL-BCJ-NH_jules-vn4.9_u-ar790globeA.12-hourly.1981.nc
This has one variable (Latent_heat flux), averaged over 12 hours.

As you might remember, I am personally particularly interested in how efficient the parallel I/O is (and how we might improve it), so this version of the suite with minimal I/O won't tell us everything. But it will give us great information on the performance of the parallel numerical computations on CEDA JASMIN.

Patrick

comment:22 Changed 2 weeks ago by jgimenez

Great! I will start to instrument this run, then. My first goal for the run without I/O is to check the scaling of the computations, as the previous analysis seems to indicate some significant code replication. Once this is verified with this large input, we can do some analysis of the I/O with this input case as well, if you think it would be different from the previous input case (I mean with respect to the ratio between computation and I/O).

Can you please remind me how many tasks I should use for this input? That run was with 16. Should I replicate it with 32, 64, and 128, as in the previous case?

thanks!

judit

comment:23 Changed 2 weeks ago by pmcguire

Hi Judit
I normally run the global JULES simulations with 16 or 32 processors. It would be great if we could improve things so that 64 or 128 processors would be worth using. When I tried 64 processors in the past, it didn't speed things up that much (with full I/O). If you can try 64 and 128 processors, I think that would be informative, especially for the global case (as opposed to the 'UK-region'-only case).
Patrick
