Opened 7 months ago

Last modified 6 months ago

#3501 new help

Open MPI or JULES problem in two suites from two users on JASMIN

Reported by: pmcguire
Owned by: um_support
Component: JASMIN
Keywords: Open MPI, JASMIN, JULES
Cc: NoelClancy
Platform: JASMIN
UM Version:

Description (last modified by pmcguire)

Hi CMS Helpdesk:
I just got my suite u-cc615 to run offline/standalone JULES almost all the way through the RECON step, which means that I have most of the ancillaries, driving data, and namelists configured correctly or nearly so.

But I saw the same or similar errors that NoelClancy (JASMIN user: nmc) is currently seeing in his suite u-cd187, as described in CMS Helpdesk ticket #3494.

Here are my log files. I am not quite sure what to make of them right now. Any suggestions? I don't know whether it is some sort of Open MPI problem or a JULES problem; there are error messages related to both.

in: ~/cylc-run/u-cc615/log/job/19790101T0000Z/RECON/22/job.out

{MPI Task 0} [INFO] init: Initialisation is complete

Primary job terminated normally, but 1 process returned a non-zero exit code. 
Per user-direction, the job has been aborted.

and:

in: ~/cylc-run/u-cc615/log/job/19790101T0000Z/RECON/22/job.err

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
jules.exe          0000000000804E73  Unknown               Unknown  Unknown
libpthread-2.17.s  00007FA98EB99630  Unknown               Unknown  Unknown
jules.exe          000000000066E293  qsat_mod_mp_qsat_         118  qsat_mod.F90
jules.exe          00000000007A5607  sf_stom_mod_mp_sf         671  sf_stom_jls_mod.F90
jules.exe          00000000007656C6  physiol_mod_mp_ph         961  physiol_jls_mod.F90
jules.exe          0000000000677465  sf_expl_l_mod_mp_         796  sf_expl_jls.F90
jules.exe          00000000005C74D7  surf_couple_expli         497  surf_couple_explicit_mod.F90
jules.exe          0000000000410568  control_                  571  control.F90
jules.exe          000000000040CCD3  MAIN__                    136  jules.F90
jules.exe          000000000040CB92  Unknown               Unknown  Unknown
libc-2.17.so       00007FA98E5DA555  __libc_start_main     Unknown  Unknown
jules.exe          000000000040CAA9  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40658,1],9]
  Exit code:    174
--------------------------------------------------------------------------
[host435.jc.rl.ac.uk:111775] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 702
[host435.jc.rl.ac.uk:111775] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 702
[host435.jc.rl.ac.uk:111775] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 711
[FAIL] rose-jules-run <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=174
2021-03-23T16:59:44Z CRITICAL - failed/EXIT


Patrick

Change History (4)

comment:1 Changed 7 months ago by pmcguire

  • Description modified

comment:2 Changed 7 months ago by pmcguire

Hi CMS Helpdesk
I have been able to get past the RECON phase, to the spinup phase.
It turned out that I had some of the namelist time ranges set incorrectly for the driving data and the prescribed data.
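For reference, the settings involved are the time-range items in the drive and prescribed-data namelists. A minimal sketch, assuming the standard JULES namelist names; the dates and periods are illustrative, not the actual values from u-cc615:

&jules_drive
  ! The data window must cover the whole run, including spinup cycles
  data_start  = '1979-01-01 00:00:00',   ! illustrative
  data_end    = '1980-01-01 00:00:00',   ! illustrative
  data_period = 3600,                    ! seconds per driving-data time step
/

&jules_prescribed_dataset
  ! The prescribed data need a consistent time range as well
  data_start  = '1979-01-01 00:00:00',   ! illustrative
  data_end    = '1980-01-01 00:00:00',   ! illustrative
  data_period = -1,                      ! e.g. -1 denotes monthly data
/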
It still crashes in the spinup phase, and sometimes in the RECON phase, but I am making progress.
It has also been helpful to set up output profiles at the time frequency of the driving data, and to make sure that the output NetCDF file has an ending time before the point at which the suite currently crashes. That way the NetCDF file is closed properly, and I can see what was going on before the crash.
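As a sketch, such an output profile might look like the following, assuming the standard JULES output namelist names; the profile name, period, end time, and variable choices are hypothetical:

&jules_output_profile
  profile_name    = 'drive_diag',            ! hypothetical name
  output_main_run = .true.,
  output_period   = 3600,                    ! match the driving-data period
  output_start    = '1979-01-01 00:00:00',   ! illustrative
  output_end      = '1979-01-05 00:00:00',   ! end before the crash point
  nvars           = 2,
  var             = 't1p5m'  'q1p5m',        ! hypothetical variable choices
  output_type     = 'S'      'S',            ! snapshots rather than means
/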
Patrick

Last edited 7 months ago by pmcguire

comment:3 Changed 6 months ago by pmcguire

Hi CMS Helpdesk
The near-surface air temperature was getting very cold within one day of starting RECON or spinup from idealized conditions, and the suite crashed soon thereafter. It turned out that I had mistakenly used lw_down instead of lw_net (and sw_down instead of sw_net) in the driving-data namelists. Since my driving-data files contain negative lw_net values, JULES interpreted them as lw_down, and the model got rather cold. JULES prefers lw_down and sw_down, but those are not the variables in the data files that I produced.
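For comparison, here is a sketch of the corrected variable naming in the drive namelist, abbreviated to the two fluxes in question; the file-variable names and interpolation flags are hypothetical:

&jules_drive
  nvars    = 2,                     ! abbreviated; a real suite lists more variables
  var      = 'sw_net'   'lw_net',   ! previously mislabelled 'sw_down' / 'lw_down'
  var_name = 'swnet'    'lwnet',    ! hypothetical names inside the NetCDF files
  interp   = 'nf'       'nf',       ! hypothetical interpolation flags
/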

When I run with lw_net and sw_net instead of lw_down and sw_down (as JULES allows), there seem to be new problems: the near-surface air temperature gets very cold in one of the corner grid cells after a few hours, so maybe I am doing something else wrong. I will keep at it.
Patrick

Last edited 6 months ago by pmcguire

comment:4 Changed 6 months ago by pmcguire

Hi CMS Helpdesk
So the problem that I was having with "the near-surface air temperature getting very cold in one of the corner grid cells" was probably caused, at least in part, by not setting l_albedo_obs=.true. in my suite. I had set it to .false. because I didn't have the required prescribed variables, albobs_vis and albobs_nir, ready in ancillary files. The original PRIMAVERA suite that was run on ARCHER2 used l_albedo_obs=.true., and since I am using its output as driving data for the JULES suite on JASMIN, I should use the same setting. Basically, the suite didn't know the proper observed albedos to use.
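As a sketch, the settings being switched on look roughly like this, assuming the standard JULES namelist names; the ancillary file name is hypothetical, and other required items are omitted for brevity:

&jules_radiation
  l_albedo_obs = .true.,   ! scale model albedos towards the observed values
/

&jules_prescribed_dataset
  ! time-range items (data_start, data_end, data_period) omitted for brevity
  file  = 'albobs_ancil.nc',            ! hypothetical ancillary file
  nvars = 2,
  var   = 'albobs_vis'  'albobs_nir',
/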

After creating new ancillary files for albobs_vis and albobs_nir, I have adapted the suite and am now trying to rerun it.
Patrick
