Opened 11 months ago

Closed 10 months ago

Last modified 10 months ago

#3443 closed help (answered)

v8.4 UM-UKCA reconfiguration failing with script invoking unknown "srun" command instead of usual "aprun" command.

Reported by: gmann
Owned by: um_support
Component: UM Model
Keywords: reconfiguration
Cc: elsd
Platform: ARCHER
UM Version: 8.4

Description (last modified by ros)

Dear NCAS-CMS Helpdesk,

I am trying to submit a GA4 UM-UKCA NRUN simulation to ARCHER from PUMA (UM v8.4), and there seems to have been a change in the basic UM scripts that are generated to run the UM reconfiguration.

I've set the job (xnels) to compile the model executable and the reconfiguration executable, and to run the reconfiguration executable and then the model executable.

This is the standard method we use for an NRUN, to reconfigure a previous dump file, changing the date to match where we're restarting the model from.

In this case I'm actually running from the same start date, but I need to run the reconfiguration because of a slight glitch in the way the UM-UKCA dump files are set up and provided to re-start the model (but that's another story).

When I submit the job, the model and reconfiguration executables compile successfully. For example, the file

/home/n02/n02/gmann/output/xnels000.xnels.d21004.t160918.comp.leave

has "build is OK" for UMSCRIPTS, UMATMOS and UMRECON:

gmann@eslogin004:~/output> grep 'is OK' xnels000.xnels.d21004.t160918.comp.leave
UMSCRIPTS build is OK
UMATMOS build is OK
UMRECON build is OK
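
The same check can be scripted so that a missing component is reported explicitly. This is only a minimal sketch: the function name builds_ok is hypothetical, and it assumes the .comp.leave file contains the "<component> build is OK" lines exactly as shown in the grep output above.

```shell
# Succeed only if every named component reports "build is OK" in the log.
# builds_ok is a hypothetical helper, not part of the UM scripts.
builds_ok() {
  log=$1
  shift
  for comp in "$@"; do
    if ! grep -q "${comp} build is OK" "$log"; then
      echo "missing: $comp"
      return 1
    fi
  done
}

# Usage (file name taken from the ticket):
#   builds_ok xnels000.xnels.d21004.t160918.comp.leave UMSCRIPTS UMATMOS UMRECON
```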

However, when the job proceeds to run the reconfiguration, it fails to start because it seems to be trying to run the command "srun" instead of the usual command "aprun".

Comparing with a recent, similar UM-UKCA run (xooos), I noticed that the sections of the "qsrecon" script that used "aprun" seem to have been replaced with "srun":

gmann@eslogin004:~/output> diff /work/n02/n02/gmann/um/xnels/bin/qsrecon /work/n02/n02/gmann/um/xoooq/bin/qsrecon
121c121,135
< if [[ $CRAYMPP = true ]]; then
---
> if [[ $XC40 = true ]]; then
>   export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               
>   cd $PBS_O_WORKDIR
>   OMP_NUM_THREADS=${OMP_NUM_THREADS:-1}
>   HYPERTHREADS=${HYPERTHREADS:-1}
>   if (( $RCF_NPES < $NTASKS_PER_NODE ))
>   then
>      # Running underpopulated. 
>      NTASKS_PER_NODE=$RCF_NPES
>   fi
>   echo aprun -ss -cc cpu -n $RCF_NPES -N $NTASKS_PER_NODE \
>       -d $OMP_NUM_THREADS -j $HYPERTHREADS $LOADRECON >>$OUTPUT
>   aprun -ss -cc cpu -n $RCF_NPES -N $NTASKS_PER_NODE \
>       -d $OMP_NUM_THREADS -j $HYPERTHREADS $LOADRECON >>$OUTPUT
> elif [[ $CRAYMPP = true ]]; then
126,127c140,143
<   echo srun --cpu-bind=cores $LOADRECON >> $OUTPUT
<   srun --cpu-bind=cores $LOADRECON >> $OUTPUT
---
>   echo aprun -n $RCF_NPES -N $NTASKS_PER_NODE -d 1 \
>       -S $NTASKS_PER_NUMANODE -ss $LOADRECON >> $OUTPUT
>   aprun -n $RCF_NPES -N $NTASKS_PER_NODE -d 1 \
>       -S $NTASKS_PER_NUMANODE -ss $LOADRECON >> $OUTPUT

Looking within the xnels qsrecon script, I can see that the lines after the "RCF OUTPUT" section now read:

if [[ $CRAYMPP = true ]]; then
  mpprun -n$RCF_NPES $LOADRECON >>$OUTPUT
elif [[ $IBM = true && $MPP = true ]]; then
  poe $LOADRECON -procs $RCF_NPES >>$OUTPUT
elif [[ $XT4 = true && $MPP = true ]]; then
  echo srun --cpu-bind=cores $LOADRECON >> $OUTPUT
  srun --cpu-bind=cores $LOADRECON >> $OUTPUT

In the previous (similar) job xooos, the same part of the script has an extra $XC40 branch, and the $XT4 = true && $MPP = true else-if clause runs aprun:

if [[ $XC40 = true ]]; then
  export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
  cd $PBS_O_WORKDIR
  OMP_NUM_THREADS=${OMP_NUM_THREADS:-1}
  HYPERTHREADS=${HYPERTHREADS:-1}
  if (( $RCF_NPES < $NTASKS_PER_NODE ))
  then
     # Running underpopulated. 
     NTASKS_PER_NODE=$RCF_NPES
  fi
  echo aprun -ss -cc cpu -n $RCF_NPES -N $NTASKS_PER_NODE \
      -d $OMP_NUM_THREADS -j $HYPERTHREADS $LOADRECON >>$OUTPUT
  aprun -ss -cc cpu -n $RCF_NPES -N $NTASKS_PER_NODE \
      -d $OMP_NUM_THREADS -j $HYPERTHREADS $LOADRECON >>$OUTPUT
elif [[ $CRAYMPP = true ]]; then
  mpprun -n$RCF_NPES $LOADRECON >>$OUTPUT
elif [[ $IBM = true && $MPP = true ]]; then
  poe $LOADRECON -procs $RCF_NPES >>$OUTPUT
elif [[ $XT4 = true && $MPP = true ]]; then
  echo aprun -n $RCF_NPES -N $NTASKS_PER_NODE -d 1 \
      -S $NTASKS_PER_NUMANODE -ss $LOADRECON >> $OUTPUT
  aprun -n $RCF_NPES -N $NTASKS_PER_NODE -d 1 \
      -S $NTASKS_PER_NUMANODE -ss $LOADRECON >> $OUTPUT

I'm wondering if this is maybe an issue with the scripts being changed for the new ARCHER machine?

Do I need to make an edit to the UM job to generate the original aprun version of the qsrecon script?
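
One quick way to tell whether the generated script matches the machine is to check which launcher is actually on the PATH of the login or batch environment. This is a minimal sketch; the function name first_available is hypothetical, and it only relies on the standard `command -v` builtin.

```shell
# Print the first command from the argument list that is found on PATH.
# Returns non-zero (and prints nothing) if none of them is available.
# first_available is a hypothetical helper, not part of the UM scripts.
first_available() {
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}

# Report which of the two launchers discussed in this ticket is present.
first_available aprun srun || echo "neither aprun nor srun is on PATH"
```

If this prints aprun while qsrecon calls srun (or the other way round), the scripts were generated for a different machine from the one the job is running on.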

Thanks for your help with this,

Best regards,

Cheers
Graham


Change History (7)

comment:1 Changed 11 months ago by ros

Hi Graham,

Yes, this is because we have put in changes to get vn8.4 running on ARCHER2.

In the window FCM Configuration → FCM Options for the Atmosphere and Reconfiguration, set the revision number for the branch fcm:um_br/pkg/Config/vn8.4_ncas/src to 20509 and resubmit. That should then build the scripts correctly for ARCHER.
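
After resubmitting, one way to confirm that the regenerated qsrecon calls aprun again is to list the lines that actually invoke a launcher. This is only a sketch: the function name launcher_calls is hypothetical, and the list of launchers (aprun, srun, mpprun, poe) is taken from the qsrecon excerpts in the description.

```shell
# List, with line numbers, the lines of a script that invoke a parallel
# launcher. Lines beginning with "echo" only log the command and are skipped
# because the pattern requires the launcher name to start the line.
# launcher_calls is a hypothetical helper, not part of the UM scripts.
launcher_calls() {
  grep -nE '^[[:space:]]*(aprun|srun|mpprun|poe)[[:space:]]' "$1"
}

# Usage (path taken from the ticket):
#   launcher_calls /work/n02/n02/gmann/um/xnels/bin/qsrecon
```

For ARCHER, the line reported inside the $XT4 branch should start with aprun rather than srun.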

Regards,
Ros.

comment:2 Changed 11 months ago by gmann

Hi Ros,

Ah, OK — right.

It's interesting to hear about the work to get vn8.4 running on ARCHER.

I must admit I am pleased to hear that it will potentially still be available
to run after the transition to ARCHER-2, as I had thought that previous
communication indicated that would not be the case.

I've updated the job in the UMUI and re-submitted that just now.

Best regards,

Cheers
Graham

comment:3 Changed 11 months ago by grenville

Hi Graham

We'd planned not to support UMUI jobs on ARCHER2, relying instead on NEXCS for that. But, since the decision was made to switch off NEXCS, we (semi-reluctantly) ported the UMUI for UMs 8.4 and 7.3.

Grenville

comment:4 Changed 11 months ago by gmann

Hi Grenville,

I was not aware that there was a strategy to support v7.3 and v8.4 UMUI jobs to submit to NEXCS.

The communications I'd been party to were that the UMUI versions were simply not going to be
supported on any of the new systems. It's reassuring to learn that was not the case,
although, that being so, it has caused a lot of unnecessary stress and worry to those of us
trying to plan for the progression of the model development in future years.

I had also not heard the NEXCS system was to be switched off, but maybe that machine was
only ever intended for an interim period?

Anyway, I appreciate you giving us a heads-up about this, and I'm pleased to hear that
the v7.3 UM will also be ported to ARCHER-2. There is a lot of science that was carried
out with UM-UKCA at v7.3 (and even more so at v8.4) which will have a much better
chance of making its way into UKESM2 (and other configurations) with that being the case.

I appreciate it makes more work for the NCAS-CMS team, but (in my opinion) the investment
will be of considerable value to the UKESM and UKCA communities, both on the University side
and at the Met Office, in terms of the developments at those versions that can now
be recovered and potentially included in future UKESM/UM/UKCA configurations.

Best regards,

Cheers
Graham

comment:5 Changed 11 months ago by gmann

comment:6 Changed 10 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed

comment:7 Changed 10 months ago by ros

  • Description modified (diff)