Opened 5 months ago

Closed 3 months ago

#3211 closed help (answered)

ARCHER-specific packing error

Reported by: Leighton_Regayre Owned by: um_support
Component: UM Model Keywords: ARCHER packing memory
Cc: thomas.langton@…, lucia.deaconu@…, chmcsy@… Platform: ARCHER
UM Version: 11.1

Description

Hello,

Our team has come across a 'WGDOS Packing' error that occurs when suites are submitted to ARCHER, but not when they are submitted to Monsoon. On Monsoon, all fields are unpacked and the data are as expected. However, with only the minimal changes needed to make the suites run on ARCHER, the packing error re-emerges.

We have spent some weeks discussing and testing possible causes of the packing error (following #3148) with Mohit Dalvi and also with Alejandro Bodas-Salcedo, who is responsible for the diagnostics that trigger the packing error (e.g. 2,468). Our efforts and tests lead us to believe the error is ARCHER-specific and is possibly caused by memory allocation.

We think the problem may be memory-related because the code that calculates the COSP diagnostics is memory-intensive, and there are architecture differences between ARCHER and Monsoon (XCS). The maximum number of processors per node that can be requested on Monsoon is 36, rather than 24 as on ARCHER. On Monsoon, parallel jobs are given exclusive access to the full node, which is reserved for the job in question. Is this the case on ARCHER?

I attempted a test using suite u-bs158 where I set the processors_per_node request to 20 (rather than 24). In this case the number of E-W processors (MAIN_ATM_PROCX) is 16 and the number of N-S processors (MAIN_ATM_PROCY) is 24 (rather than the default of 12), i.e. 16 x 24 = 384 MPI tasks spread over more nodes than a fully packed request would use. The point of this test was to see whether allocating fewer processes to each requested node would leave spare capacity for the memory-intensive calculations. Apparently this works on Monsoon.

As with all tests using these diagnostics on ARCHER, the atmos_main task failed as soon as it reached the COSP diagnostics, in this case 2,468.

Our team has worked on this problem extensively over the past few weeks and we've exhausted all the possible causes we can conceive of, so we're very much relying on CMS expertise to help us get to the bottom of our problem.

Thanks,

Leighton

Attachments (4)

job.err (18.2 KB) - added by langtont 5 months ago.
atmos_main job.err
job.2.err (18.2 KB) - added by langtont 5 months ago.
atmos_main job.err (accidental duplicate upload)
job.out (689.5 KB) - added by langtont 5 months ago.
Actual atmos_main job.out
build.out (11.4 KB) - added by langtont 5 months ago.
fcm_make_um job.out

Change History (41)

comment:1 Changed 5 months ago by langtont

Just two things to avoid later confusion.

First, on Monsoon the fields are packed using the New Climate packing profile (5), not left unpacked. New Climate is the packing profile we want to use on ARCHER too.

Secondly, diagnostic 2,468 is a new output that is not in the ordinary STASHmaster files; however, no additional calculation goes on within the model code. Diagnostics 2,468–2,472 just read the calculated MMRs from the cosp_sghydro class that is already used for many COSP calculations, and output them. There should be no reason for infeasible values to be produced for this diagnostic, other than COSP itself having calculated nonsense values for this variable.

Tom

Last edited 5 months ago by langtont

comment:2 Changed 5 months ago by grenville

What exactly is the problem with the unpacked ARCHER data?

Grenville

comment:3 Changed 5 months ago by Leighton_Regayre

Hi Grenville,

Obviously we can't check log files on ARCHER at the moment, so I can't show you the exact error. However, from my recollection it takes the same form as Lucia's error in ticket #3148, though it is now related to diagnostic 2,468 instead of 2,351. We overcame the issues with 2,351 by altering the initialisation value of y%sunlit in the COSP code (correct me if needed, Tom).

It would be very useful to know how the excess processors on each node are allocated on ARCHER.

Thanks,

Leighton

comment:4 Changed 5 months ago by grenville

Leighton

Monsoon and ARCHER work the same way in respect of underpopulating nodes. It sounds like you have underpopulated the nodes and requested more processors - that will give more memory per process on two fronts. I doubt memory is the problem - insufficient memory usually causes model failure with seg faults or similar.

I wasn't asking about error messages - if you write out unpacked data on ARCHER and compare it with unpacked data on Monsoon, where are the differences? Monsoon typically uses a different version of the Cray compiler (& other software) - you could try to mimic the Monsoon environment on ARCHER by loading the appropriate models, or, more simply, try to build COSP with low-optimization compiler options.

But maybe you have already tried this?

Grenville

comment:5 Changed 5 months ago by grenville

That should say 'loading the appropriate modules'.

comment:6 Changed 5 months ago by Leighton_Regayre

Hi Grenville,

Thanks for the suggestions. It's good to know Monsoon and ARCHER treat underpopulated nodes in the same way and that memory issues are unlikely to be the cause of our problem.

Tom Langton had already tested your suggestion to mimic environments, but he did it in reverse, mimicking the ARCHER environment on Monsoon, and his suite ran without issue.

I am currently running tests to compare output packed and unpacked for 2,351 (suites u-bs158 and u-bs369).

Lucia, Tom and I have discussed your suggestions, and none of us has the experience to 'build COSP with low optimization compiler options'. Are there relatively straightforward changes we can make to our suite, or would this require in-depth code changes?

Thanks,

Leighton

Last edited 5 months ago by Leighton_Regayre

comment:7 Changed 5 months ago by grenville

Leighton

Let me try it out - it's easier to provide an example.

Grenville

comment:8 Changed 5 months ago by Leighton_Regayre

Grenville,

Thanks very much! An example will be very useful.

In the meantime, my test with suite u-bs369, where the packing for the only COSP diagnostic requested (2,351) is set to 'unpacked', leads to a packing error in 2,201. I hadn't had any errors with the 2,201 diagnostic until the COSP diagnostics were introduced. Doesn't this suggest that data from another diagnostic is erroneously being passed to the 2,201 packing?

For contrast, the suite u-bs158 is identical to u-bs369 apart from the packing settings for the p11 output stream reserved for COSP diagnostics. Its atmos_main task fails with the packing error for 2,351.

Thanks for your help,

Leighton

comment:9 Changed 5 months ago by grenville

What is the Monsoon equivalent suite referred to above? Building COSP with the lowest level of optimization didn't help.

Grenville

comment:10 Changed 5 months ago by langtont

Hi Grenville,

The Monsoon suite for which everything works is u-bs049.

Tom

comment:11 Changed 5 months ago by grenville

Tom

Where is the output for that?

Grenville

comment:12 Changed 5 months ago by grenville

u-bs158 and u-bs049 are not identical? The sources differ.

Grenville

comment:13 Changed 5 months ago by Leighton_Regayre

Hi Grenville,

Thanks for attempting to build COSP with the lowest level of compiler optimization. I appreciate the effort, even though it didn't pay off.

The apparent differences in sources are not the problem. I merged my and Lucia's branches in an effort to isolate the cause of this packing problem. So, in u-bs158 Lucia's branch has been removed. I also included a branch suggested by Alejandro which is associated with a COSP bug fix, but this hasn't affected the packing error in any way.

We're all very confused by this platform specific difference in COSP packing.

Thanks,

Leighton

comment:14 Changed 5 months ago by grenville

Leighton

OK - thanks, I'll pursue this when ARCHER cooperates.

Grenville

comment:15 Changed 5 months ago by langtont

Hi Grenville,

I didn't archive the u-bs049 data to MASS as it's just a test suite, so it's still sat in my cylc-run directory. I'll make it 744 so you can read it.

Tom

comment:16 Changed 5 months ago by grenville

I really wanted to see job.out and job.err (for the build & the run)

comment:17 Changed 5 months ago by Leighton_Regayre

Hi all,

Sorry, I made changes to u-bs158 on Friday afternoon in order to expediently implement some tests suggested by Alejandro.

Suite u-bs369 has the packing error with job.out and job.err files:
/work/n02/n02/lre/cylc-run/u-bs369/log/job/20161101T0000Z/atmos_main000/01

In this suite I turned off all COSP diagnostics apart from 2,351 and also turned off all COSP simulators apart from 'cloudsat'.

Thanks,

Leighton

Changed 5 months ago by langtont

atmos_main job.err

Changed 5 months ago by langtont

atmos_main job.out

comment:18 Changed 5 months ago by grenville

It'd be helpful to see the job.out and job.err files for u-bs049

Changed 5 months ago by langtont

Actual atmos_main job.out

Changed 5 months ago by langtont

fcm_make_um job.out

comment:19 Changed 5 months ago by langtont

Hi Grenville,

I've attached the job.out and job.err files to the ticket. The job.err for the fcm_make_um was empty, and I accidentally double uploaded the atmos_main job.err.

Tom

comment:20 Changed 5 months ago by grenville

Tom

I've been looking in the wrong place (/home/d04/thola ?) - please enable group read rights on /home/d03/tlangton

chmod -R g+rX /home/d03/tlangton

Grenville

comment:21 Changed 5 months ago by grenville

and
chmod -R g+rX /projects/ukca-ox/tlangton - thanks

comment:22 Changed 5 months ago by langtont

Grenville,

Should be done for both. Let me know if there's any issues with this.

Tom

comment:23 Changed 5 months ago by grenville

Tom

Still can't get there

cd /projects/ukca-ox/tlangton/cylc-run/u-bs049
-bash: cd: /projects/ukca-ox/tlangton/cylc-run/u-bs049: Permission denied

Grenville

comment:24 Changed 5 months ago by grenville

It'd be helpful to see what STASHmaster was used in u-bs049.

comment:25 Changed 5 months ago by Leighton_Regayre

Hi Grenville,

Tom tells me the STASHmaster he used for u-bs049 is identical to the one I used in u-bs158 and u-bs369, which is:
fcm:um.xm_br/dev/luciadeaconu/vn11.1_vn11.1_ACURE_PPE_all_diagnostics/rose-meta/um-atmos/HEAD/etc/stash/STASHmaster@80573

This can be viewed on pumatest here as well:
/home/luciad/branches/vn11.1_vn11.1_ACURE_PPE_all_diagnostics/rose-meta/um-atmos/HEAD/etc/stash/STASHmaster

Thanks,

Leighton

comment:26 Changed 5 months ago by grenville

Please let me have access to /projects/ukca-ox/tlangton/cylc-run/u-bs049

comment:27 Changed 5 months ago by langtont

Hi Grenville,
Sorry for the delay; the chmod has taken a long time to process for some reason. The copy in my home directory should now be viewable (/home/d03/tlangton/cylc-run/u-bs049), and I'm running it on the /projects/ukca-ox/tlangton one now (although that might already be done if it's a symbolic link).

Sorry again for the delay,

Tom

comment:28 Changed 5 months ago by grenville

gmslis@xcslc0:~/roses/u-br688> cd /projects/ukca-ox/tlangton/cylc-run/u-bs049
-bash: cd: /projects/ukca-ox/tlangton/cylc-run/u-bs049: Permission denied

comment:29 Changed 5 months ago by langtont

tlangton@xcslc0 : /projects/ukca-ox/tlangton/cylc-run$ ls -alh

drwxr-sr-- 10 tlangton ukca-ox 4096 Mar 3 17:23 u-bs049

Permissions should be set? Could it be something to do with the s rather than the x?

Tom

comment:30 Changed 5 months ago by grenville

Do this:

chmod -R g+rX /projects/ukca-ox/tlangton

The world needs x to see inside directories.

comment:31 Changed 5 months ago by langtont

This was the command I used, but no luck. I think I've got around it by using chmod -R g-s to remove the setgid bit. It should be sorted now, sorry.

comment:32 Changed 5 months ago by Leighton_Regayre

Hi Grenville,

I have run some additional tests as advised by Alejandro Bodas-Salcedo (Met Office) which may be helpful.

Suite u-bs449 is a copy of the UKESM1-AMIP release job for ARCHER (u-bm251). The only change I made was to add the usage (UPWR) and domain (DTHSUBC) profiles required for the COSP diagnostic 2,351, but it once again crashes with the packing error for diagnostic 2,351.

These usage and domain profiles are used for the COSP diagnostics in Tom's Monsoon suite.

Does this help?

Thanks,

Leighton

comment:33 Changed 5 months ago by grenville

Hi Leighton

Jeff is looking at this problem. There is a strange interaction between 2,351 and 2,201 on ARCHER - the packing error is a red herring. Please be patient.

Grenville

comment:34 Changed 5 months ago by jeff

Hi Leighton

I've looked into your problem, and the packing error is caused by something that happens much earlier in the run. The basic problem is that an array is being written beyond its bounds.

In subroutine cosp_diagnostics, for diagnostic 2,351 (COSP: CLOUDSAT REFLECTIVITY), in the call to set_pseudo_list at line 997:

    CALL set_pseudo_list(sgradar%Ncolumns,len_stlist,                          &
             stlist(1,stindex(1,sc_code,sect,im_index)),                       &
             l_sr_levels,stash_pseudo_levels,num_stash_pseudo,icode,cmessage)

sgradar%Ncolumns has the value 64. In subroutine set_pseudo_list you have

SUBROUTINE set_pseudo_list                                        &
      (n_levels,len_stlist,stlist,pseudo_list,                    &
      stash_pseudo_levels,num_stash_pseudo,icode,cmessage)
                                   .
                                   .
                                   .
LOGICAL ::                                                        &
   pseudo_list(n_levels) ! OUT List of pseudo levels required.
                                   .
                                   .
                                   .
DO jlev=1,n_levels
  pseudo_list(jlev)= .FALSE.
END DO

So pseudo_list, whose actual argument is l_sr_levels, is declared with a size of n_levels = sgradar%Ncolumns = 64, and all of its elements are written. But in routine cosp_diagnostics you have

  LOGICAL :: l_sr_levels(SR_BINS)

and in module COSP_CONSTANTS_MOD you have

    integer,parameter :: SR_BINS       =   15

So a size-15 array is being written as if it were a size-64 array; this causes other variables to be overwritten and undefined things to happen. On ARCHER this has the effect of giving other COSP diagnostics values which are outside the expected range of the variable and so cannot be packed successfully. On Monsoon this didn't happen, for some reason.
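
To illustrate the pattern, here is a minimal standalone sketch (the names bounds_demo, fill_list etc. are made up for illustration, not the actual UM code):

    ! Minimal sketch of the bug: the dummy argument is sized by n_levels,
    ! but the actual argument is smaller, so the loop writes past the end
    ! of the real array and corrupts whatever sits next to it in memory.
    PROGRAM bounds_demo
      IMPLICIT NONE
      INTEGER, PARAMETER :: sr_bins  = 15   ! plays the role of SR_BINS
      INTEGER, PARAMETER :: ncolumns = 64   ! plays the role of sgradar%Ncolumns
      LOGICAL :: small_list(sr_bins)        ! plays the role of l_sr_levels

      CALL fill_list(ncolumns, small_list)  ! asks for 64 elements of a 15-element array

    CONTAINS

      SUBROUTINE fill_list(n_levels, pseudo_list)
        INTEGER, INTENT(IN)  :: n_levels
        LOGICAL, INTENT(OUT) :: pseudo_list(n_levels)
        INTEGER :: jlev

        DO jlev = 1, n_levels
          pseudo_list(jlev) = .FALSE.       ! elements 16 to 64 are out of bounds
        END DO
      END SUBROUTINE fill_list

    END PROGRAM bounds_demo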

Maybe the solution is to increase the size of SR_BINS, but as I'm not an expert in this code there may be other things which need to be taken into account.

Also, there may be other similar problems lurking in this code, so it might be a good idea to run with array bounds checking on.

Jeff.

comment:35 Changed 5 months ago by Leighton_Regayre

Hi Jeff,

Excellent detective work - thank-you!

I don't know much about the COSP code either, but it looks like SR_BINS=15 in my copy of the UKESM1 vn11.1 release job code. I'll pass on the details of what you have discovered to our project team and collaborators so that the problem can be addressed properly.

I'm not sure what you mean by running with 'array bounds checking' on. How do we turn that on and what is its effect?

Thanks again for the insights,

Leighton

comment:36 Changed 5 months ago by jeff

Hi Leighton

To enable array bounds checking you need to compile the code with the Cray Fortran compiler option -Rb. To do this, add the option to fcm_make_um -> Advanced compilation -> fcflags_overrides in rose edit.
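
For illustration only, here is a toy program (not UM code) showing the kind of error -Rb traps at run time; assuming the Cray ftn wrapper, it could be built standalone with something like ftn -Rb demo.f90:

    ! Toy example of what -Rb catches: a subscript outside the declared bounds.
    PROGRAM rb_demo
      IMPLICIT NONE
      INTEGER :: a(15)
      INTEGER :: i

      DO i = 1, 64
        a(i) = i      ! iterations 16 to 64 are out of bounds; -Rb aborts here at run time
      END DO

      PRINT *, a(1)
    END PROGRAM rb_demo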

I decided to try this out and it did find the problem, but it also found what I hope are false positives, so you get a lot of output to search through. This problem was the only one found in the COSP code. You can see the output from a short 6-hour run here: /home/n02/n02/jwc/cylc-run/u-bs369_2/job.err_arrayboundschecking.

Jeff.

comment:37 Changed 3 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed