Opened 7 years ago

Closed 7 years ago

#991 closed help (fixed)

Error during reconfiguration for UKCA aerosol job on HECToR

Reported by: gmann
Owned by: willie
Component: UM Model
Keywords: ukca
Cc: mdalvi
Platform: HECToR
UM Version: 7.3

Description

Dear UM helpdesk,

I am (still) trying to port an N96L63 HadGEM3-A-r2.0 UKCA aerosol job from MONSOON to HECToR. My job xhwrc is the HECToR equivalent of Mohit Dalvi's job xfsju.

The job compiles OK but is failing during reconfiguration.

It seems to be some kind of memory problem — the error message is shown below and I've attached the log file for info.

I checked against one of Paul Telford's UKCA jobs running on HECToR and noticed that my job had the section 94, 95, 96, 97 options set to 1A whereas Paul's were set to 1C. I tried a re-run with those switched to 1C but got the same error.

Do I have something set incorrectly for the MPI-comms on HECToR?
Or perhaps I just need to increase the memory settings somewhere?

I do have a lot of diagnostics requested in the job, so maybe there is just less memory available at run time on HECToR than on MONSOON. Do you think it is worth trying again with the diagnostic requests reduced, or is that unlikely to be the problem?

Any help/suggestion you can give is much appreciated.

The full log file is attached for info.

Thanks in advance for your help,

Cheers
Graham


Rank 0 [Fri Dec 7 16:17:29 2012] [c9-0c0s3n0] Fatal error in MPI_Testall: Other MPI error, error stack:
MPI_Testall(250)...................: MPI_Testall(count=255, req_array=0x7ffffff14010, flag=0x7ffffff13fdc, status_array=0x7ffffff1440c) failed
MPIDI_CH3I_Progress(374)...........:
MPID_nem_mpich2_test_recv(813).....:
MPID_nem_gni_poll(1477)............:
MPID_nem_gni_check_recvCQ(1347)....:
MPID_nem_gni_process_recv(1252)....:
MPID_nem_handle_pkt(653)...........:
MPIDI_CH3_PktHandler_EagerSend(744): Failed to allocate memory for an unexpected message. 563038 unexpected messages queued.
[NID 01926] 2012-12-07 16:17:29 Apid 3184166: initiated application termination
/work/n02/n02/gmann/um/xhwrc/bin/qsexecute: Error in dump reconfiguration - see OUTPUT


Change History (15)

comment:1 Changed 7 years ago by gmann

I can't seem to attach the file as the maximum attachment size is only ~250K. Anyway, the log files can be found on HECToR at:

/home/n02/n02/gmann/um/umui_out/xhwrc000.xhwrc.d12342.t143820.leave

/home/n02/n02/gmann/um/umui_out/xhwrc000.xhwrc.d12342.t170237.leave

comment:2 Changed 7 years ago by willie

Hi Graham,

The reconfiguration is working fine; the problem is with the model. There are some issues that need to be resolved first:

  • switch on subroutine timer diagnostics
  • select extra diagnostics and switch on STASH messages
  • you have resubmit set with no job time, so go to Input/Output Control and Resources and specify a job time (I used 3600 for the sake of a value)
  • switch reconfiguration on
  • delete the time profile T24HDMRV as it is inconsistent
  • in Section 13 diagnostic prints switch on flush buffer and print every time step

This leaves a problem with Mohit's hand edit std/rdaev2_strat…, which references MONSooN files, e.g. /projects/ukca/inputs/sepctral/nml_ac_sw. This needs to be fixed.

I hope that helps,

Regards

Willie

comment:3 Changed 7 years ago by gmann

Hi Willie,

Thanks a lot for this.

It's strange because the job ran OK with these settings on MONSOON.

I guess points 1 and 2 are not the cause of the crash…

I've set the Resources for step 3 as you suggested, but again, as this was only an NRUN I guess it should still have run the 1-month step.

Re: your 4th point "switch reconfiguration on" — where should I do this in the UMUI?

Which bit of this was the cause of the crash?

Thanks again for your help,

Cheers
Graham

comment:4 Changed 7 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Graham,

I'm not sure that I've solved "the" problem, just taken some steps towards getting there. With these settings I was able to complete reconfiguration without error (although Grenville is still having problems, so it may be intermittent) and get started on the model. The reconfiguration tick box is on the start dump page.

Regards,

Willie

comment:5 Changed 7 years ago by gmann

Hi Willie

I tried re-running the job xhwrc with your suggested changes applied. I also copied the missing spectral files for RADAERv2 over to HECToR and made a new hand-edit that sets the HECToR path, rather than using Mohit's hand-edit, which had the MONSOON path (/projects/ukca/mdalvi/xxxx) set.

Note that although you suggested I "turn reconfiguration on", it looks like it was already turned on in the UMUI job, so I didn't do that step.

When I re-ran, I still got the same strange MPI failure…

Rank 0 [Tue Dec 11 11:54:12 2012] [c5-1c0s5n0] Fatal error in MPI_Testall: Other MPI error, error stack:
MPI_Testall(250)...................: MPI_Testall(count=255, req_array=0x7ffffff14950, flag=0x7ffffff1491c, status_array=0x7ffffff14d4c) failed
MPIDI_CH3I_Progress(374)...........:
MPID_nem_mpich2_test_recv(813).....:
MPID_nem_gni_poll(1477)............:
MPID_nem_gni_check_recvCQ(1347)....:
MPID_nem_gni_process_recv(1252)....:
MPID_nem_handle_pkt(653)...........:
MPIDI_CH3_PktHandler_EagerSend(744): Failed to allocate memory for an unexpected message. 587812 unexpected messages queued.
[NID 02452] 2012-12-11 11:54:12 Apid 3218358: initiated application termination
/work/n02/n02/gmann/um/xhwrc/bin/qsexecute: Error in dump reconfiguration - see OUTPUT

That xhwrc job still has the same 16x16 domain decomposition (256 cores) that I was using on MONSOON-2.

I guess this may need to be reduced on HECToR. How many processors do folks tend to use for HadGEM-A on HECToR, and what are the best settings for E-W and N-S processors?

Also, one thing has occurred to me: the job requires small executables to run the RADAERv2 section at run time, and I'm wondering whether the crash has something to do with this.

Mohit Dalvi (JWCRP post for UKCA at the MO) produced scripts on PUMA that build these using FCM.

See my directory "/home/gmann/test" on PUMA; there are four scripts:

build_qxcombine.sh
build_qxhistrep.sh
build_qxpickup.sh
build_qxsetup.sh

These I copied from Mohit's directory "/home/mdalvi/test/".

I'm wondering whether there could be an issue with the way these have been produced — perhaps the compiler settings are incompatible with HECToR or similar?

Thanks for your help

Cheers
Graham

comment:6 Changed 7 years ago by willie

Hi Graham,

I'm not getting these errors. I think it may be due to the module environment. After running loadcomp, I do

module swap PrgEnv-cray PrgEnv-cray/4.0.46
module swap xt-asyncpe xt-asyncpe/5.11

in my .profile. This is non-standard for the UM. With this and your changes, the model is still crashing:

lib-4190 : UNRECOVERABLE library error 
  A numeric input field contains an invalid character.

Encountered during a sequential formatted READ from unit 162
Fortran unit 162 is connected to a sequential formatted text file:
  "/work/n02/n02/gmann/RADAERv2/pcalc_hadgem_v2.ukca"
 Current format: (37x,3(i,1x))

However, this may be symptomatic of a deeper issue.

Regards,

Willie

comment:7 Changed 7 years ago by gmann

I added the two "module swap" commands to my .profile after the line with "loadcomp".
I then tried submitting again, but it still gives a similar error message, shown below.

Does your test job use the same domain decomposition settings etc.?

Perhaps you could point me to the .profile you are using so I can compare it with mine; it could be something in there that I've not quite got set right.

Thanks a lot for your help,

Cheers
Graham

Rank 0 [Tue Dec 11 20:00:56 2012] [c1-0c0s4n0] Fatal error in MPI_Testall: Other MPI error, error stack:
MPI_Testall(250)...................: MPI_Testall(count=255, req_array=0x7ffffff13ff0, flag=0x7ffffff13fbc, status_array=0x7ffffff143ec) failed
MPIDI_CH3I_Progress(374)...........:
MPID_nem_mpich2_test_recv(813).....:
MPID_nem_gni_poll(1477)............:
MPID_nem_gni_check_recvCQ(1347)....:
MPID_nem_gni_process_recv(1252)....:
MPID_nem_handle_pkt(653)...........:
MPIDI_CH3_PktHandler_EagerSend(744): Failed to allocate memory for an unexpected message. 1048073 unexpected messages queued.
[NID 02696] 2012-12-11 20:00:57 Apid 3224016: initiated application termination
/work/n02/n02/gmann/um/xhwrc/bin/qsexecute: Error in dump reconfiguration - see OUTPUT

comment:8 Changed 7 years ago by willie

Hi Graham,

My .profile should be visible at /home/n02/n02/wmcginty/.profile. My version of your experiment is xhzp, user 'willie'.

regards

Willie

comment:9 Changed 7 years ago by gmann

Hi Willie,

I tried using your .profile. Initially I got an error complaining that it could not find the file "$HOME/.ssh/ssh_agent_setup", so I commented out the section of the script with the "if [ -z "$PBS_ENVIRONMENT" ]" conditional.

Then the job submitted OK but failed with (pretty much) the same MPI error I was getting
when I used my usual .profile.

So I still can't get my job to run correctly on HECToR (although it is running fine on MONSOON).

Once I get as far as the "A numeric input field contains an invalid character." error you're seeing, I can fix that.

I note that you said Grenville was having the same problem as me in getting the job running.

Do you have any other suggestion for what could be different between your runs and mine/Grenville's that is causing this discrepancy?

Thanks for your help with this,

Cheers
Graham

comment:10 Changed 7 years ago by gmann

Hi Willie,

Grenville found he had to reduce the PEs for the reconfiguration to 32 to get it to complete.

When I did this, the reconfiguration completed OK and the model now starts to run, but it fails with the read format error you posted above, i.e. when reading:

/work/n02/n02/gmann/RADAERv2/pcalc_hadgem_v2.ukca

So this is progress!

And I know how to fix that above…

Thanks for your help

Cheers
Graham

comment:11 Changed 7 years ago by gmann

Hi Willie,

To update on the above.

I fixed the issue with the read statement within RADAER. There's a line of code reading from that pcalc file that fails on HECToR but completed OK on MONSOON. The failure occurs when trying to read a line that specifies the dimensions of the UKCA look-up tables for the Mie calculations, e.g.:

UKCA accum. aerosol LUT dimensions: 51, 51, 51

which are in the file:

/work/n02/n02/gmann/RADAERv2/pcalc_hadgem_v2.ukca

There is a formatted read statement in "ukca_radaer_read_precalc.F90" that does this, with the format set to '(37x,3(i,1x))'.

Basically, that format statement reads OK on MONSOON but not on HECToR. (A bare 'i' edit descriptor with no field width is a vendor extension rather than standard Fortran, which would explain why the two compilers disagree.)

I changed that format statement in that routine in my branch on PUMA to be more explicit, setting it to '(37x,1i3,1x,1i3,1x,i)'.

That then reads in the info correctly on HECToR and the job proceeds past that point.
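
For reference, here is a minimal standalone sketch of that read. The variable names and field widths are illustrative assumptions, not taken from ukca_radaer_read_precalc.F90 itself:

  ! Minimal sketch of the formatted read discussed above (assumed
  ! context; variable names are hypothetical). The bare 'i' edit
  ! descriptor in the original '(37x,3(i,1x))' has no field width,
  ! which is a vendor extension, so compilers disagree on it.
  program read_lut_dims
    implicit none
    integer :: nd1, nd2, nd3   ! the three UKCA LUT dimensions
    integer :: ios

    open (unit=162, file='pcalc_hadgem_v2.ukca', status='old', &
          action='read', iostat=ios)
    if (ios /= 0) stop 'cannot open pcalc file'

    ! Giving every descriptor an explicit width keeps the read portable;
    ! the widths here would need to match the file's actual layout.
    read (162, '(37x,i3,1x,i3,1x,i3)', iostat=ios) nd1, nd2, nd3
    if (ios /= 0) stop 'read of LUT dimensions failed'

    print *, 'UKCA LUT dimensions:', nd1, nd2, nd3
    close (162)
  end program read_lut_dims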

However, the job then crashes with a segmentation fault in Atm_Step, before even reaching UKCA (i.e. before the call to UKCA_MAIN).

Willie, do you have any idea what could be causing the crash here? I guess it could be a memory issue. This job is only running on 32 processors, so each core will be operating on quite a large sub-domain and the memory footprint per core will be higher.

Thanks for any help you can give here,

Cheers
Graham

comment:12 Changed 7 years ago by willie

Hi Graham,

Did you "include modifications from working copy" and remember to switch off the branch fcm:um_br/dev/gmann/vn7.3_HG3r2_mergCJ_nprim_Radv2_HECToR/src?

It's a good idea to delete any core files that exist before starting a new run.

Regards,

Willie

comment:13 Changed 7 years ago by gmann

Hi Willie,

I switched off "include modifications from working copy" because I committed the change to that read format statement in "ukca_radaer_read_precalc.F90" back to my branch "vn7.3_HG3r2_mergCJ_nprim_Radv2_HECToR" (at r10588). I then switched back on the option to use the FCM repository branch:

fcm:um_br/dev/gmann/vn7.3_HG3r2_mergCJ_nprim_Radv2_HECToR/src

with the revision number updated to that latest revision, r10588.

From the runs, it looks like we need to do the reconfiguration with a 4x8 domain decomposition (32 cores) but run the actual model with an 8x16 PE configuration (128 cores, which uses less memory on each node at runtime).

Mohit explained that at v7.3 you can't simply compile, link and run with the reconfiguration set to a different PE configuration from the model run (although apparently you can at versions after v7.5).

He explained however, that you can do this at v7.3 if you submit the job in 2 steps.

So I followed his advice: first, submit the job with the Reconfiguration option "Perform the reconfiguration step only" selected. The job should be set up as required, with reconfiguration as 4x8 and the run as 8x16, but it won't try to run yet.

That submitted job will then reconfigure (with the .astart file produced) but will not try to run the model…

He advised that once that is done, one can set the job back to normal (deselect the "Perform the reconfiguration step only" button) but change the "initial start dump" to point to the .astart file generated in the first step, i.e. for my run change it to point to:

xhwrc.astart

in the directory:

/work/n02/n02/gmann/um/xhwrc/

I tried this last night and it does get further now: it reaches the UKCA code and goes through the whole of UKCA_MAIN, but it is crashing at the end of UKCA_MAIN with an "Attempt to free invalid pointer" error message, as below. I'm not sure whether this is a deallocate statement that is not quite set correctly, or whether it is still having some memory problems. It could possibly be something that gets set or allocated in the chemistry or aerosol modules (UKCA_CHEMISTRY_CTL or UKCA_AERO_CTL), which are only called once every 3 timesteps; maybe that could be the cause.

Any suggestion on how to isolate the cause?
Or other advice or suggestion for how best to proceed to fix this?

Cheers
Graham

craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
craylibs/google-perftools/src/tcmalloc.cc:569] Attempt to free invalid pointer: 0x2020202020202020
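
For reference, 0x2020202020202020 is eight ASCII space characters, which usually means blank character data has been written over the memory holding a pointer (e.g. via an out-of-bounds write) before the deallocate, rather than a simple double free. Below is a minimal sketch of one way to localise which deallocation trips the abort; the routine and array names are hypothetical, not from UKCA_MAIN:

  ! Hypothetical debugging sketch: announce each deallocate on stderr so
  ! that, when tcmalloc aborts, the last message identifies the array
  ! whose descriptor has been corrupted. Names are illustrative only.
  subroutine tidy_ukca_arrays(chem_work, aero_work)
    use iso_fortran_env, only: error_unit
    implicit none
    real, allocatable, intent(inout) :: chem_work(:), aero_work(:)

    if (allocated(chem_work)) then
      write (error_unit, *) 'deallocating chem_work'
      deallocate (chem_work)
    end if
    if (allocated(aero_work)) then
      write (error_unit, *) 'deallocating aero_work'
      deallocate (aero_work)
    end if
  end subroutine tidy_ukca_arrays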

comment:14 Changed 7 years ago by willie

Hi Graham,

After some difficulty, I managed to get the same errors as you have reported. If I select the debug option, the code crashes in copystash before it reaches UKCA. This indicates a lack of robustness in this set up.

The abort occurs at the end of the first time step, after the completion of UKCA main, in a perfectly innocent bit of code. This suggests that the problem occurred earlier, perhaps through an array being overwritten or because some MPI process has not completed properly.

I notice that the hand edit ~mdalvi/umui_jobs/hand_edits/std/HG3_r2_base.ed gives errors (the "?"s in the panel that appears when you push the process button in the UMUI). This might be worth investigating.

I'm afraid that I've run out of ideas on this.

Regards,

Willie

comment:15 Changed 7 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed