Opened 8 years ago

Closed 8 years ago

#828 closed defect (fixed)

MPPIO error when trying to read start dump

Reported by: swr05npk Owned by: annette
Component: UM Model Keywords: startdump, mppio, aquaplanet, read
Cc: Platform:
UM Version: 7.8

Description

I am attempting to run an aqua-planet job at version 7.8 on HECToR, using the N96L85 resolution. The job is xhccb.

After reconfiguring from an existing aqua-planet start dump (at N48L70), when I try to run the model every processor prints the error message

Opening unit  21 with collective(broadcast) semantics
   Read Only mode
 ************************* IO Error Report ***************************************
Unit Generating error=   21
 ---File States --------------------------
Unit  21 open on filename /work/n02/n02/pappas/um/xhccb/xhccb.astart
  --> Opened from environment variable:ASTART
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
 ---End File States ----------------------
 *******************************************************************************
 UM WARNING :
 Routine generating warning: mppio:file_open
 Warning code:  -12
 Warning message: 
Unit already open - attempting close
 *******************************************************************************
 MPPIO: Checking consistency of unit open request...

The output from processor 1 is attached. You can view the start dump on HECToR at

/work/n02/n02/pappas/um/xhccb/xhccb.astart

The model is running from the executables

/work/n02/n02/pappas/um/xhcca/bin/xhcca.exe
/work/n02/n02/pappas/um/xhcca/bin/qxreconf

To me, it looks as though the model is attempting to open the start dump when it is already open, generating an error. Aside from running in an aqua-planet configuration, the only other out-of-the-ordinary thing I have done with this job is to reset the time in the start dump from 12Z on 12/5/2005 to 12Z on 1/5/2005 to make dumping and meaning easier.

Thanks for any assistance you can provide.

Attachments (1)

xhccb.fort6.pe1.gz (22.0 KB) - added by swr05npk 8 years ago.

Change History (14)

Changed 8 years ago by swr05npk

comment:1 Changed 8 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Nick,

Could you give me read permission on the work and home directories? Run

chmod -R g+rx .

(note the dot) on both directories. There is a STASH error,

Level list invalid in PSEUDO panel of Domain Profile 'DICECAT' (Edit Profile in window atmos_STASH)
Variable: PSLIST_A(*,17)

→ Model Selection

→ Atmosphere

→ STASH

→ STASH. Specification of Diagnostic requirements

→ Domain profile window, 2

which can be corrected by deleting the extra levels.

Regards,

Willie

comment:2 Changed 8 years ago by swr05npk

Hi Willie,

I think the permissions problems should be fixed now.

I removed the additional levels in that STASH domain profile, but that did not correct the problem with reading the start dump.

Thanks,
Nick

comment:3 Changed 8 years ago by willie

Hi Nick,

I found the output of an aqua-planet job at the Met Office that contains these messages even though the job works. The real problem is the segmentation fault: the core file indicates it is occurring in mpl_waitall.f90 (line 48). The only thing I can suggest at the moment is that you select the debug level of optimisation (Compile Options for Model page), rebuild your code and try again.

Regards,

Willie

comment:4 Changed 8 years ago by swr05npk

Hi Willie,

I have compiled in the debugging symbols, as you suggested. I can't make heads or tails of the backtrace that I get from gdb, though:

(gdb) backtrace
#0  0x0000000001a86438 in tc_malloc ()
#1  0x000000000182d14f in ckalloc (size=8) at avl.c:104
#2  0x000000000182d301 in new_node (size=8, data=<value optimized out>) at avl.c:129
#3  avl_insert (size=8, data=<value optimized out>) at avl.c:404
#4  0x000000000182d1ea in avl_insert (data=0x7ffffd61be80, size=8, rootp=0x10060100, compar=0x182a066 <udreg_vma_compare>) at avl.c:415
#5  0x000000000182d333 in avlins (data=0x7ffffd61bf38, tree=0x10060100) at avl.c:822
#6  0x000000000182bad7 in vma_new (end=<value optimized out>, start=<value optimized out>, cache=<value optimized out>) at udreg_core.c:249
#7  dreg_insert (end=<value optimized out>, start=<value optimized out>, cache=<value optimized out>) at udreg_core.c:441
#8  dreg_new_entry (end=<value optimized out>, start=<value optimized out>, cache=<value optimized out>) at udreg_core.c:756
#9  UDREG_Register (end=<value optimized out>, start=<value optimized out>, cache=<value optimized out>) at udreg_core.c:1029
#10 0x00000000017f25df in MPID_nem_gni_dreg_register ()
#11 0x0000000000000000 in ?? ()

I guess it's an error in the MPI routines somewhere, but I don't know exactly where or how to fix it.

There is a core file sitting in /work/n02/n02/pappas/um/xhccb/core. It was generated by /work/n02/n02/pappas/um/xhcca/bin/xhcca_g.exe.

Cheers,
Nick

comment:5 Changed 8 years ago by willie

Hi Nick,

I was rather hoping this would solve the problem. You need to tick the "Compile Model executable" button. If this has been done, could you then give me read permission on the core file?

Regards,

Willie

comment:6 Changed 8 years ago by swr05npk

Hi Willie,

Yes, I recompiled the executable with the debugging symbols included. This is /work/n02/n02/pappas/um/xhcca/bin/xhcca_g.exe.

I have corrected the permissions on the new core file (/work/n02/n02/pappas/um/xhccb/core). Sorry about that - thought I had done it already!

Thanks, Nick

comment:7 Changed 8 years ago by willie

Hi Nick,

It has a segmentation fault in exactly the same place as before. Pursuing the optimisation route further, could you add the compile override ~willie/overrides/hector_cce_reduced_opt in the user file overrides and tick the "compile model executable" button. This reduces the optimisation to -O0 and may make it work. Recompile and run. Unfortunately, you have to change the permissions on the core file every time it is created.

Failing this, you should revert to a known operational aqua-planet run and proceed from there.
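Because a new core file is written on every crash, the permission change has to be repeated each time. It can be wrapped in a small helper. This is only a sketch: the function name share_core is made up for illustration, and it assumes the enclosing directories are already group-readable and searchable (for example via the chmod -R g+rx . mentioned earlier in this ticket).

```shell
# Hypothetical helper: make a freshly written core file readable by the group.
# Usage: share_core /path/to/core
share_core() {
    core_file="$1"
    if [ -f "$core_file" ]; then
        # Add group read permission only; other permission bits are left alone.
        chmod g+r "$core_file"
    else
        echo "no core file at $core_file" >&2
        return 1
    fi
}
```

For the job in this ticket the call would be share_core /work/n02/n02/pappas/um/xhccb/core, run after each failed model run.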

regards

Willie

comment:8 Changed 8 years ago by swr05npk

Hi Willie,

Reducing the optimization to -O0 made no difference; the model still crashes in the same place.

I wish I had a "known operational" aqua-planet configuration! No one on puma has tried to run a version 7.x aqua-planet job (according to a UMUI search).

I have received an aqua-planet job from the Met Office at version 7.9, however, which I uploaded into the UMUI (as job xhccd). We do not have version 7.9 on HECToR, unfortunately, so I had to upgrade the job to version 8.0 (xhcce).

This job, xhcce, fails with a different error message:

An error occured inside the MPI library during an operation
 on the IOS<->Atmos communicator for normal ops
 IOS_MPI_ERROR: MPI_COMMUNICATOR= 1140850688 MPI_ERROR_CODE= 67756546  aborting...

There's a core file on HECToR at /work/n02/n02/pappas/um/xhcce/core, generated by /work/n02/n02/pappas/um/xhcce/bin/xhcce_g.exe.

I have already tried re-compiling with the debugging level of optimisation, but to no effect.

Let me know if you want me to open a separate ticket for this error, but I think that there's something going fundamentally wrong with the MPI implementation in the model here.

comment:9 Changed 8 years ago by willie

Hi Nick,

I think upgrading a vn7.9 job to vn8.0 by yourself is a difficult route to take. It would be better to go back to the original source and get either a 7.8 version or an 8.0 version. There is certainly a 7.8 version at the Met Office, but I have no idea of its provenance.

Regards,

Willie

comment:10 Changed 8 years ago by annette

  • Owner changed from willie to annette

Hi Nick,

Returning to the original error you had with xhccb, I'm not sure whether the optimisations were switched off properly. For this 7.8 job you need to use a slightly different compile override (the compile variables are not the same at different UM vns). This one worked for me:

/home/annette/hadgem3/overrides/hector_cce_7.8_O0

To see what compile options were used you can check the .comp.leave file or the ummodel/cfg/bld.cfg file in the build directory. Be careful about deleting a compilation job mid-way through as this can confuse FCM when editing compile flags. In this case it is best to let it run out or delete the build directory and start again with a clean build.
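A quick way to check which optimisation level a build actually used is to search the build configuration for optimisation flags. The sketch below assumes the flags are written in the usual -O0 / -O1 / -O2 / -O3 form. The function name show_opt_flags and the BUILD_DIR variable in the usage note are placeholders, not part of the UM or FCM.

```shell
# Hypothetical helper: list the lines of a build configuration file that set
# an optimisation level, prefixed with their line numbers.
# Assumption: optimisation flags appear as -O followed by a digit.
show_opt_flags() {
    cfg_file="$1"
    # -e is needed so grep does not treat the pattern as an option.
    grep -n -E -e '(^|[[:space:]])-O[0-9]' "$cfg_file"
}
```

For example, show_opt_flags "$BUILD_DIR/ummodel/cfg/bld.cfg", with BUILD_DIR set to the job's build directory. If any line still shows a level other than -O0, the override was not picked up.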

Also even with optimisations off, gdb will still sometimes not give helpful information. In this case you can use totalview which is a much more sophisticated GUI debugger. The options for debugging a core file are the same:

totalview exec core

Annette

comment:11 Changed 8 years ago by annette

Hi Nick,

The error in your 7.8 job (xhccb) comes from setcona calling swap_bounds_mv with variables r_at_u and r_at_v (which are the heights at u and v points on rho levels). So the MPPIO prints in the output file are a bit of a red herring as it actually fails further along. It's not clear to me what causes the segmentation fault though.

I have had more success with the 8.0 job (xhcce). I tried running this on MONSooN and got a couple of errors:

  1. Invalid count (-8) in MPI_Recv
    This occurred in the filtering, since some of the settings need to be amended for L85. It was fixed by using the settings from the 7.8 HG3-A L85 job, found under: Section by section choices → Diffusion, filtering and moisture resetting
  2. Error in vert_eng_massq: should not be callable
    This routine is called by code calculating section 30 diagnostics. The error was fixed by switching section 30 diagnostics off in STASH (there were only 2).

Porting the job back to HECToR it now runs for 3 days. My version of your 8.0 job is xhdpe. If you want to use this job, there are some standard VN8.0 HadGEM3-A N96L85 configurations that you could perhaps integrate with:

http://collab.metoffice.gov.uk/twiki/bin/view/Project/CAPTIVATE/HadGEM3Evolution

There are versions of some of these on puma (for MONSooN and HECToR) - for example search under Oliver's username odarbysh.

Best wishes,

Annette

comment:12 Changed 8 years ago by swr05npk

I think I have now solved this issue, with Annette's help.

The crash in swap_bounds_mv is caused by an error in the energy-correction initialisation routine. There are two extra arguments in the call to eng_mass_diag from init_emcorr. The arguments passed are r_theta_levels and r_rho_levels, which is likely why the core dump pointed to r_at_u and r_at_v.

Removing these two erroneous arguments allows the model to run correctly. This bug affects not just aqua-planet runs, but any run started from a dump that does not contain the total-energy and total-mass diagnostics. Most climate dumps would have these variables, so most users would never encounter this bug. The aqua-planet dump that I was provided does not contain these fields, though, which triggers the bug.

There is already a Met Office ticket and branch to fix this bug (3816 in the Met Office database). The branch to fix it is available on puma as

fcm:um_br/dev/malcolm/VN8.0_energy_corr_init_fix_ukmo

When I include this branch, the job runs fine. This is now job xhccj in the UMUI, in case anyone else comes looking for a GA3.0 aqua-planet job.

The bug is present in all vn7.x code that I examined, though, so it's possible that another user may still run into this bug when using an older model version. There are no branches (that I can find) to fix it in these older versions. It's a simple fix, though; it involves removing just one line of code from init_emcorr.
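For anyone applying the fix by hand to an older version, the call site can be located with a recursive search of the extracted source. This is a rough sketch only: the function name find_eng_mass_diag_calls is made up, the source directory is whatever your extract uses, and the five lines of trailing context are an arbitrary choice to show the continuation lines of the argument list. The -r and -A options are assumed to be available, as they are in GNU and BSD grep.

```shell
# Hypothetical helper: print every call to eng_mass_diag under a source
# directory, with trailing context so the full argument list is visible.
# The search is case-insensitive because Fortran source may be upper or lower case.
find_eng_mass_diag_calls() {
    src_dir="$1"
    grep -r -n -i -A 5 -E 'call[[:space:]]+eng_mass_diag' "$src_dir"
}
```

The call made from init_emcorr is the one to inspect for the r_theta_levels and r_rho_levels arguments described above.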

I think we can go ahead and close this ticket now, finally.

comment:13 Changed 8 years ago by annette

  • Resolution set to fixed
  • Status changed from accepted to closed