Opened 7 years ago

Closed 7 years ago

#1026 closed help (fixed)

Problem with IOS crashing

Reported by: pclark
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: MONSooN
UM Version: 8.1

Description

I'm running xieza on MONSooN. It runs OK on one node (2x16 proc) but ran out of CPU time with a 10000 s limit. I therefore decided to run with 2 nodes (4x16 proc). It dies instantly with an error in the IOS code:

Traceback:

Offset 0x00000010 in procedure xltrbk_
Offset 0x00000188 in procedure ios_mpi_error_NMOD_ios_mpi_error_handler_, near line 28 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/io_services/common/ios_mpi_error.f90
Offset 0x00000228 in procedure _do_error
Offset 0x00000034 in procedure do_mpci_error
Offset 0x000007a8 in procedure MPIBsend
Offset 0x0000006c in procedure mpi
bsend
Offset 0x0000005c in procedure mpl_bsend_, near line 56 in file /home_proj_work/home/nwp/nm/frml/GCOM4.1/meto_ibm_pwr6_mpp/ppsrc/gcom/mpl/mpl_bsend.f90
Offset 0x00000078 in procedure gc_rsend_, near line 158 in file /home_proj_work/home/nwp/nm/frml/GCOM4.1/meto_ibm_pwr6_mpp/ppsrc/gcom/gc/gc_rsend.f90
Offset 0x00000948 in procedure idl_random_perturb_, near line 230 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/atmosphere/dynamics_advection/idl_random_perturb.f90
Offset 0x00008b64 in procedure idl_initial_data_, near line 958 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/atmosphere/dynamics_advection/idl_initial_data.f90
Offset 0x00001194 in procedure idl_ni_init_, near line 951 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/atmosphere/dynamics_advection/idl_ni_init.f90
Offset 0x0005ba54 in procedure atm_step_, near line 4861 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/control/top_level/atm_step.f90
Offset 0x00152878 in procedure u_model_, near line 3708 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/control/top_level/u_model.f90
Offset 0x00001f70 in procedure um_shell_, near line 2258 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/control/top_level/um_shell.f90
Offset 0x00000090 in procedure flumemain, near line 46 in file /projects/dymecs/paclar/xiezc/umatmos/ppsrc/UM/control/top_level/flumeMain.f90
--- End of call chain ---

ERROR: 0031-250 task 0: IOT/Abort trap

At the end of the output I have:

An error occured inside the MPI library during an operation
on the IOS<->Atmos communicator for normal ops
IOS_MPI_ERROR: MPI_COMMUNICATOR= 0 MPI_ERROR_CODE= 165 aborting…

I have to confess that IOS is new to me and the UMUI page is one of the more obtuse.
Full output is at xieza000.xieza.d13050.t170459.leave

Any help gratefully received.

Change History (3)

comment:1 Changed 7 years ago by grenville

Hi Peter

Please take a copy of /home/grenville/um_vn7.1/VN7.1_ideal/src/atmosphere/dynamics_advection/idl_random_perturb.F90 and use that instead of the one you have. It scatters the data in a different way that doesn't cause the bsend error. Chris Holloway had the same problem recently (I had it several years ago) and this code fixed it for both of us (I can't remember who told me about this solution). My code seems to have an extra argument (g_datastart) in the argument list of IDL_random_perturb which is not present in the vn 8.1 version, but it's not used anywhere in the body of the routine, so it can safely be removed.

Grenville
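
For anyone reading this later: the traceback ends in MPI_Bsend, which copies each message into a buffer attached by the caller. The MPI standard sizes that buffer as the sum, over all pending messages, of the packed message size plus MPI_BSEND_OVERHEAD. The sketch below only illustrates that arithmetic. It is not a diagnosis of this run: the overhead value, the payload size, and the idea that one task sends to every other task are all assumptions chosen for illustration.

```python
def bsend_buffer_bytes(payload_sizes, overhead_per_message):
    """Bytes of attached buffer needed to hold every pending buffered send.

    payload_sizes: packed size in bytes of each message still pending.
    overhead_per_message: stand-in for MPI_BSEND_OVERHEAD, whose real
        value is implementation-defined.
    """
    return sum(size + overhead_per_message for size in payload_sizes)


# Hypothetical numbers, chosen only to show how the requirement scales.
ASSUMED_OVERHEAD = 96        # assumption: not the real value on any MPI
ASSUMED_PAYLOAD = 8 * 1024   # assumption: bytes sent to each receiving task

# Assumption: one task sends one message to each of the other tasks.
one_node = bsend_buffer_bytes([ASSUMED_PAYLOAD] * 31, ASSUMED_OVERHEAD)   # 32 tasks
two_nodes = bsend_buffer_bytes([ASSUMED_PAYLOAD] * 63, ASSUMED_OVERHEAD)  # 64 tasks

print(one_node, two_nodes)
```

Under these assumed numbers the buffer needed roughly doubles when the task count goes from 32 to 64, so an attached buffer that was large enough on one node may not be on two. A collective scatter avoids the user-attached buffer altogether.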

comment:2 Changed 7 years ago by pclark

Thanks Grenville

I needed to make a few minor additions to work with 8.1 code, but now all's well.

I've created a branch
fcm:um_br/dev/pclark/vn8.1_idealized/src at revision [11147]
if anyone else needs it. I'll liaise with Carol Halliwell to get this into the Met Office trunk at some stage.

comment:3 Changed 7 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed