Opened 7 days ago

Last modified 30 hours ago

#2493 reopened help

Error in buffin errorCode= 3.

Reported by: ggxmy
Owned by: um_support
Priority: high
Component: UM Model
Keywords: buffin
Cc: g.w.mann@…, j.marsham@…
Platform: ARCHER
UM Version: 8.2

Description

Hi. I'm trying to run my UM vn8.2 limited area job tewnb, which I created based on xlhub. It crashes about a minute after submission, and /home/n02/n02/masara/output/tewnb000.tewnb.d18163.t093415.leave contains the following information. Near the top are messages like these:

???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: mppio:buffinApplication 31111189 is crashing. ATP analysis proceeding...

? Error Code:    24
? Error Message:  Error in buffin errorCode= 3.  len= 256 / 256
? Error generated from processor:     0
? This run generated   2 warnings
????????????????????????????????????????????????????????????????????????????????

Rank 0 [Tue Jun 12 09:37:44 2018] [c4-0c1s4n2] application called MPI_Abort(comm=0xC4000001, 9) - process 0

ATP Stack walkback for Rank 0 starting:
  [empty]@0x7ffff961f3a7
  in_bound_@in_bound.f90:1910
  inbounda_@inbounda.f90:4689
  read_flh_@read_flh.f90:61
  buffin64_i$io_@io.f90:1514
  mppio_ereport$io_@io.f90:450
  ereport64$ereport_mod_@ereport_mod.f90:102
  gc_abort_@gc_abort.F90:137
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort_@0x10b6e3c
  PMPI_Abort@0x10d572c
  MPID_Abort@0x10fd5e1
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 0, 1, 84, 24
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 00864] [c4-0c1s8n0] [Tue Jun 12 09:37:50 2018] PE RANK 25 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00850] [c4-0c1s4n2] [Tue Jun 12 09:37:50 2018] PE RANK 13 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00859] [c4-0c1s6n3] [Tue Jun 12 09:37:50 2018] PE RANK 103 exit signal Killed
[NID 00850] 2018-06-12 09:37:50 Apid 31111189: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 00857] [c4-0c1s6n1] [Tue Jun 12 09:37:50 2018] PE RANK 63 exit signal Killed
tewnb: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   137
   Completion time :   Tue Jun 12 09:37:56 BST 2018
*****************************************************************

/work/n02/n02/masara/um/tewnb/bin/qsmaster: Failed in qsatmos in job tewnb
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Tue Jun 12 09:37:56 BST 2018
***************************************************************

 STOP  
/work/n02/n02/masara/um/tewnb/bin/qshistprint: Job terminated normally
/work/n02/n02/masara/um/tewnb/bin/qsresubmit: No resubmit requested
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Tue Jun 12 09:37:56 BST 2018
*****************************************************************

/work/n02/n02/masara/um/tewnb/bin/qsmaster: Failed in qsfinal in job tewnb
 <<<< Information about How Many Lines of Output follow >>>>
 38  lines in main OUTPUT file.
 1537 lines of O/P from pe0.
 <<<<         Lines of Output Information ends          >>>>

And near the bottom are these messages:

OPEN:  File /work/n02/n02/masara/xklhf_makebc/xklhf_1.lbc to be Opened on Unit 125 does not Exist
OPEN:  **WARNING: FILE NOT FOUND
OPEN:  Ignored Request to Open File /work/n02/n02/masara/xklhf_makebc/xklhf_1.lbc for Reading
 ****************** IO Error Report ***********************************
Unit Generating error=  125
---File States --------------------------
Unit  30 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.ozone_L70_O70
  --> Opened from environment variable:OZONE
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit  35 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.sst
  --> Opened from environment variable:SSTIN
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 135 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrparm.veg.frac_hswd
  --> Opened from environment variable:FRACINIT
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 136 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrparm.veg.func_hswd
  --> Opened from environment variable:VEGINIT
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 154 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.biog70
  --> Opened from environment variable:ARCLBIOG
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 155 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.biom70
  --> Opened from environment variable:ARCLBIOM
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 156 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.blck70
  --> Opened from environment variable:ARCLBLCK
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 157 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.sslt70
  --> Opened from environment variable:ARCLSSLT
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 158 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.sulp70
  --> Opened from environment variable:ARCLSULP
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
Unit 160 open on filename /work/n02/n02/masara/ancils/vn8.2/cascade_12km/qrclim.ocff70
  --> Opened from environment variable:ARCLOCFF
   --> Read Only:  T  Local:  T  AllLocal:  F  Remote:  F  Broadcast:  T
---End File States ----------------------

???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: mppio:buffin
? Error Code:    24
? Error Message:  Error in buffin errorCode= 3.  len= 256 / 256
? Error generated from processor:     0
? This run generated   2 warnings
????????????????????????????????????????????????????????????????????????????????

So it looks like it is complaining about these ancillary files. Can you see a problem with the ancillaries? I copied them from /nerc/n02/n02-SWAMMA/wmcginty/ancil.vn8.2/cascade_12km/ and the copies appear to me to be the same as the originals (in terms of file names, sizes and permissions). Could I have some advice on this, please?

Thanks,
Masaru

Change History (7)

comment:1 Changed 7 days ago by willie

Hi Masaru,

It's saying the LBC file was not found: the unit generating the error is 125, and the job is looking for

/work/n02/n02/masara/xklhf_makebc/xklhf_1.lbc

The xklhf looks like my old job. You need to generate the LBC files for your run.
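
To see exactly which files a run could not open, the OPEN: messages can be pulled out of the .leave file. The following is only a minimal sketch: it assumes the message wording matches the log above, and the function name list_missing_files is made up for illustration (it is not part of the UM scripts).

```shell
# Hypothetical helper (not part of the UM scripts): print "unit path" for
# every file the model reported as missing. Assumes the OPEN: message has
# the wording seen in the log above.
list_missing_files() {
    # $1 = path to a .leave (job output) file
    sed -n 's/^ *OPEN: *File \(.*\) to be Opened on Unit *\([0-9][0-9]*\) does not Exist.*$/\2 \1/p' "$1"
}
```

Run against the .leave file quoted in the description, this should print the unit number followed by the path of the missing LBC file, which makes it easier to tell the failing unit apart from the ancillaries that are merely listed as open.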

Regards
Willie

comment:2 Changed 6 days ago by ggxmy

Hi Willie,

I was aware of that, but I thought the file wasn't created because there was a problem with the ancillary files. If I need to create it, how can I do that?

You seem to have xklhf_makebc/xklhf_1.lbc in /nerc/n02/n02-SWAMMA/wmcginty/ . Can I copy this over and use it?

Masaru

comment:3 Changed 6 days ago by ggxmy

At this point tewnb is simply my version of xlhub, and I only made the minimum changes necessary to run the job. So if xklhf_makebc works for xlhub, I would guess it should work for tewnb.

So I copied the entire folder /nerc/n02/n02-SWAMMA/wmcginty/xklhf_makebc to /work/n02/n02/masara/ and tried to run the job again. It ran slightly longer but then crashed. /home/n02/n02/masara/output/tewnb000.tewnb.d18163.t093415.leave.20180613-114211 contains the following messages near the beginning:

???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: INITIAL
? Error Code:   101
? Error Message: nc_file_open: NetCDF: HDF error
? Error generated from processor:     0
? This run generated  13 warnings
????????????????????????????????????????????????????????????????????????????????

Rank 0 [Wed Jun 13 11:43:01 2018] [c5-1c2s2n3] application called MPI_Abort(comm=0xC4000001, 9) - process 0
Application 31127117 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@flumeMain.f90:48
  um_shell_@um_shell.f90:2349
  u_model_@u_model.f90:2663
  initial_@initial.f90:5785
  ereport64$ereport_mod_@ereport_mod.f90:102
  gc_abort_@gc_abort.F90:137
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort_@0x10b6e3c
  PMPI_Abort@0x10d572c
  MPID_Abort@0x10fd5e1
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 0, 24, 30
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 04491] [c7-2c1s2n3] [Wed Jun 13 11:44:53 2018] PE RANK 25 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 02638] [c5-1c2s3n2] [Wed Jun 13 11:44:53 2018] PE RANK 84 exit signal Killed
[NID 04491] 2018-06-13 11:44:53 Apid 31127117: initiated application termination
tewnb: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   137
   Completion time :   Wed Jun 13 11:44:58 BST 2018
*****************************************************************

/work/n02/n02/masara/um/tewnb/bin/qsmaster: Failed in qsatmos in job tewnb
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Wed Jun 13 11:44:58 BST 2018
***************************************************************

 STOP  
/work/n02/n02/masara/um/tewnb/bin/qshistprint: Job terminated normally
/work/n02/n02/masara/um/tewnb/bin/qsresubmit: No resubmit requested
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Wed Jun 13 11:44:59 BST 2018
*****************************************************************

/work/n02/n02/masara/um/tewnb/bin/qsmaster: Failed in qsfinal in job tewnb
 <<<< Information about How Many Lines of Output follow >>>>
 38  lines in main OUTPUT file.
 2506 lines of O/P from pe0.
 <<<<         Lines of Output Information ends          >>>>

and the following near the end:

 NCFILE_INIT: Opening new NetCDF file tewnba.pa20110501_00.nc on unit  60
 NCFILE_INIT: Called in initial mode
 Creating netCDF4 classic model file tewnba.pa20110501_00.nc on unit  60
 ERROR: in procedure nc_file_open : NetCDF error number  -101 : NetCDF: HDF error
 Failure in call to NCFILE_INIT

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: INITIAL
? Error Code:   101
? Error Message: nc_file_open: NetCDF: HDF error
? Error generated from processor:     0
? This run generated  13 warnings
????????????????????????????????????????????????????????????????????????????????

tewnba.pa20110501_00 is present in /work/n02/n02/masara/um/tewnb/ . It seems to have been created just now, although it appears a bit small:

-rw-r--r-- 1 masara n02    8199264 Jun 13 11:43 tewnba.pa20110501_00

So the model (it looks like SUBROUTINE nc_file_open) failed to create tewnba.pa20110501_00.nc? Can you see what the problem could be? It doesn't seem to be a quota issue on /work.

Masaru

comment:4 Changed 4 days ago by willie

Hi Masaru,

This is due to an incorrect module setup in the build job. To resolve this you need to make the following changes to tewne:

  • switch off the hand edits netcdf_8.2_new_execs and set_cce_8.3.3.ed
  • add in the hand edit ~willie/hand_edits/SWAMMA_14Jun2018.ed and switch it on
  • on the Compile and Run Options → Compile and run options for atmosphere page, click the "change system default …" button and set the maximum number of compilation processes to 1.

Then build the executable. When I tried initially with six processes, it failed to compile one of the netcdf modules, but setting the number of processes to one gets round this. It builds in just over an hour.

With the rebuilt executable, you should increase the processing time (Job submission → Qsub) in the run job tewnb from 3600 to 7200 seconds.

See my jobs xobia and xobta for details.

For reference, the module setup for UM8.2 with netcdf is:

module swap PrgEnv-cray/5.2.82
module swap cray-mpich cray-mpich/7.5.5
module swap cce cce/8.5.8
module load gcc
module load pmi
module load cray-netcdf/4.4.1.1
module load cray-hdf5/1.10.0.1

These issues are typical of the problems encountered when porting older software to a new computer.

Regards
Willie

comment:5 Changed 33 hours ago by ggxmy

Hi Willie,

Thank you for your help. tewnb has been running for 10 minutes now, so it has probably got past the initial stage. So it sounds like I need to include this hand edit in future vn8.2 jobs, don't I?

Masaru

comment:6 Changed 33 hours ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Masaru

Yes - if you want to run with netcdf output.

Grenville

comment:7 Changed 30 hours ago by ggxmy

  • Resolution fixed deleted
  • Status changed from closed to reopened

Oh, is this to get output in netcdf? I did indeed get netcdf output. I didn't know that was possible at all and was expecting to get .pp files, but .nc files may be better in this case. How do I switch between .nc and .pp? Is it just by adding ~willie/hand_edits/SWAMMA_14Jun2018.ed?

Thank you.
Masaru
