#2506 closed help (fixed)

error in glue_conv

Reported by: ggxmy
Owned by: um_support
Component: UM Model
Keywords: glue_conv
Cc:
Platform: ARCHER
UM Version: 8.2

Description

I'm trying to run my vn8.2 limited area job, tewng. My copy of xlhub, tewnb, has run for 1 day, and the only difference from tewnb is that tewng uses a vegetation fraction ancillary file that I created from satellite data. I get messages like these near the beginning of /home/n02/n02/masara/output/tewng000.tewng.d18169.t165516.leave.20180619-141416:

Rank 65 [Tue Jun 19 15:13:58 2018] [c6-1c0s5n0] application called MPI_Abort(comm=0x84000006, 9) - process 53
Application 31184508 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 65 starting:
  [empty]@0x7ffff5e9563f
  ni_conv_ctl__cray$mt$p0001@ni_conv_ctl.f90:2305
  glue_conv$glue_conv_mod_@glue_conv-gconv4a.f90:19
  ereport64$ereport_mod_@ereport_mod.f90:53
  gc_abort_@gc_abort.F90:137
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort__@0x162e34c
  MPI_Abort@0x16a0efd
  MPID_Abort@0x16ea291
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 65 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 65, 8, 24, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 02711] [c6-1c0s5n3] [Tue Jun 19 15:14:08 2018] PE RANK 25 exit signal Killed
[NID 02711] 2018-06-19 15:14:08 Apid 31184508: initiated application termination
tewng: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   137
   Completion time :   Tue Jun 19 15:14:14 BST 2018
*****************************************************************

/work/n02/n02/masara/um/tewng/bin/qsmaster: Failed in qsatmos in job tewng
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Tue Jun 19 15:14:14 BST 2018
***************************************************************

I don't see any error report near the end of the file. The output above seems to say that the run crashed at or after time step 90, as soon as glue_conv was called on process 53. This job uses the executable built in tewne, so the compiled code used here should be that of tewne.

Can you see any clue as to the cause of the problem?

Thanks,
Masaru

Change History (3)

comment:1 Changed 12 months ago by ggxmy

In tewng, I tried to run reconfiguration and simulation using the reconfiguration executable generated in tewnc and the model executable generated in tewne. In these jobs the original ancillary was used. Did I actually need to change the ancillary in these jobs as well?

Masaru

comment:2 Changed 12 months ago by simon

All jobs are independent, so there's no need to change anything in tewnc and tewne. In fact, you're updating the ancil, which means it is only read in as the model runs and the reconfiguration isn't doing anything with it. As it's a time-invariant field, it should really be configured. This may help, as the reconfiguration might adjust the other fields on tiles to account for the changes in veg fractions.

The model is going unstable on process 53, which contains the point in the exact centre of the domain. I don't know whether this is significant.

There could be many reasons for the failure. You could try checking the ancil: I'm guessing that all 9 veg fractions should add up to 1 at each point. You could also try halving the timestep, or using a different initial dump, to see if that helps.
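
For reference, a check of this kind could be sketched in Python roughly as follows. This is only an illustration: the function name find_bad_points, the list-of-lists layout and the 1e-5 tolerance are assumptions made for the example, not part of the UM or of any ancillary tool, and the input is assumed to contain land points only.

```python
import math

def find_bad_points(fractions, tol=1e-5):
    """Return (index, total) for points whose fractions do not sum to 1.

    fractions: one inner list per land point, holding the fraction of
    each surface type at that point (hypothetical layout).
    """
    bad = []
    for i, point in enumerate(fractions):
        total = math.fsum(point)  # accurate floating-point sum
        if abs(total - 1.0) > tol:
            bad.append((i, total))
    return bad

# Example with made-up values: the second point sums to 1.25.
points = [[0.5, 0.5], [0.5, 0.75], [0.25, 0.25, 0.5]]
for index, total in find_bad_points(points):
    print(f"point {index}: fractions sum to {total}")
```

Reporting the index and the total for each failing point makes it easier to see whether the errors are tiny rounding differences or real gaps in the data.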

comment:3 Changed 12 months ago by ggxmy

  • Resolution set to fixed
  • Status changed from new to closed

Hi Simon,

Thank you for the advice. That turned out to be a bull's-eye. Even though I tried to be very careful, the veg fractions did not add up to 1 at many grid points. I found this was because I used a condition like "if (fveg_total eq 1.0)" to find the land points in my code (IDL). Many values displayed as 1.0000 were equal to 1.0, but some were apparently treated as slightly larger than 1.0 and others as slightly smaller. The wrong values got through my checks because I used the same conditional statement there. It was quite puzzling, and I tried many things before I found this out, but all I had to do was change the condition to something like "if (fveg_total gt 0.99999 and fveg_total lt 1.00001)".
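
The same effect can be reproduced in any binary floating-point arithmetic. The snippet below is a minimal Python sketch rather than the original IDL code; the helper names accumulate and is_unity and the ten equal fractions are made up for illustration, and the 1e-5 tolerance mirrors the 0.99999 to 1.00001 window above.

```python
def accumulate(values):
    """Add values one at a time, as a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

def is_unity(total, tol=1e-5):
    """Tolerance test in the spirit of 0.99999 < total < 1.00001."""
    return abs(total - 1.0) < tol

# Ten equal fractions that sum to exactly 1 on paper.
fveg_total = accumulate([0.1] * 10)

print(f"{fveg_total:.4f}")    # formatted to four decimals
print(fveg_total == 1.0)      # exact comparison
print(is_unity(fveg_total))   # comparison with a tolerance
```

Here the total prints as 1.0000, yet the exact comparison is False while the tolerance test is True, because 0.1 has no exact binary representation and the rounding errors accumulate.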

So I regenerated the ancillary file and UM ran OK with it. Thanks again for your help!

Masaru
