Opened 12 years ago

Closed 12 years ago

#197 closed help (fixed)

Run on HECToR crashes at 263 timesteps

Reported by: swr06rjk
Owned by: willie
Component: UM Model
Keywords: HECToR
Cc: w.mcginty@…
Platform:
UM Version: 6.1

Description

I'm trying to run job xdofb on HECToR, and it crashes with the following message in the .leave file:

" Segmentation fault! Fault address: (nil)

This is likely to have been caused by either a null pointer dereference or a general protection fault.
_pmii_daemon(SIGCHLD): PE 7 exit signal Aborted"

The output from node 7 just stops at 263 timesteps. If I ask the model to run for 262 timesteps it runs fine without any problems.

I had the job working on HPCx, as xdfpi, so I would have expected it to work OK on HECToR.

Change History (5)

comment:1 Changed 12 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to assigned

Richard,
OK, I'll take a look at this.
Regards
Willie

comment:2 Changed 12 years ago by lois

  • Cc w.mcginty@… added

Hello Richard, it looks as though you don't have the minimum mods needed to run on HECToR. This set of mods is

script mods : $PUM_MODS61/pum_full_6.1.mu
reconfiguration/model mods : $PUM_MODS61/pum_full_6.1.mu

$PUM_MODS61/pum_full_6.1.mf77
$PUM_MODS61/pum_full_6.1.mf90

If you include these mods and try running it again, hopefully that will solve the problem.

Lois

comment:3 Changed 12 years ago by willie

Richard,

I note that you have specified 14 boundary layer levels in the vertical, but in the ratio table there are only 13 entries. See Atmos > Scientific > Section by Section > Boundary layers.
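
A mismatch of this kind is a simple length inconsistency between the level count and the table. As a rough sketch only (the names `bl_levels` and `ratio_table` and the table values are hypothetical placeholders, not actual UMUI variable names or settings from this job), a check along these lines would flag it before a run is submitted:

```python
def check_ratio_table(bl_levels, ratio_table):
    """Return a list of problems; the list is empty if the counts agree."""
    problems = []
    if len(ratio_table) != bl_levels:
        problems.append(
            f"{bl_levels} boundary layer levels specified, "
            f"but ratio table has {len(ratio_table)} entries"
        )
    return problems


# Illustrative values only: 14 levels against a 13-entry table.
# The 1.0 entries are placeholders, not the job's real ratios.
for problem in check_ratio_table(14, [1.0] * 13):
    print(problem)
```

The function returns messages rather than raising, so several such checks could be collected and reported together.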

Doing a Check Setup in the UMUI reveals a few problems. These should be eliminated before doing a run.

Let me know if this solves the problem.

Regards
Willie

comment:4 Changed 12 years ago by willie

Richard,

I've now run your job for over 50,000 timesteps. The script update should only appear in the script sections and not in the modsets for reconfiguration or the model. I have not found any duplicate Fortran files, and I have run the reconfiguration and model sequentially in one run. The model does crash, however, and there is one type of error, "error halo_j too small 3", which occurs three times. This looks like a science issue.
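
For anyone wanting to tally error messages in an output file the same way, here is a minimal sketch. It assumes the errors appear as plain text lines containing the word "error"; the function name `count_errors` and the sample lines are made up for illustration and are not real UM output.

```python
from collections import Counter


def count_errors(lines, marker="error"):
    """Count each distinct line that contains `marker` (case-insensitive)."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        if marker in text.lower():
            counts[text] += 1
    return counts


# Made-up sample standing in for the contents of an output file.
sample = [
    "step 1 done",
    "error halo_j too small 3",
    "step 2 done",
    "error halo_j too small 3",
    "error halo_j too small 3",
]

for message, n in count_errors(sample).items():
    print(f"{n} x {message}")
```

To scan a real file, pass an open file object in place of `sample`, since the function accepts any iterable of lines.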

Regards,
Willie

comment:5 Changed 12 years ago by willie

  • Resolution set to fixed
  • Status changed from assigned to closed