Opened 6 years ago

Closed 5 years ago

#1167 closed help (fixed)

HiGEM runs crashed, reason unknown

Reported by: till
Owned by: willie
Component: UM Model
Keywords: HiGEM crash
Cc:
Platform: HECToR
UM Version: 6.1

Description

Dear CMS Team,
I'm afraid one of my HiGEM jobs has crashed again. This time it's xgvwt. All of the jobs crash every once in a while, but this time the usual remedies don't work: I've changed the target diffusion parameter and the timestep for the ice advection, and the job still crashes. The last .leave file of xgvwt:
/home/n02/n02/till/um/umui_out/xgvwt000.xgvwt.d13315.t102050.leave
contains all the extended diagnostics. Maybe this can help you to find out what went wrong? Any hints are welcome!
Thank you
Till

Change History (6)

comment:1 Changed 6 years ago by willie

Hi Till,

Your job ran successfully for 648 timesteps and was then killed, which suggests that it ran out of memory. You could try increasing the number of processors, say from 8x16 to 16x32.

Regards,

Willie

comment:2 Changed 6 years ago by till

Hi Willie,
thank you for that hint. Unfortunately, xgvwt doesn't run with the new 16x32 configuration. The .leave file says: "Attempting to use an MPI routine before initializing MPICH". Is there anything else I should change in the UMUI?

comment:3 Changed 6 years ago by willie

Hi Till,

The .leave file says:

aprun -n 512 -N 32 -S 8 -ss /work/n02/n02/till/xgvwg/xgvwg_FLUSH.exec
 Error : MAXPROC is not big enough.
 You will need to edit the parameter in comdeck PARPARM.
 MAXPROC=  288  nproc_max=  512

So there is a limit of 288 processors for this job. The 16x32 configuration needs 512 processors, which exceeds that limit, but you could go as far as 16x16 (256 processors).
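The check behind this limit can be sketched in a few lines of Python. This is only an illustration: the value 288 is the MAXPROC reported in the log above, while the function name `fits_maxproc` and the list of candidate decompositions are hypothetical and not part of the UM.

```python
# MAXPROC as reported in the .leave file (set in comdeck PARPARM).
MAXPROC = 288


def fits_maxproc(ew, ns, maxproc=MAXPROC):
    """Return True if an ew x ns decomposition stays within maxproc.

    ew and ns are the numbers of processes in the two horizontal
    directions; the job needs ew * ns processes in total.
    (Hypothetical helper, for illustration only.)
    """
    return ew * ns <= maxproc


# Candidate decompositions mentioned in this ticket.
for ew, ns in [(8, 16), (16, 16), (16, 32)]:
    total = ew * ns
    verdict = "ok" if fits_maxproc(ew, ns) else "exceeds MAXPROC"
    print(f"{ew}x{ns} = {total} processes: {verdict}")
```

Going beyond 288 processes would need the parameter in PARPARM to be edited, as the error message says.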

Another alternative is to run with fewer cores per node, so that each process has more memory available. This is done in Sub model Independent > Job submission, resources … Tick "use non-default number of cores per node" and enter, say, 16 in the box.
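To give a rough idea of what this setting does, here is a small Python sketch. It assumes that a node's memory is shared evenly among the processes placed on it, and it takes 32 as the default number of cores per node from the "-N 32" in the aprun line above. The helper names are hypothetical.

```python
import math

# Default processes per node, taken from "-N 32" in the aprun line.
DEFAULT_CORES_PER_NODE = 32


def nodes_needed(nprocs, procs_per_node):
    """Number of nodes required to place nprocs processes."""
    return math.ceil(nprocs / procs_per_node)


def memory_gain(procs_per_node, default=DEFAULT_CORES_PER_NODE):
    """Factor by which memory per process grows relative to the default.

    Assumes node memory is split evenly among the processes on the node.
    """
    return default / procs_per_node


nprocs = 8 * 16  # the original 8x16 decomposition
for per_node in (32, 16, 8):
    print(f"{per_node} per node: {nodes_needed(nprocs, per_node)} nodes, "
          f"{memory_gain(per_node):.1f}x memory per process")
```

Under these assumptions, entering 16 doubles the memory available to each process, and the job occupies twice as many nodes for the same number of processes.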

Regards,

Willie

comment:4 Changed 6 years ago by annette

  • Owner changed from um_support to willie
  • Status changed from new to assigned

comment:5 Changed 5 years ago by till

Hi Willie,
in the meantime I have found another error in this run (which is my fault), so unfortunately I need to restart it anyway. Hopefully it runs more smoothly this time!
Cheers
Till

comment:6 Changed 5 years ago by willie

  • Resolution set to fixed
  • Status changed from assigned to closed