Opened 12 years ago

Closed 12 years ago

#132 closed help (fixed)

HECToR dump problem

Reported by: jeff Owned by: jeff
Component: UM Model Keywords:
Cc: Platform:
UM Version:

Description

I have now managed to get my code on HECToR running for a complete day; however, it now crashes when writing the dump. I get this error:

[0] MPIDI_Portals_Progress: dropped event on unexpected receive queue, increase
[0] queue size by setting the environment variable MPICH_PTL_UNEX_EVENTS
aborting job:
Dropped Portals event
[NID 2124]Apid 131357: initiated application termination
diff: /work/n02/n02/luke/tmp/xcwka.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/luke/xcwk/xcwka.thist to backup thist file /work/n02/n02/luke/xcwk/xcwka.thist_keep
xcwka: Run failed
*

Ending script : qsexecute
Completion code : 137
Completion time : Thu Apr 24 11:38:54 BST 2008

*

with the last output on proc 0 being

DUMPCTL: Opening new file xcwkaa.dah9920 on unit 22

WRITING UNIFIED MODEL DUMP ON UNIT 22
#####################################

and then the code stops. It should actually run for 2 days, not 1. The fields in the dump are all screwed up too, and the *fort6.pe* files are empty. I know that the output from proc 0 goes to the .leave file, but I'm surprised that the others are empty, since they only started being empty sometime yesterday.

Do you have any suggestions as to how to diagnose what is going on? There is no core file produced. I know I've probably messed up somewhere, I just can't think where!

thanks,
Luke

Change History (2)

comment:1 Changed 12 years ago by jeff

  • Status changed from new to assigned

comment:2 in reply to: ↑ description Changed 12 years ago by jeff

  • Resolution set to fixed
  • Status changed from assigned to closed

Replying to jeff:

I have now managed to get my code on HECToR running for a complete day; however, it now crashes when writing the dump. I get this error:

[0] MPIDI_Portals_Progress: dropped event on unexpected receive queue, increase
[0] queue size by setting the environment variable MPICH_PTL_UNEX_EVENTS
aborting job:

This is the cause of the crash. An easy way to fix it is to increase the value of MPICH_PTL_UNEX_EVENTS, as the error message suggests. By default this has a value of 240000, defined in your SUBMIT file. You can override this value by adding the environment variable in the umui via the window

Sub-Model Independent → Script Inserts and Modifications
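As a rough sketch, the override added through that window would be an ordinary shell assignment along these lines. The value 480000 is only an assumed illustration (double the 240000 default mentioned above), not a tested recommendation, and exactly where the line ends up in the job script depends on how your job is set up.

```shell
# Assumed example value: twice the 240000 default set in the SUBMIT file.
# Choose a value appropriate to your own job.
MPICH_PTL_UNEX_EVENTS=480000
export MPICH_PTL_UNEX_EVENTS

# Echo the setting so it is visible in the job output.
echo "MPICH_PTL_UNEX_EVENTS=${MPICH_PTL_UNEX_EVENTS}"
```

The variable has to be exported so that the MPI processes started by the job inherit it.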

However, this may not be the best way to proceed. When we first installed the UM, Simon Wilson noticed that writing dumps seemed to take a large proportion of the run time, and that this was caused by a large number of MPI messages being directed at PE0. He fixed this by adding a barrier to the dump-writing code, so that the UM had a chance to handle the messages before a load more were sent. This fix can be found in mod $UMDIR/vn6.1/mods/hector_io.mf77 ($PUM_MODS61/hector_io.mf77 in your umui job). The mod reduces the number of unexpected messages, so it should help with your problem. Try it first, before you make any changes to the MPICH_PTL_UNEX_EVENTS variable.

Jeff.
