Opened 12 years ago

Closed 12 years ago

#148 closed error (fixed)

Tic code for running NAE domain at 76 levels on HECToR

Reported by: oma Owned by: jeff
Component: UM Mesoscale Keywords:
Cc: l.steenman-clark@… Platform:
UM Version:

Description

Dear All,

I'm trying to run the UM over the NAE domain with 76 vertical levels on HECToR.

I was using 16 processors E-W and 8 N-S on HPCx for the same job.

On HECToR I tried using the Tic code n02-weat but I received the following error message:

oma not a member of budget n02-weat

Job terminated on failure of budget access/allocation validation
Could not run prolog: pro/epilogue failed, file: /var/spool/PBS/mom_priv/prologue, exit: 1, nonzero p/e exit status

Could you let me know what is the appropriate Tic code to use in this case and give me access to it?

Kind regards and thanks,

Oscar

Attachments (2)

xddmb000.xddmb.d08189.t142028.leave (143.3 KB) - added by oma 12 years ago.
xddmb000.xddmb.d08190.t103214.leave (84.0 KB) - added by oma 12 years ago.


Change History (8)

comment:1 follow-up: Changed 12 years ago by jeff

  • Cc l.steenman-clark@… added
  • Owner changed from um_support to jeff
  • Status changed from new to assigned

Hi Oscar

I've given you access to the n02-weat TIC code, so try your job again.

Jeff.

comment:2 in reply to: ↑ 1 Changed 12 years ago by oma

Hi Jeff

Thank you for giving me access to n02-weat. I ran the model again and the reconfiguration seems to have worked fine. However, the model itself failed. I'm copying below the relevant piece of the file

/home/n02/n02/oma/um/umui_out/xddmb000.xddmb.d08188.t022231.leave

where the error is shown. I hope you can help me to identify the problem.

Kind regards,

Oscar

*

Job started at : Mon Jul 7 00:44:12 BST 2008
Run started from UMUI
Running from control files in /home/n02/n02/oma/umui_runs/xddmb-188022312

NAE(720432) - Feb02 - Not ready yet
This job is running on machine nid00004,
using UM directory /work/n02/n02/hum,
and test directory /work/n02/n02/hum/umtest.
*

Starting script : qsexecute
Starting time : Mon Jul 7 00:44:12 BST 2008

*

/work/n02/n02/oma/tmp/tmp.nid00004.3054/modscr_xddmb/qsexecute: Executing setup

/work/n02/n02/hum/vn6.1/pathscale/scripts/qssetup: Job terminated normally

xddmb: Starting run

[0] MPIDI_Portals_Progress: dropped event on "other" queue, increase

[0] queue size by setting the environment variable MPICH_PTL_OTHER_EVENTS

aborting job:

Dropped Portals event

[NID 12725]Apid 196588: initiated application termination

diff: /work/n02/n02/oma/tmp/tmp.nid00004.3054/xddmb.xhist: No such file or directory

qsexecute: Copying /work/n02/n02/oma/xddmb/xddmb.thist to backup thist file /work/n02/n02/oma/xddmb/xddmb.thist_keep

xddmb: Run failed
*

Ending script : qsexecute
Completion code : 137
Completion time : Mon Jul 7 00:44:35 BST 2008

*

comment:3 Changed 12 years ago by jeff

Hi Oscar

You need to include the mod $PUM_MODS61/hector_io.mf77; hopefully this will fix the problem, and it should also make the model run faster. You should probably also include the mods $PUM_MODS61/pum_full_6.1.mf90 and $PUM_MODS61/pum_full_6.1.mh. The last two should also be used when compiling the reconfiguration.

Jeff.

comment:4 Changed 12 years ago by oma

Hi Jeff,

I'm sorry for coming back to the same issue again.

I've added the three mods you suggested, but I received the same error message:

xddmb: Starting run

[0] MPIDI_Portals_Progress: dropped event on "other" queue, increase

[0] queue size by setting the environment variable MPICH_PTL_OTHER_EVENTS

aborting job:

Dropped Portals event

[NID 1268]Apid 198212: initiated application termination

I'm attaching the last two .leave files, with (*.t142028.leave) and without (*.t103214.leave) reconfiguration. The reconfiguration step seems OK, and the beginning of the run also seems fine: it assigns 128 processors and reads the start dump and the model constants, but then it stops.

As additional information, the job is xddmb at Meteorology in Reading. The ancillary files were created using the BADC service and I transferred them from HPCx (where this case runs fine) to HECToR.

Thanks,

Oscar

comment:5 Changed 12 years ago by jeff

Hi Oscar

To increase the value of MPICH_PTL_OTHER_EVENTS, go to the UMUI panel

Sub-Model Independent → Script Inserts and Modifications

and add this variable name to the "Defined Environment Variables" panel. The default value of this variable is 2048, so you could try increasing it to 24000.
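
For reference, a minimal sketch of what this setting amounts to at the shell level. It assumes the UMUI passes entries from that panel through to the job script as ordinary exported environment variables; the exact script the UMUI generates is not shown here and may differ.

```shell
# Assumption: the UMUI exports the variable before the model is launched.
# 24000 is the trial value suggested above; the stated default is 2048.
export MPICH_PTL_OTHER_EVENTS=24000

# Print the value so it can be checked in the job output.
echo "MPICH_PTL_OTHER_EVENTS=${MPICH_PTL_OTHER_EVENTS}"
```

If the "Dropped Portals event" abort still appears with this value, the setting may not have reached the job environment, so checking the printed value in the .leave file is a reasonable first step.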

Jeff.

comment:6 Changed 12 years ago by jeff

  • Resolution set to fixed
  • Status changed from assigned to closed