Opened 11 years ago

Closed 9 years ago

#365 closed help (fixed)

4km job won't run

Reported by: anmcr Owned by: grenville
Component: UM Mesoscale Keywords:
Cc: Platform:
UM Version: 6.1

Description

Hello,

I modified the standard UK 4km 100x100x38 to run over a section of Antarctica. I seem to be able to reconfigure my start dump, but I can't get the model run to take place. Unfortunately there is little information from the .leave file. To keep the run simple I only reconfigured the orography and land-sea mask ancillaries. All other ancillaries were set to 'not used'. I would be very grateful if someone could please have a look. The job id is xdvhs. My username on hector is anmcr.

Thanks,

Andrew

Change History (5)

comment:1 Changed 11 years ago by grenville

  • Owner changed from um_support to grenville
  • Status changed from new to assigned

Andrew

Your job is quite big. Try increasing the number of processors; I have a slightly bigger job that runs with 12x16 processors. The error code (137) returned when your job failed indicates a memory problem.

Grenville
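For readers wondering why 137 points to memory: by the usual Unix shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). A batch system or kernel often sends SIGKILL when a job exceeds its memory limit, though that is not the only possible cause. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical, and it assumes the completion code reported by qsexecute follows the shell convention.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (assumes the 128 + N convention)."""
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with error code {status}"


print(describe_exit_status(137))
```

On Linux, signal 9 is SIGKILL, so the call reports a kill rather than an ordinary error exit. This only identifies the signal; it does not prove the cause was memory.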

comment:2 Changed 11 years ago by anmcr

Dear Grenville,

I increased the number of processors as you suggested and that worked.

But I immediately got another error that I am not familiar with, and there is little information about what went wrong. See below.

Do you have any idea? The reconfiguration seemed to go fine, and the failure occurred immediately after the run started. I don't know what the signal failure on PE 115 refers to.

Thanks,

Andrew


xdvhs: Starting run

Segmentation fault! Fault address: (nil)

This is likely to have been caused by either a null pointer dereference or a general protection fault.
_pmii_daemon(SIGCHLD): PE 115 exit signal Aborted
[NID 4168]Apid 1522036: initiated application termination
diff: /work/n02/n02/anmcr/tmp/tmp.nid00004.21989/xdvhs.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/anmcr/xdvhs/xdvhs.thist to backup thist file /work/n02/n02/anmcr/xdvhs/xdvhs.thist_keep
xdvhs: Run failed
*

Ending script : qsexecute
Completion code : 137
Completion time : Thu Jan 7 11:30:15 GMT 2010

*
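When the .leave file is long, it can help to pull out just the lines that say which task failed and how the script ended. In the output above, "PE 115" most likely refers to processing element 115, that is, one of the parallel tasks of the run. The following is a rough sketch only: the function name and the regular expressions are assumptions based on the two line formats visible in this ticket, not a documented UM or Cray format.

```python
import re

# Patterns modelled on the lines shown in this ticket (assumed, not official).
PE_SIGNAL = re.compile(r"PE (\d+) exit signal (\w+)")
COMPLETION = re.compile(r"Completion code\s*:\s*(\d+)")


def summarise_leave(text: str) -> dict:
    """Collect failing PEs and the completion code from .leave file text."""
    summary = {"failed_pes": [], "completion_code": None}
    for line in text.splitlines():
        match = PE_SIGNAL.search(line)
        if match:
            summary["failed_pes"].append((int(match.group(1)), match.group(2)))
        match = COMPLETION.search(line)
        if match:
            summary["completion_code"] = int(match.group(1))
    return summary


sample = """\
_pmii_daemon(SIGCHLD): PE 115 exit signal Aborted
[NID 4168]Apid 1522036: initiated application termination
xdvhs: Run failed
Completion code : 137
"""
print(summarise_leave(sample))
```

To use it on a real job, read the .leave file into a string and pass it to summarise_leave in place of the sample text.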

comment:3 Changed 11 years ago by willie

Hi Andrew,

I increased the number of procs to 16x8 and got the failure above. Flushing the output shows that the model fails to converge in the first time step. This can indicate incorrect surface fields, so you may have to configure some ancillaries.

I hope that helps.

Willie

comment:4 Changed 11 years ago by anmcr

Thanks Willie,

Your suggestion worked and the model run completed successfully.

Andrew

comment:5 Changed 9 years ago by ros

  • Resolution set to fixed
  • Status changed from assigned to closed