Opened 10 years ago

Closed 10 years ago

#504 closed help (fixed)

Problem running 1.5 km UM7.4 on Hector

Reported by: pclark
Owned by: um_support
Component: UM Model
Keywords: LAM failure
Cc:
Platform:
UM Version: 7.4

Description

I'm trying to run essentially an 'HRTM' setup at 1.5 km; this is pretty much plain vanilla UM LAM apart from 76 levels.
My UM exec and reconfig execs are built using xfgwa with no apparent problems.
My reconfiguration job (which doesn't do anything apart from bringing my dump up to 7.4) is xfgwf and runs with no apparent problems.
When I run the model, job xfgwg, it immediately falls over with 'Floating point exception signaled at 934068: integer divide by zero'
(see /home/n02/n02/paclark/um/umui_out/xfgwg000.xfgwg.d10265.t094105.leave)

Judging from the output, I think it's probably falling over in 'UM_INDEX', which is either pretty fatal or suggests I've done something really silly, but I cannot see what.

I'm not familiar with debugging with e.g. totalview on Hector - any guidance gratefully received. Has anyone got a working LAM I can compare with?

Change History (10)

comment:1 Changed 10 years ago by willie

Hi Peter,

If you have a core file you can use

  gdb <dir spec to executable> core

and then type "where". Could you also give read permission for the um/umui_out directory and I'll have a further look.
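The gdb step above can also be run non-interactively, which is handy on a batch system. This is only a minimal sketch: the function name core_backtrace is made up for illustration, and it assumes gdb is on the PATH and that the core file was written by the executable you pass in.

```shell
# Hypothetical helper: print a backtrace from a core file without an
# interactive gdb session. The name "core_backtrace" is illustrative only.
core_backtrace() {
    exe="$1"
    core="${2:-core}"   # default to a file called "core" in the current directory

    # Refuse to run if either file is missing, rather than let gdb guess.
    if [ ! -f "$exe" ] || [ ! -f "$core" ]; then
        echo "usage: core_backtrace <executable> [core file]" >&2
        return 1
    fi

    # -batch exits once the commands have run; -ex "where" prints the call stack.
    gdb -batch -ex "where" "$exe" "$core"
}

# Example call (substitute your own executable and core file):
#   core_backtrace <dir spec to executable> core
```

The output is the same call stack that typing "where" at the gdb prompt would give.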

Regards,

Willie

comment:2 Changed 10 years ago by pclark

Hi Willie

You should have read permission now.

Not sure how to generate a core on Hector - I have SAVECORE=true in SCRIPT, but I cannot find a core file. What else is needed?

Thanks

Pete

comment:3 Changed 10 years ago by willie

Hi Peter,

To get a core just include the branch

fcm:um-br/dev/ros/VN7.4_generate_core/src

This modifies the qsexecute script by adding the command 'ulimit -c unlimited'.
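For reference, the effect of that command can be checked from any shell. The following is a rough sketch rather than the contents of the branch: the function name enable_core_dumps is hypothetical, and it raises the soft limit only as far as the hard limit, since a plain 'ulimit -c unlimited' fails if the hard limit is lower. Whether the batch system on Hector then lets the core file be written is a separate matter.

```shell
# Hypothetical helper: raise the core-file size limit for this shell and
# any processes it starts. The name "enable_core_dumps" is illustrative only.
enable_core_dumps() {
    echo "core limit before: $(ulimit -S -c)"

    # The soft limit may always be raised up to the hard limit.
    hard="$(ulimit -H -c)"
    ulimit -S -c "$hard"

    echo "core limit after:  $(ulimit -S -c)"
}

enable_core_dumps
```

If the limit still reads 0 afterwards, the hard limit itself is 0 and no core file will be written until that is changed.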

Please could you cd to your home directory and type

chmod g+rx .

to give me the read permission.

Regards,

Willie

comment:4 Changed 10 years ago by pclark

I'm confused now - I already included fcm:um-br/dev/ros/VN7.4_generate_core/src, to no effect. This doesn't surprise me, as FCM only does an extract when one is building an exec, which I'm not (already built in xfgwa), and, of course, xfgwg doesn't use xfgwa's qsexecute script.

I'll make my home dir readable when I can log on again - Hector's now down for maintenance (sigh!).

Thanks

Pete

comment:5 Changed 10 years ago by ros

Hi Pete,

You can force a run-only job to re-extract all the UM scripts and apply any modifications by selecting the switch "Enable build of UM scripts" in the UMUI window Compilations and Modifications → UM Scripts Build.

You should then get a corefile.

Cheers,
Ros.

comment:6 Changed 10 years ago by pclark

Doh! I didn't know that! I'll rerun when Hector is up.

comment:7 Changed 10 years ago by pclark

Hi Willie and Ros
Hopefully you can see my directories now.

I've tried running with "Enable build of UM scripts" - I now get an extract error:

pclark@puma:~> cat /home/pclark/um/um_extracts/xfgwg/umbase/ext.out
Extract command started on Fri Sep 24 09:49:52 2010.
->Parse configuration: start
Config file (ext): svn://puma/UM_svn/UM/branches/dev/um/VN7.4_machine_cfg/src/configs/bindings/container.cfg@2109
Config file (ext): svn://puma/UM_svn/UM/branches/dev/um/VN7.4_machine_cfg/src/configs/machines/cray-xt4-pathscale-hector/machine.cfg@2109
Config file (ext): /dev/null
Unable to read config file "/home/pclark/umui_jobs/xfgwg/FCM_UMUI_BASE_CFG", abort at /home/um/fcm/bin/../lib/Fcm/ConfigSystem.pm line 528

(I do have inc ~um/fcm/etc/um_revisions.cfg in my .fcm, BTW)

Cheers Pete

comment:8 Changed 10 years ago by ros

Hi Pete,

That'll be a bug in the UMUI: it hasn't created the files it needs to do an extract! I'll investigate, but in the meantime I'll send you an email with a workaround.

Regards,
Ros.

comment:9 Changed 10 years ago by pclark

Problem solved. (It was me!)

In trying to sort out the 'namelist' error (see earlier ticket), I reverted to 3A radiation in my run job, using a different build. Having sorted that, I reverted to my original build (3C radiation) but forgot to change the run job back. Sorry to have wasted your time, but thanks for the help, as I wouldn't have tracked this down otherwise.

Oddly, with Ros's workaround I got a core file once, but on re-running I didn't (even after deleting the original). Never mind.

Thanks

comment:10 Changed 10 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed