Opened 9 years ago

Closed 9 years ago

#802 closed help (fixed)

Automatic continuation run stopped and will not re-start

Reported by: eelsj
Owned by: willie
Component: UM Model
Keywords: ocean, instability
Cc:
Platform:
UM Version: 6.6.3

Description

Hello Helpdesk,

The job XGTEC has been running on Hector using UM version 6.6.3 with automatic continuation over the last week or two. On Sunday morning it stopped part way through a month and failed to automatically resubmit.

I have tried resubmitting the job but PUMA appears to "hang" on the resubmit screen after the message …

Calling FCM_MAIN_SCR - local…
(This may take several minutes.)
Checking remote run directory …

I have checked disk space on Hector Safe and there appears to be sufficient space:

home:         Usage 14 Gb,  Quota 100 Gb, Files 155,890
work (esfs1): Usage 328 Gb, Quota 502 Gb, Files 595
work (esfs2): Usage 1 Gb,   Files 4

I also moved results files off /home/n02/n02/eelsj/work this morning, but it had no effect.

The .leave file (/home/n02/n02/eelsj/um/umui_out/xgtec075.xgtec.d12064.t051405.leave) gives the following messages, which did not help me much:

42:xgtec: Run failed
53:/work/n02/n02/eelsj/xgtec/bin/qsmaster: Failed in qsmaster in model xgtec
59:/work/n02/n02/eelsj/xgtec/bin/qsfinal: Error in exit processing after model run
60:Failed in model executable
64:/work/n02/n02/eelsj/xgtec/bin/qsresubmit: Error job not resubmitted because of error in qsmaster
75:/work/n02/n02/eelsj/xgtec/bin/qsmaster: failed in final in model xgtec
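
For anyone wanting to pull the same kind of summary out of a .leave file, here is a minimal sketch in Python. It assumes failures are flagged by the words "fail" or "error", as in the messages above; the function names and the keyword pattern are illustrative and not part of the UM scripts.

```python
import re

# Assumed keywords: the UM scripts may report failures in other words too.
FAILURE_PATTERN = re.compile(r"fail|error", re.IGNORECASE)


def find_failure_lines(text):
    """Return (line_number, line) pairs for lines that mention a failure."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if FAILURE_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def format_hits(hits):
    """Format hits as 'number:line', the same layout as the messages above."""
    return [f"{number}:{line}" for number, line in hits]
```

To use it, read the .leave file into a string, pass it to find_failure_lines, and print the result of format_hits one entry per line.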

I briefly checked some of the results files for the month before the run stopped and they appear sensible.

I'm not sure what to check next! Could you help me please?

Thank you

Lawrence

Change History (6)

comment:1 Changed 9 years ago by ros

Hi Lawrence,

First off, could you check that you can log in to HECToR from PUMA (ssh phase3.hector.ac.uk -l <userid>) without needing to enter a password? If you can't, your ssh-agent has stopped working correctly, which would explain the hang you're seeing in the UMUI. In that case you'll need to fix the ssh problem first:

1) Remove your $HOME/.ssh/environment.puma file on PUMA.
2) Log out and back in to PUMA. You should see a message something like "Reinitialising ssh-agent".
3) Run the command 'ssh-add' and you should be prompted for a passphrase.
4) Try logging into HECToR from PUMA and check that no password or passphrase is required.
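
The check in step 4 can also be done non-interactively. The sketch below is one possible way, assuming an OpenSSH client on PUMA: the BatchMode=yes option makes ssh fail rather than prompt for a password or passphrase. The function names and the timeout values are my own choices, not anything provided by PUMA or the UMUI.

```python
import subprocess


def build_ssh_check_command(host, user):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt for a password or passphrase
        "-o", "ConnectTimeout=10",  # stop trying to connect after 10 seconds
        "-l", user,
        host,
        "true",                     # harmless remote command
    ]


def can_login_without_password(host, user):
    """Return True if ssh logs in and runs `true` without any prompt."""
    try:
        result = subprocess.run(
            build_ssh_check_command(host, user),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,  # assumed upper bound for the whole check
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Timed out, or no ssh client was found on this machine.
        return False
    return result.returncode == 0
```

Calling can_login_without_password("phase3.hector.ac.uk", "<userid>") with your own user id should return True once the ssh-agent is working again.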

Regards,
Ros.

comment:2 Changed 9 years ago by eelsj

Sorry - I should have thought of that as it's happened to me before! I have submitted XGTEC and will let you know whether or not the resubmit is successful.

Thanks

Lawrence

comment:3 Changed 9 years ago by eelsj

Hi Ros,

I was able to resubmit the run but unfortunately it terminated at the same point as on Sunday morning giving the same messages in the .leave file (xgtec000.xgtec.d12065.t120026.leave).

Lawrence

comment:4 Changed 9 years ago by eelsj

Hello Helpdesk,

Have you made any progress with this problem? I re-ran XGTEC this morning and there was no change in the problem. The .leave file gave the following error message which may be helpful but, unfortunately, I have not been able to decipher it:

xgtec000.xgtec.d12068.t081147.leave

_pmii_daemon(SIGCHLD): [NID 02356] [c5-0c0s5n2] [Thu Mar 8 09:47:59 2012] PE 86 exit signal Segmentation fault
[NID 02356] 2012-03-08 09:47:59 Apid 1758204: initiated application termination
xgtec: Run failed

Thank you,

Lawrence

comment:5 Changed 9 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Lawrence,

Your job runs for 378,072 time steps and seems to be converging normally. However, at this time step you are also getting:

TS=126001 YEAR= 14.58 DAY=210.0 ENERGY= 1.189785E+01 DTEMP= 5.977087E-08 DSALT= 6.600719E-12 SCANS= 625

Global Net CO2 Flux into ocean (GtC/yr) -1.8784164886836388
Global Net CO2 Flux into ocean - 2nd C NaN

and the NaNs continue for the rest of the run. NaNs (Not a Number) can result from a variety of causes, including poor or invalid input data, incorrect or unstable algorithms, and faults elsewhere in the model. So I am guessing the ocean model has become unstable.
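
To find where the NaNs first appear in a long output file, something like the following Python sketch may help. It assumes the output looks like the extract above, with "TS=" marking each ocean time step line and "NaN" printed literally; the function name and the patterns are illustrative only.

```python
import re

# Assumed output format, taken from the extract above.
TS_PATTERN = re.compile(r"\bTS=\s*(\d+)")
NAN_PATTERN = re.compile(r"\bNaN\b")


def first_nan_timestep(lines):
    """Return (timestep, line) for the first line containing NaN, else None.

    timestep is the most recent TS= value seen up to that line, or None
    if no TS= line came before it.
    """
    current_ts = None
    for line in lines:
        match = TS_PATTERN.search(line)
        if match:
            current_ts = int(match.group(1))
        if NAN_PATTERN.search(line):
            return current_ts, line.strip()
    return None
```

Passing an open file object works as well as a list of lines, since the function only iterates over its argument. Comparing the time step it reports with the fields written just before it is a reasonable starting point for seeing which quantity went wrong first.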

Regards,

Willie

comment:6 Changed 9 years ago by willie

  • Keywords ocean, instability added
  • Resolution set to fixed
  • Status changed from accepted to closed