Opened 9 years ago
Closed 9 years ago
#802 closed help (fixed)
Automatic continuation run stopped and will not re-start
Reported by: | eelsj | Owned by: | willie
---|---|---|---
Component: | UM Model | Keywords: | ocean, instability
Cc: | | Platform: |
UM Version: | 6.6.3 | |
Description
Hello Helpdesk,
The job XGTEC has been running on Hector using UM version 6.6.3 with automatic continuation over the last week or two. On Sunday morning it stopped part way through a month and failed to automatically resubmit.
I have tried resubmitting the job but PUMA appears to "hang" on the resubmit screen after the message …
Calling FCM_MAIN_SCR - local…
(This may take several minutes.)
Checking remote run directory …
I have checked disk space on Hector Safe and there appears to be sufficient space:
Filesystem | Usage | Quota | Files
---|---|---|---
home | 14 Gb | 100 Gb | 155,890
work (esfs1) | 328 Gb | 502 Gb | 595
work (esfs2) | 1 Gb | | 4
I also moved results files off /home/n02/n02/eelsj/work this morning, but it had no effect.
The .leave file (/home/n02/n02/eelsj/um/umui_out/xgtec075.xgtec.d12064.t051405.leave) gives the following messages, which did not help me much:
42:xgtec: Run failed
53:/work/n02/n02/eelsj/xgtec/bin/qsmaster: Failed in qsmaster in model xgtec
59:/work/n02/n02/eelsj/xgtec/bin/qsfinal: Error in exit processing after model run
60:Failed in model executable
64:/work/n02/n02/eelsj/xgtec/bin/qsresubmit: Error job not resubmitted because of error in qsmaster
75:/work/n02/n02/eelsj/xgtec/bin/qsmaster: failed in final in model xgtec
I briefly checked some of the results files for the month before the run stopped and they appear sensible.
I'm not sure what to check next! Could you help me please?
Thank you
Lawrence
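The numbered messages quoted from the .leave file look like the result of searching the file for failure keywords and printing each hit with its line number. A minimal sketch of that kind of search is given here; the function name `failure_lines` and the keyword list are illustrative assumptions, not part of the UM scripts.

```python
import re

# Keywords are an assumption chosen to match the messages quoted in this ticket.
FAILURE_PATTERN = re.compile(r"fail|error|segmentation fault", re.IGNORECASE)


def failure_lines(lines):
    """Return (line_number, text) pairs for lines that mention a failure.

    Line numbers start at 1, like the prefixes in the quoted output.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if FAILURE_PATTERN.search(line)
    ]
```

To use it on a real file, open the .leave file and pass the file object (or `readlines()`) to `failure_lines`, then print each pair as `number:text`.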
Change History (6)
comment:1 Changed 9 years ago by ros
comment:2 Changed 9 years ago by eelsj
Sorry - I should have thought of that as it's happened to me before! I have submitted XGTEC and will let you know whether or not the resubmit is successful.
Thanks
Lawrence
comment:3 Changed 9 years ago by eelsj
Hi Ros,
I was able to resubmit the run but unfortunately it terminated at the same point as on Sunday morning giving the same messages in the .leave file (xgtec000.xgtec.d12065.t120026.leave).
Lawrence
comment:4 Changed 9 years ago by eelsj
Hello Helpdesk,
Have you made any progress with this problem? I re-ran XGTEC this morning and there was no change in the problem. The .leave file gave the following error message which may be helpful but, unfortunately, I have not been able to decipher it:
xgtec000.xgtec.d12068.t081147.leave
_pmii_daemon(SIGCHLD): [NID 02356] [c5-0c0s5n2] [Thu Mar 8 09:47:59 2012] PE 86 exit signal Segmentation fault
[NID 02356] 2012-03-08 09:47:59 Apid 1758204: initiated application termination
xgtec: Run failed
Thank you,
Lawrence
comment:5 Changed 9 years ago by willie
- Owner changed from um_support to willie
- Status changed from new to accepted
Hi Lawrence,
Your job runs for 378,072 time steps and seems to be converging normally. However, at this time step you are also getting,
TS=126001 YEAR= 14.58 DAY=210.0 ENERGY= 1.189785E+01 DTEMP= 5.977087E-08 DSALT= 6.600719E-12 SCANS= 625
Global Net CO2 Flux into ocean (GtC/yr) -1.8784164886836388
Global Net CO2 Flux into ocean - 2nd C NaN
and the NaNs continue for the rest of the run. NaNs (Not a Number) result from a variety of causes, including poor or invalid input data, incorrect or unstable algorithms, and faults elsewhere in the model. So I am guessing the ocean model has become unstable.
Regards,
Willie
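One way to find where the NaNs first appear is to scan the model output for the first line containing NaN and report the most recent `TS=` value seen before it. The sketch below assumes the output follows the layout quoted in this comment; the function name `first_nan` is hypothetical.

```python
import re

# Matches the time step counter at the start of a diagnostics line, e.g. "TS=126001".
TS_PATTERN = re.compile(r"\bTS=\s*(\d+)")
# Matches a standalone NaN token in any letter case.
NAN_PATTERN = re.compile(r"\bNaN\b", re.IGNORECASE)


def first_nan(lines):
    """Return (time_step, text) for the first line containing NaN, or None.

    time_step is the last TS= value seen at or before that line; it is None
    when no TS= line came before the NaN.
    """
    last_ts = None
    for line in lines:
        match = TS_PATTERN.search(line)
        if match:
            last_ts = int(match.group(1))
        if NAN_PATTERN.search(line):
            return last_ts, line.strip()
    return None
```

Running this over the whole output file shows whether the NaNs start at the step where the run fails or much earlier, which helps separate a late crash from an instability that has been growing for some time.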
comment:6 Changed 9 years ago by willie
- Keywords ocean, instability added
- Resolution set to fixed
- Status changed from accepted to closed
Hi Lawrence,
First off, could you check that you can log in to HECToR from PUMA (ssh phase3.hector.ac.uk -l <userid>) without needing to enter a password? If you can't, your ssh-agent has stopped working correctly, which would explain the hanging you're seeing with the UMUI. If this is the case then you'll need to fix the ssh problem first.
1) Remove your $HOME/.ssh/environment.puma file on PUMA.
2) Log out and back in to PUMA. You should see a message something like "Reinitialising ssh-agent".
3) Run the command 'ssh-add' and you should be prompted for a passphrase.
4) Try logging into HECToR from PUMA and check that no password or passphrase is required.
Regards,
Ros.
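The passwordless-login check in step 4 can also be done non-interactively. This is a minimal sketch, assuming OpenSSH is installed on PUMA: the `BatchMode=yes` option makes ssh fail rather than prompt for a password or passphrase, so a zero exit status means the agent is supplying the key. The function names are hypothetical.

```python
import subprocess


def build_ssh_check_command(host, userid, timeout_s=10):
    """Build an ssh command that runs `true` remotely and never prompts."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # fail instead of asking for a password
        "-o", f"ConnectTimeout={timeout_s}",  # do not wait indefinitely on connect
        "-l", userid,
        host,
        "true",
    ]


def can_login_without_password(host, userid, timeout_s=10):
    """Return True if ssh exits with status 0 without any prompt.

    Raises FileNotFoundError if the ssh client is not installed.
    """
    result = subprocess.run(
        build_ssh_check_command(host, userid, timeout_s),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, `can_login_without_password("phase3.hector.ac.uk", "<userid>")` with your own user id should return True once the ssh-agent has been reinitialised and `ssh-add` has been run.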