Opened 5 years ago
Closed 5 years ago
#1963 closed help (answered)
Run failure
| Reported by: | simon.tett | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | |
| Cc: | | Platform: | ARCHER |
| UM Version: | 8.5 | | |
Description
Hi,
A couple of my runs failed last night with the following error:
aprun: Apid 23028636: close of the compute node connection after app startup barrier (local fd 9, port 47523)
aprun: Apid 23028636: Exiting due to errors. Application aborted
The error message is interspersed with output from archiving.
See ~stett2/output/xmvpa012.xmvpa.d16244.t000008.leave on archer.
What is the cause, and how can I avoid it?
Change History (2)
comment:1 Changed 5 years ago by ros
comment:2 Changed 5 years ago by grenville
- Resolution set to answered
- Status changed from new to closed
Hi Simon,
Please see the user mailing from ARCHER.
Regards,
Ros
Thursday 1st September, mom Node Failures
Dear Users,
Unfortunately we experienced two mom node failures yesterday and then a further two overnight.
This means that some running jobs will have failed and may still be within the queue system in a hung state. Our system teams are investigating the cause and further information will be provided when available on the ARCHER status page (https://www.archer.ac.uk/status/) and via user mailings.
Failing jobs do not appear to have been charged but if you think your jobs have been charged, then please contact the helpdesk and we can arrange for a refund to be applied to your project. You can check your usage by following the instructions found at: https://www.archer.ac.uk/documentation/safe-guide/safe-guide-users.php#uhist.
We apologise for the inconvenience caused and please contact the helpdesk if you require any assistance.
Regards,
The ARCHER Helpdesk Team
support@…