Opened 3 years ago

Closed 3 years ago

#1963 closed help (answered)

Run failure

Reported by: simon.tett
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.5

Description

Hi,

I have had a couple of runs fail last night with the following error:

aprun: Apid 23028636: close of the compute node connection after app startup barrier (local fd 9, port 47523)

aprun: Apid 23028636: Exiting due to errors. Application aborted

The error message is interspersed with output from archiving.

See ~stett2/output/xmvpa012.xmvpa.d16244.t000008.leave on archer.

What is the cause, and how can I avoid it?

Change History (2)

comment:1 Changed 3 years ago by ros

Hi Simon,

Please see the user mailing from ARCHER.

Regards,
Ros


Thursday 1st September, MOM Node Failures

Dear Users,

Unfortunately, we experienced two MOM node failures yesterday and then a further two overnight.

This means that some running jobs will have failed and may still be within the queue system in a hung state. Our system teams are investigating the cause and further information will be provided when available on the ARCHER status page (https://www.archer.ac.uk/status/) and via user mailings.

Failing jobs do not appear to have been charged but if you think your jobs have been charged, then please contact the helpdesk and we can arrange for a refund to be applied to your project. You can check your usage by following the instructions found at: https://www.archer.ac.uk/documentation/safe-guide/safe-guide-users.php#uhist.

We apologise for the inconvenience caused and please contact the helpdesk if you require any assistance.

Regards,

The ARCHER Helpdesk Team
support@…

comment:2 Changed 3 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed