Opened 10 months ago

Closed 9 months ago

#3292 closed error (answered)

jobs failing

Reported by: luciad
Owned by: um_support
Component: PUMA
Keywords:
Cc:
Platform: ARCHER
UM Version: 11.2

Description

Hello,

I am having issues with submitting jobs from pumatest.
I have changed the host in archer.rc to login.archer.ac.uk, so that is not the issue.

I noticed that I have to reinitialize the host every day, and even then I get submit-failed or submit-retrying.

After they are submitted and running, several jobs fail with the following error, which I'm guessing is linked to a failed connection between pumatest and ARCHER:
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: U_MODEL_4A
? Error message: ACUMPS1: Partial sum file inconsistent. See Output
? Error from processor: 48
? Error number: 28
????????????????????????????????????????????????????????????????????????????????

I didn't have these issues before ARCHER and pumatest went offline, and I have followed all the instructions issued since, so I would really appreciate a hand getting my runs to work properly.

Thank you for your help,
Lucia
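
When several ensemble members fail, it can help to pull the fields out of the error banner quoted above so that failures can be compared across job logs. The following is only a minimal sketch: the function name parse_um_error is hypothetical (it is not a UM or Rose/cylc tool), and the banner layout is assumed to match the paste in this ticket, with each field on a line of the form "? Error <name>: <value>".

```python
import re

# Assumed layout, taken from the banner pasted in this ticket:
# "? Error <name>: <value>"
BANNER_FIELD = re.compile(r"^\?\s*(Error [^:]+):\s*(.*)$")


def parse_um_error(text):
    """Collect the 'Error ...: value' fields from a UM-style error banner."""
    fields = {}
    for line in text.splitlines():
        match = BANNER_FIELD.match(line.strip())
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


# The banner exactly as reported in the ticket description.
sample = """\
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: U_MODEL_4A
? Error message: ACUMPS1: Partial sum file inconsistent. See Output
? Error from processor: 48
? Error number: 28
????????????????????????????????????????????????????????????????????????????????
"""

report = parse_um_error(sample)
print(report["Error from routine"], "-", report["Error message"])
```

In practice the text would come from a job's error log rather than an inline string. Here the routine and message fields already point at the partial sum files rather than at the connection.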

Change History (6)

comment:1 Changed 10 months ago by ros

Hi Lucia,

What is the suite id, please? And in which cycle did this occur?

The ssh connection to ARCHER will only survive until around 5am, so you will need to log in again each morning to reconnect; the suite will then pick up automatically. This is only a workaround, and we are working with ARCHER on a proper solution.

Regards,
Ros.

comment:2 Changed 10 months ago by luciad

Hi Ros,

The suite id is u-bv018 and the failure happened in atmos_main (e.g. /work/n02/n02/luciad/cylc-run/u-bv018/log/job/20170601T0000Z/atmos_main080). I've submitted the failed members again now.

So there is no need to run ~um/um-training/setup-archer-hosts?
I think I tried without it once and it didn't work, but I will test it tomorrow with only ssh-add and ssh -X login.archer.ac.uk.

Lucia

comment:3 Changed 10 months ago by grenville

Lucia

The problem is in the climate meaning - can you switch off climate meaning for this ensemble member?

If not, you will need to back up and rerun atmos_main080 in a previous cycle to regenerate the partial sum files - or start a new run for that ensemble member from 20170601T0000Z.

Grenville

comment:4 Changed 10 months ago by luciad

Hi Grenville,

Thank you for looking into it.
Is it possible that this happened because the previous cycle failed to poll at the end of the run? I think I forced it because the run froze.

I've unchecked the climate meaning and reloaded the suite to see if your suggestion will fix the issues.

Regards,
Lucia

comment:5 Changed 10 months ago by grenville

Lucia

I'm guessing that the failure of the first try, as a result of exceeding the wallclock time, left the partial sum file in a bad state (just a guess, though).

Grenville

comment:6 Changed 9 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

closed through inactivity
