Opened 7 months ago

Closed 6 months ago

Last modified 3 months ago

#3335 closed help (answered)

wallclock exceeded

Reported by: eelrm
Owned by: ros
Component: UKESM
Keywords:
Cc:
Platform: ARCHER
UM Version: 11.2

Description

Hello,

I'm running several suites that are exceeding their walltime for various tasks, e.g. pptransfer (u-bv020, u-bu651) and postproc_cice (u-bv669). Previous cycles ran within the same allocated walltime. Should I increase the walltime, or is this a load problem?

Thanks,
Lauren

Change History (13)

comment:1 Changed 7 months ago by eelrm

Hello,

Just to add that the pptransfer task shows as running in the GUI, but the rsync never appears to start. I have managed to get a few cycles through just by re-triggering, but most of them continue to fail.

Thanks,

Lauren

comment:2 Changed 7 months ago by ros

Hi Lauren,

I've found a few disk quota error messages from pptransfer in u-bu651:

[FAIL] [Errno 122] Disk quota exceeded

and

[FAIL] disk I/O error

Can you check that you have enough space on JASMIN GWS? Your ARCHER quota looks ok.

I'll take a look at postproc_cice in a bit.

Cheers,
Ros.

comment:3 Changed 7 months ago by eelrm

Hi Ros,

Thank you. I managed to get the postproc_cice through after several attempts. I had exceeded my /work quota previously, but it shouldn't be a problem now. Jasmin GWS is also fine.

Thanks,

Lauren

comment:4 Changed 7 months ago by eelrm

Hi Ros, I'm also finding that my ARCHER key doesn't persist for the whole day any more. Could this be contributing?

Lauren

comment:5 Changed 7 months ago by ros

Hi Lauren,

Is this the connection from PUMA to ARCHER? If so, that won't affect tasks that are already running on ARCHER. I think Grenville has already been in touch about generating a separate key for UM submissions, which should help with connections persisting.

Can you also check that your connection from espp01 and espp02 to the jasmin-xfer* node is still ok?
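One way to test this without being prompted is sketched below. It is only an illustration: the helper names are hypothetical, the host argument is a placeholder for whichever transfer node you use, and it assumes the standard OpenSSH client is on the PATH. `BatchMode=yes` makes ssh fail rather than ask for a password or passphrase, so a non-zero exit suggests the key or agent is not set up.

```python
import subprocess


def build_ssh_check_command(host, timeout_s=10):
    """Build an ssh command that fails instead of prompting for credentials.

    `host` is a placeholder supplied by the caller (hypothetical helper).
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt for password/passphrase
        "-o", f"ConnectTimeout={timeout_s}",  # give up if the host is unreachable
        host,
        "true",                               # trivial remote command
    ]


def can_login_without_prompt(host, timeout_s=10):
    """Return True if a non-interactive ssh login to `host` succeeds."""
    result = subprocess.run(
        build_ssh_check_command(host, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Run from espp01 or espp02, `can_login_without_prompt(host)` returning `False` would point at the key or agent rather than at the transfer itself.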

Cheers,
Ros.

comment:6 Changed 7 months ago by eelrm

Hi Ros,

Ah of course. Yes, I can log in without password/passphrase from espp* to xfer*.

Thanks,
Lauren

comment:7 Changed 6 months ago by eelrm

Hi Ros,

I'm still having trouble with this. I have some new suites where the pptransfer has worked fine (e.g. u-bw758), but the original suites are still getting stuck (u-bu651, u-bv669, u-bv020). Can I do the transfer manually?

Thanks,

Lauren

comment:8 Changed 6 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Lauren,

I can see the problem now: the cycles that are "stuck" contain much larger amounts of data than other cycles. For example, u-bu651 has 737GB to transfer, which it won't manage in 30 minutes. Most cycles seem to be of the order of 50GB. Looking at the data, there are some dodgy files:

683G /work/n02/n02/eelrm/archive/u-bu651/21991001T0000Z/cice_bu651i_1d_21991001-21991101.nc

1.1T /work/n02/n02/eelrm/archive/u-bv669/20811001T0000Z/cice_bv669i_1d_20811001-20811101.nc

214G /work/n02/n02/eelrm/archive/u-bv020/21960101T0000Z/bv020a.da21960101_00

We have seen this on the odd occasion before where files end up much bigger for some reason. I would move these files out of the way and then retrigger the failed transfers.
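If it helps to spot such files before a transfer is triggered, a minimal sketch along these lines could be used. The function name and the size threshold are assumptions for illustration, not part of the suite or of postproc; it simply walks a directory tree and reports files above a given size.

```python
import os


def find_oversized_files(root, threshold_bytes):
    """Return (size, path) pairs for files under `root` larger than
    `threshold_bytes`, largest first. Hypothetical helper."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > threshold_bytes:
                oversized.append((size, path))
    return sorted(oversized, reverse=True)
```

For example, `find_oversized_files("/work/n02/n02/eelrm/archive/u-bu651", 100 * 1024**3)` would list anything over roughly 100 GiB in that archive; the threshold is arbitrary and should be chosen relative to the usual cycle size.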

Regards,
Ros.

comment:9 Changed 6 months ago by eelrm

Hi Ros,

Thanks - I've deleted those files and the transfer has succeeded.

Lauren

comment:10 Changed 6 months ago by ros

  • Resolution set to answered
  • Status changed from accepted to closed

Hi Lauren,

That's great. Thanks for letting us know.

I'll close this ticket now, but if you have further problems do let us know.

Regards,
Ros.

comment:11 Changed 3 months ago by eelrm

Hi Ros,

I'm having a few problems re-adding my Jasmin key on espp2. I can't get the ssh-setup to run to initialize a new SSH session, either manually or from my .profile or .bashrc. I managed to get it to work on espp1 by moving the ssh-setup code to my .bashrc from .profile, but am a little confused as to what the difference is and whether I should have done this. A job that has failed the pptransfer with permission denied is bz512.

Thanks,
Lauren

comment:12 Changed 3 months ago by ros

Hi Lauren,

Try removing the file ~/.ssh/environment.esPP002, then log out and back into espp2; it should then initialize a new ssh agent.

.profile and .bashrc are always a little confusing and, to be totally honest, I can never remember which one gets called under which circumstances. I would probably leave it where it was, in your .profile.

Regards,
Ros.

comment:13 Changed 3 months ago by eelrm

Hi Ros,

Success. Thanks!

Lauren
