Opened 3 months ago

Closed 3 months ago

#3034 closed help (answered)

Regular submit failures

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: NEXCS
UM Version: 10.7

Description

Hi,

Might there be a problem with nexcs (specifically xcslc1) at the moment? For the last few days my postprocessing apps (e.g. postproc_atmos, postproc_nemo, postproc_cice and pptransfer) have all been failing at the submission stage, usually overnight. As soon as I trigger them myself, they run fine. Why are they not automatically submitting?

Thanks,

Charlie

Change History (12)

comment:1 Changed 3 months ago by ros

Hi Charlie,

Can you please give us the suite id(s)?

Cheers,
Ros.

comment:2 Changed 3 months ago by charlie

Sorry, yes of course: u-bm944, u-bm949, u-bm812, u-bn402 and u-bm296. They are all exhibiting the same symptom.

And yes, before I get a telling off for running too many suites at once, I realise I am running a lot, and I very much hope to kill at least 2 within the next couple of days.

Charlie

comment:3 Changed 3 months ago by ros

Hi Charlie,

It's the _mkstemp error (see log/suite/log), caused by the host being set to xcs-c. cylc doesn't like a suite ssh'ing from the xcs to the xcs and will frequently fail to make the temp directory. In the [[HPC]] section of site/meto_cray.rc, please change

host = $(rose host-select {{ HOST_XC40 }})

to

host = localhost

Then reload the suite:

rose suite-run --reload
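
To check whether a given suite is hitting this error, its suite log can be scanned for the string. The following is only a rough sketch: the function names are hypothetical, the `~/cylc-run/<suite>/log/suite/log` location is an assumption about where the log/suite/log file lives, and it simply looks for the literal text `mkstemp` rather than parsing any particular log format.

```python
from pathlib import Path


def find_mkstemp_lines(log_text: str) -> list[str]:
    """Return the log lines that mention mkstemp."""
    return [line for line in log_text.splitlines() if "mkstemp" in line]


def scan_suite(suite_id: str, run_root: Path = Path.home() / "cylc-run") -> list[str]:
    """Scan one suite's log/suite/log file.

    The run_root default is an assumption about the run directory layout;
    pass a different root if the logs live elsewhere.
    """
    log_path = run_root / suite_id / "log" / "suite" / "log"
    if not log_path.is_file():
        return []
    return find_mkstemp_lines(log_path.read_text(errors="replace"))


if __name__ == "__main__":
    # Suite ids taken from this ticket.
    for suite_id in ["u-bm944", "u-bm949", "u-bm812", "u-bn402", "u-bm296"]:
        hits = scan_suite(suite_id)
        print(f"{suite_id}: {len(hits)} line(s) mentioning mkstemp")
```

A suite that reports zero lines either has not hit the error or keeps its log somewhere other than the assumed path.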
Cheers,
Ros.

comment:4 Changed 3 months ago by charlie

Okay, I will do that now, many thanks.

But why has this only become an issue over the last couple of days, whereas it didn't happen once last week or the week before?

Charlie

comment:5 Changed 3 months ago by charlie

Further to this… I have now done as you suggested, and they are all queueing or running again. However, that can't be the only reason, because when I checked the meto_cray.rc files, 3 of them (namely u-bm812, u-bn402 and u-bm296) already had host = localhost, but were still suffering from the same symptom.

Charlie

comment:6 Changed 3 months ago by ros

Hi Charlie,

I don't know why it's reared its head now; it is an intermittent problem that affects some people some of the time. They do not know why.

  • u-bn402 was the one suite I looked at and you have modified the meto_cray.rc file since then….
  • u-bm812, u-bm944 & u-bm949 are set up slightly differently. I recall from a previous query we were trying to get the build to go through and the host in several places…. It is still submitting the postproc & pptransfer tasks to xcs-c. Look in the cylc GUI and you can see which host each task is running on. If these continue to have problems with the mkstemp error, please change in the [[HPC_SERIAL]] section

host = xcs-c to host = localhost

  • u-bm296 failed to submit the coupled.18700101T0000Z with the mkstemp error and the log file says this was to host=xcs-c. I can see you have edited the meto_cray.rc file at 11:07 today, perhaps you had already changed this one…
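
Since the host is set in more than one section of these files, it may help to list every host setting in a meto_cray.rc in one go instead of checking each section by eye. The sketch below is illustrative only: `find_hosts` and `remote_hosts` are hypothetical helpers, the sample text is made up to resemble the settings discussed in this ticket (including the assumed `[[[remote]]]` nesting), and the parsing is a simple line match that ignores Jinja2 logic and include files.

```python
import re

# Matches section headings such as [runtime], [[HPC]] or [[[remote]]].
SECTION_RE = re.compile(r"^\s*(\[+)([^\[\]]+)(\]+)\s*$")
# Matches uncommented "host = value" lines.
HOST_RE = re.compile(r"^\s*host\s*=\s*(.+?)\s*$")


def find_hosts(rc_text: str) -> list[tuple[str, str]]:
    """Return (double-bracket section name, host value) pairs."""
    results = []
    family = None
    for line in rc_text.splitlines():
        section = SECTION_RE.match(line)
        if section:
            depth = len(section.group(1))
            if depth == 1:
                family = None  # top-level heading such as [runtime]
            elif depth == 2:
                family = section.group(2).strip()  # e.g. HPC, HPC_SERIAL
            # Deeper headings belong to the current double-bracket section.
            continue
        host = HOST_RE.match(line)
        if host and family is not None:
            results.append((family, host.group(1)))
    return results


def remote_hosts(rc_text: str) -> list[tuple[str, str]]:
    """Return only the sections whose host is not localhost."""
    return [(fam, host) for fam, host in find_hosts(rc_text) if host != "localhost"]


if __name__ == "__main__":
    # Made-up sample resembling the settings discussed in this ticket.
    sample = """
[runtime]
    [[HPC]]
        [[[remote]]]
            host = localhost
    [[HPC_SERIAL]]
        [[[remote]]]
            host = xcs-c
"""
    for family, host in remote_hosts(sample):
        print(f"[[{family}]] still submits to {host}")
```

Any section the sketch reports is a candidate for the host = localhost change, though the cylc GUI remains the authoritative view of where tasks actually ran.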

Regards,
Ros.

comment:7 Changed 3 months ago by charlie

Thanks Ros, yes I changed both u-bn402 and u-bm296 this morning, since you looked at them, so they should be okay now. If the other 3 continue to have the same problem, I will change that line as you suggest.

Thanks for your help,

Charlie

comment:8 Changed 3 months ago by charlie

Hi again Ros,

I did what you suggested in the comment above, because the problem persisted with the other 3 suites as well.

However, is it possible that this has had an impact on my queueing time? I only ask because, today, the queueing time has dramatically increased. Over the last several weeks, my queueing time has been only a couple of minutes, yet all 5 of my suites have now been stuck in the queue since around 10 PM last night. I appreciate that the queueing time depends on other users, but it seems very coincidental that it should suddenly get so much worse. Or is there some other problem with the machine?

Charlie

comment:9 Changed 3 months ago by charlie

Further to the above… there must be something wrong with the machine, because not only have ALL my suites suddenly shut down, but I now cannot restart them. No doubt we will get the usual message imminently.

Charlie

comment:10 Changed 3 months ago by ros

Hi Charlie,

There is maintenance on Monsoon/NEXCS today from 12:00.

Please see the announcements on the Met Office Yammer. (I've copied part of it below)

Your suites are still sitting in the queue and will continue to run when the machine comes back.

Regards,
Ros


Dear All,

due to the ongoing need to identify and remove corrupt data from MASS, there will be no MOOSE client upgrade this week (originally planned for 10:00 Weds 9th October).

The file system work on XCS though will still go ahead, so Monsoon and NEXCS will still be unavailable from 12:00 BST onwards on Weds 9th, with the aim that the service will be back late in the afternoon/early evening.

comment:11 Changed 3 months ago by charlie

Many apologies, I read the first half of that announcement when it was made, but clearly not the 2nd. Sorry.

comment:12 Changed 3 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed