Opened 3 years ago

Closed 3 years ago

#2347 closed help (fixed)

Postproc/archiving to MASS errors

Reported by: pmcjs
Owned by: ros
Component: Archiving
Keywords: MASS, Moose, postproc
Cc:
Platform: Monsoon2
UM Version: 10.7


Hello Helpdesk,

Since yesterday morning the archiving of my jobs on Monsoon has been failing. I think MASS was down briefly in the morning: postproc tried to archive my output but couldn't. When it retried, the command failed with TSSC_CONFLICT_WITH_EARLIER_COMMAND, which I assume is because the first postproc command had already been sent. Here's an example from my job u-as061:

[chrsm@exvmsrose:~/cylc-run/u-as061/log/job/19300801T0000Z/postproc/01]$ less job.err

[WARN]  [SUBPROCESS]: Command: moo test -sw moose:crum/u-as061
[SUBPROCESS]: Error = 5:
        test: (failed with code ERROR_SERVICE_UNAVAILABLE) service is accepting commands from admin only.
//////////////////////////////////////////////////////////////////////// Accepting commands from Administrators only
test: failed (5)

[WARN]  [SUBPROCESS]: Command: moo mkset -v moose:crum/u-as061
[SUBPROCESS]: Error = 5:
        mkset (attempt 1 of 10): (failed with code ERROR_SERVICE_UNAVAILABLE) service is accepting commands from admin only.

… and several more similar lines detailing each failed attempt.

The postproc app retried 9 times before finally failing. Each time the job.err showed something similar to this:

[chrsm@exvmsrose:~/cylc-run/u-as061/log/job/19300801T0000Z/postproc/09]$ less job.err

[WARN]  [SUBPROCESS]: Command: moo put -f -vv /home/d04/chrsm/cylc-run/u-as061/share/data/History_Data/as061a.pm1930aug.pp moose:crum/u-as061/apm.pp
[SUBPROCESS]: Error = 2:
        put command-id=469618317 failed: (SSC_TASK_REJECTION) one or more tasks are rejected.
  /home/d04/chrsm/cylc-run/u-as061/share/data/History_Data/as061a.pm1930aug.pp -> moose:/crum/u-as061/apm.pp/as061a.pm1930aug.pp: (TSSC_CONFLICT_WITH_EARLIER_COMMAND) command conflicts with another command.
  Conflicting command-ids: 469332925,
put: failed (2)

[WARN] Moose Error: user-error (see Moose docs). (ReturnCode=2) File: /home/d04/chrsm/cylc-run/u-as061/share/data/History_Data/as061a.pm1930aug.pp
[FAIL] - PostProc complete. Exiting with errors in atmos_archive
[FAIL] Terminating PostProc...
[FAIL] atmos # return-code=1
2017-12-20T08:53:13Z CRITICAL - Task job script received signal EXIT
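
For anyone tracing a similar failure, the command ids named in the TSSC_CONFLICT_WITH_EARLIER_COMMAND message can be pulled out of job.err with a short script. This is only a minimal sketch: it assumes the "Conflicting command-ids:" line always looks like the excerpt above, and the function name conflicting_command_ids is hypothetical, not part of postproc or Moose.

```python
import re

# Pattern assumed from the job.err excerpt above, e.g.
#   "  Conflicting command-ids: 469332925,"
# Only digits, commas, spaces and tabs are captured, so the match
# never runs on into the following log line.
CONFLICT_LINE = re.compile(r"Conflicting command-ids:([\d, \t]+)")


def conflicting_command_ids(log_text):
    """Return the sorted, de-duplicated command ids found in log_text."""
    ids = set()
    for match in CONFLICT_LINE.finditer(log_text):
        for token in match.group(1).split(","):
            token = token.strip()
            if token:  # skip the empty piece after a trailing comma
                ids.add(int(token))
    return sorted(ids)
```

To use it, read a job.err file into a string and pass it to the function. Comparing the ids returned across the retry directories (01 to 09) would show whether every retry was blocked by the same earlier command.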

The Moose docs suggest trying moo cstat, but that returns an "unknown application" error for me.

Would upgrading to postproc 2.1 solve this?


Change History (5)

comment:1 Changed 3 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Chris,

Can you try running moo ls moose:/crum/u-as061 to make sure that the set was created properly before postproc tried to archive anything? From a quick look I can't see that the set was created successfully, and I get permission denied when I try to ls it myself.


comment:2 Changed 3 years ago by pmcjs

Hi Ros,

The data set exists:

chrsm@xcslc0:~> moo ls moose:/crum/u-as061

Prior to the first time it failed, it managed to archive 20 years of model output successfully.

Strangely though, two of my three jobs (at524 and at525) seem to have started running again after a similar archiving failure. The original job (as061) has now failed completely. Should I try restarting it from Rose?


comment:3 Changed 3 years ago by ros

Hi Chris,

Yes, try restarting it.


comment:4 Changed 3 years ago by pmcjs

Hi Ros,

I restarted from one of the non-archived startdumps and everything seems to be working now. You can close the ticket, thanks. I wonder if this was some random glitch with the archiving that prevented it working originally.


comment:5 Changed 3 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed