Opened 3 years ago
Closed 3 years ago
#2347 closed help (fixed)
Postproc/archiving to MASS errors
Reported by: | pmcjs | Owned by: | ros |
---|---|---|---|
Component: | Archiving | Keywords: | MASS, Moose, postproc |
Cc: | | Platform: | Monsoon2 |
UM Version: | 10.7 | | |
Description
Hello Helpdesk,
Since yesterday morning I have been having problems archiving my jobs on Monsoon. I think MASS was down briefly in the morning: postproc tried to archive my output but couldn't. When it retried, the command failed (TSSC_CONFLICT_WITH_EARLIER_COMMAND), apparently because the first postproc command had already been sent. Here's an example from my job u-as061:
[chrsm@exvmsrose:~/cylc-run/u-as061/log/job/19300801T0000Z/postproc/01]$ less job.err
[WARN] [SUBPROCESS]: Command: moo test -sw moose:crum/u-as061
[SUBPROCESS]: Error = 5:
test: (failed with code ERROR_SERVICE_UNAVAILABLE) service is accepting commands from admin only.
////////////////////////////////////////////////////////////////////////
uk.gov.meto.moose.business.requesthandler.service.exceptions.SafeModeException: Accepting commands from Administrators only
////////////////////////////////////////////////////////////////////////
test: failed (5)
[WARN] [SUBPROCESS]: Command: moo mkset -v moose:crum/u-as061
[SUBPROCESS]: Error = 5:
mkset (attempt 1 of 10): (failed with code ERROR_SERVICE_UNAVAILABLE) service is accepting commands from admin only.
////////////////////////////////////////////////////////////////////////
… and several more similar lines detailing each failed attempt.
The postproc app retried 9 times before finally failing. Each time the job.err showed something similar to this:
[chrsm@exvmsrose:~/cylc-run/u-as061/log/job/19300801T0000Z/postproc/09]$ less job.err
[WARN] [SUBPROCESS]: Command: moo put -f -vv /home/d04/chrsm/cylc-run/u-as061/share/data/History_Data/as061a.pm1930aug.pp moose:crum/u-as061/apm.pp
[SUBPROCESS]: Error = 2:
put command-id=469618317 failed: (SSC_TASK_REJECTION) one or more tasks are rejected.
/home/d04/chrsm/cylc-run/u-as061/share/data/History_Data/as061a.pm1930aug.pp -> moose:/crum/u-as061/apm.pp/as061a.pm1930aug.pp: (TSSC_CONFLICT_WITH_EARLIER_COMMAND) command conflicts with another command. Conflicting command-ids: 469332925,
put: failed (2)
[WARN] moo.py: Moose Error: user-error (see Moose docs). (ReturnCode=2) File: /home/d04/chrsm/cylc-run/u-as061/share/data/History_Data/as061a.pm1930aug.pp
[FAIL] main_pp.py - PostProc complete. Exiting with errors in atmos_archive
[FAIL] Terminating PostProc...
[FAIL] main_pp.py atmos # return-code=1
2017-12-20T08:53:13Z CRITICAL - Task job script received signal EXIT
The Moose docs suggest trying moo cstat, but that returns an "unknown application" error.
Would upgrading to postproc 2.1 solve this?
Thanks,
Chris
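For anyone hitting the same error, the command IDs in a job.err like the one quoted above can be pulled out with a short script rather than by eye. The following is only a minimal sketch: it assumes the log wording matches the excerpt in this ticket, and the function name find_conflicts is hypothetical, not part of postproc or Moose.

```python
import re

# Patterns are based only on the job.err excerpt quoted in this ticket;
# other Moose versions may word these messages differently (assumption).
PUT_ID_RE = re.compile(r"put command-id=(\d+) failed")
CONFLICT_RE = re.compile(r"Conflicting command-ids:\s*([\d,\s]+)")


def find_conflicts(log_text):
    """Return (failed_put_ids, conflicting_ids) found in log_text.

    Hypothetical helper for illustration; not part of postproc or Moose.
    """
    failed = [int(m) for m in PUT_ID_RE.findall(log_text)]
    conflicting = []
    for group in CONFLICT_RE.findall(log_text):
        # The ID list is comma separated and may end with a trailing comma.
        conflicting.extend(
            int(tok) for tok in re.split(r"[,\s]+", group) if tok
        )
    return failed, conflicting


# Sample text copied from the postproc/09 job.err excerpt in this ticket.
sample = (
    "[SUBPROCESS]: Error = 2: put command-id=469618317 failed: "
    "(SSC_TASK_REJECTION) one or more tasks are rejected. "
    "(TSSC_CONFLICT_WITH_EARLIER_COMMAND) command conflicts with another "
    "command. Conflicting command-ids: 469332925, put: failed (2)"
)

failed_ids, conflicting_ids = find_conflicts(sample)
print("Failed put command IDs:", failed_ids)
print("Conflicting earlier command IDs:", conflicting_ids)
```

The conflicting IDs identify the earlier command that the retried put collided with, which is the information to quote when asking for help.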
Change History (5)
comment:1 Changed 3 years ago by ros
- Owner changed from um_support to ros
- Status changed from new to accepted
comment:2 Changed 3 years ago by pmcjs
Hi Ros,
The data set exists:
chrsm@xcslc0:~> moo ls moose:/crum/u-as061
moose:/crum/u-as061/ada.file
moose:/crum/u-as061/apd.pp
moose:/crum/u-as061/apm.pp
Prior to the first time it failed, it managed to archive 20 years of model output successfully.
Strangely though, two of my three jobs (at524 and at525) seem to have started running again after a similar archiving failure. The original job (as061) has now failed completely. Should I try restarting this job from Rose?
Thanks,
Chris
comment:3 Changed 3 years ago by ros
Hi Chris,
Yes, try restarting it.
Cheers,
Ros.
comment:4 Changed 3 years ago by pmcjs
Hi Ros,
I restarted from one of the non-archived startdumps and everything seems to be working now. You can close the ticket, thanks. I wonder if this was some random glitch with the archiving that prevented it working originally.
Thanks,
Chris
comment:5 Changed 3 years ago by willie
- Resolution set to fixed
- Status changed from accepted to closed
Hi Chris,
Can you try running moo ls moose:/crum/u-as061 to make sure that the set got created properly before it tried to archive anything? From my quick look I can't see that it created it successfully and I get permission denied when trying to ls it myself.
Regards,
Ros.