Opened 8 weeks ago

Closed 6 weeks ago

#3182 closed help (fixed)

Monsoon suite restart failure

Reported by: grenville
Owned by: um_support
Component: UM Model
Keywords: Rose Cylc monsoon
Cc:
Platform: Monsoon2
UM Version:

Description

E-mail exchanges copied here for reference (most recent first):

I have tried what you suggested with a symlink and it has restarted OK without
the disk I/O error. A pptransfer job has been resubmitted.

Hi Jonathan

I'd assumed you were in nexcs-n02 only, but I guess your "default" group is
ukesm (I'd not thought to ls -l in your cylc-run directory).

gmslis@xcslc0:/home/d01/hadsa/cylc-run> ls -la
total 8
drwxr-xr-x 2 hadsa mo_users 4096 Jan 16 14:37 .
drwxr-xr-x 14 hadsa mo_users 4096 Feb 4 08:08 ..
lrwxrwxrwx 1 hadsa mo_users 38 Jan 16 14:37 u-bq683 -> /projects/ukesm/hadsa/cylc-run/u-bq683

Monsoon has chosen ukesm for you. I'm not sure what to suggest - I have no
control over the ukesm project, but maybe its disk allocation could be
increased and you could continue as is.

You can change /home/d01/hadsa/.metomi/rose.conf to specify where the suite
should write; for example, add

[rose-suite-run]
root-dir=*=/projects/nexcs-n02/hadsa

to write to nexcs-n02.
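
To check that Rose picks the setting up, rose config can print the merged
site/user config value (it should echo the line just added):

xcslc0$ rose config rose-suite-run root-dir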

However, after this change alone the suite will not restart, because it will
be looking for the run directory in the wrong place (the existing data is
still under /projects/ukesm).

A possible way forward might be to "mv"
/projects/ukesm/hadsa/cylc-run/u-bq683 to
/projects/nexcs-n02/hadsa/cylc-run/u-bq683, point the u-bq683 link in the
cylc-run directory at the new location (having made the change in rose.conf
above), then run rose suite-run --restart.

I've never done this before, so am not aware of pitfalls.
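
For reference, the move-and-relink steps might look like this (a sketch only,
using the paths from this ticket):

# move the run directory from ukesm to nexcs-n02
mkdir -p /projects/nexcs-n02/hadsa/cylc-run
mv /projects/ukesm/hadsa/cylc-run/u-bq683 /projects/nexcs-n02/hadsa/cylc-run/u-bq683
# repoint the symlink in $HOME/cylc-run (-n replaces the link itself)
ln -sfn /projects/nexcs-n02/hadsa/cylc-run/u-bq683 /home/d01/hadsa/cylc-run/u-bq683
# restart from the suite directory
cd /home/d01/hadsa/roses/u-bq683
rose suite-run --restart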

Failing either of these, I can only suggest starting a new suite in
nexcs-n02.

:

The data is being written to /projects/ukesm.

I think this page is telling me that ukesm is at, or very near, the
quota limit:

This suite failed with a disk I/O error and now refuses to restart (with the same error):
xcslc0$ pwd
/home/d01/hadsa/roses/u-bq683
xcslc0$ rose suite-restart
[FAIL] cylc restart u-bq683 # return-code=1, stderr=
[FAIL] Traceback (most recent call last):
[FAIL]   File "/common/fcm/cylc-7.8.3/bin/cylc-restart", line 25, in <module>
[FAIL]     main(is_restart=True)
[FAIL]   File "/common/fcm/cylc-7.8.3/lib/cylc/scheduler_cli.py", line 134, in main
[FAIL]     scheduler.start()
[FAIL]   File "/common/fcm/cylc-7.8.3/lib/cylc/scheduler.py", line 237, in start
[FAIL]     self.suite_db_mgr.restart_upgrade()
[FAIL]   File "/common/fcm/cylc-7.8.3/lib/cylc/suite_db_mgr.py", line 524, in restart_upgrade
[FAIL]     pri_dao.vacuum()
[FAIL]   File "/common/fcm/cylc-7.8.3/lib/cylc/rundb.py", line 1031, in vacuum
[FAIL]     return self.connect().execute("VACUUM")
[FAIL] sqlite3.OperationalError: disk I/O error
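
For context: on restart, cylc runs VACUUM on the suite database, and sqlite
rebuilds the database in a temporary copy, so it needs roughly the database's
size in free space; with /projects/ukesm full, the restart fails with the same
disk I/O error. To see the database in question (assuming the standard cylc-7
run-directory layout, where the private DB lives under .service):

xcslc0$ ls -lh /home/d01/hadsa/cylc-run/u-bq683/.service/db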

Change History (1)

comment:1 Changed 6 weeks ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed