Opened 7 months ago

Closed 7 months ago

#3143 closed help (fixed)

u-bq683

Reported by: jonathan
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform:
UM Version:

Description

Dear helpdesk

I am running my first cylc job! It is a HadGEM3 FAFMIP experiment u-bq683 set up for me by Matt Couldrey. It ran five years OK (on nexcs, in cycles of three months), then it showed an error and "retrying" in the Coupled step, then vanished. Matt had a look at the log files (which I had not yet learned to find) and says it timed out. That is odd. How should I restart it? I tried "rose restart-suite" from ~/roses/u-bq683 and the command was executed, and some polite information issued, but the job did not restart. Where should I look next?

Thanks for your help

Jonathan

Change History (11)

comment:1 Changed 7 months ago by dcase

If you ssh to the xcs-c node with the -Y option, you should be able to use the graphical interface. If you run cylc gscan then a box should pop up with your suites, which you can click on. If you investigate this, you'll see that you can right click on tasks and 'trigger' them to run them again.
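For reference, the steps above look roughly like this as a terminal session (host and suite names are taken from this ticket; the task's cycle point is a placeholder, and exact cylc commands vary by version):

```shell
# Log in with X11 forwarding so the cylc GUI can display locally
ssh -Y xcs-c

# List running suites graphically; clicking a suite opens its monitor,
# where right-clicking a task offers "Trigger (run now)"
cylc gscan

# Command-line equivalent (cylc 7 syntax); <cycle-point> is a
# placeholder for the task's actual cycle point
cylc trigger u-bq683 'coupled.<cycle-point>'
```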

comment:2 Changed 7 months ago by grenville

Jonathan

I suggest that you increase the wallclock time, then

rose suite-run --restart

and then re-trigger the retrying task (or the failed task).

[the model ran out of wallclock time:

=>> PBS: job killed: walltime 6305 exceeded limit 6300
aprun: Apid 95039843: Caught signal Terminated, sending to application
Application 95039843 is crashing. ATP analysis proceeding…
Terminated]
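The limit exceeded in the log above is the batch wallclock limit. Where it is set varies from suite to suite; in many cylc-7 Rose suites it is the task's execution time limit in suite.rc (a hypothetical fragment; u-bq683 may instead expose it as a rose-suite.conf variable, so check the suite's own configuration):

```ini
[runtime]
    [[coupled]]
        # Hypothetical: raise the wallclock limit. The job was killed
        # at 6305 s against a 6300 s limit, so allow comfortably more.
        execution time limit = PT2H
```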

Grenville

comment:3 Changed 7 months ago by jonathan

Dear both

Thanks. I cannot resubmit because I have run out of disk space. I guess this is /working, where I am using 25 Gbyte. What is the quota?

Cheers

Jonathan

comment:4 Changed 7 months ago by ros

Hi Jonathan,

Can you please post the full output that you get at the command line when you try to restart the suite? As far as I am aware there are no individual quotas on /working.

Cheers,
Ros.

comment:5 Changed 7 months ago by jonathan

Dear Ros

Oh dear, I seem to be having elementary problems now! On nexcs

xcslc0$ pwd
/home/d01/hadsa/roses/u-bq683
xcslc0$ rose suite-run --restart

and it just hangs there. I thought that was the same command I issued yesterday (following Grenville's email above), when it complained of running out of disk space, I think while gzipping old logs or something.

Jonathan

comment:6 Changed 7 months ago by ros

Hi Jonathan,

0900-1100 on Tuesdays is Monsoon's maintenance window, during which there may be some unadvertised disruption. The /projects disk is currently unavailable, which is the cause of the hang you've just experienced. I advise waiting until 11:00 to try again.

Regards,
Ros.

comment:7 Changed 7 months ago by ros

Hi Jonathan,

I have confirmation that there is a Lustre timeout issue affecting xcslc0. xcslc1 is not affected, so I suggest logging into xcslc1 and restarting the suite from there.

Cheers,
Ros.

comment:8 Changed 7 months ago by jonathan

In that case I get

xcslc1$ cd /home/d01/hadsa/roses/u-bq683
xcslc1$ rose suite-run --restart
[FAIL] [Errno 2] No such file or directory: '/home/d01/hadsa/cylc-run/u-bq683/log/rose-suite-run.conf'

It's quite right, there is no such file. What does that mean? Sorry to be so ignorant.

comment:9 Changed 7 months ago by grenville

Hi Jonathan

It appears that something odd happened at 16:43 on Jan 20, at which time a new log directory was created.

Please try this: delete log.20200120T164333Z and the link to it, then link "log" to log.20200116T143744Z. Then, from /home/d01/hadsa/roses/u-bq683, run rose suite-run --restart (and re-trigger the coupled task).
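The relink step can be sketched like this, demonstrated on a scratch directory rather than the live suite (the real "log" link lives under ~/cylc-run/u-bq683; the timestamped directory names are taken from this ticket):

```shell
# Demonstration on a scratch directory; the real paths are under
# ~/cylc-run/u-bq683 (directory names taken from this ticket).
set -e
work=$(mktemp -d)
cd "$work"

# Simulate the good and bad timestamped log directories, plus the
# "log" link, which currently points at the bad one
mkdir log.20200116T143744Z log.20200120T164333Z
ln -s log.20200120T164333Z log

# The repair: remove the bad directory and the link, then point
# "log" back at the good directory
rm -rf log.20200120T164333Z
rm log
ln -s log.20200116T143744Z log

readlink log   # prints log.20200116T143744Z
```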

If that doesn't work, I can only suggest running anew.

Grenville

comment:10 Changed 7 months ago by jonathan

Dear Grenville

That worked, thanks. The coupled task has been resubmitted and is now running:

xcslc0$ qstat
Job id           Name             User    Time Use S Queue
---------------  ---------------  ------  -------- - ------
2225949.xcs00    coupled.1856070  hadsa   0        R normal

This time there were no complaints about disk space being full either. Perhaps that was something to do with the Lustre problem?

Cheers

Jonathan

comment:11 Changed 7 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed