Opened 4 years ago

Closed 4 years ago

#2152 closed help (fixed)

HadGEM2 qsserver failure

Reported by: jonathan Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:


Dear helpdesk

My HadGEM2 job failed overnight with a qsserver failure.
Do you know what happened? It says:

qshector_arch: ERROR ssh to machine espp1 failed
               copying files to /nerc disk failed
               archiving abandoned

so I wonder if it's because of quota in /nerc. I see in SAFE that the n02 rdf quota is nearly full - it is marked in red. But perhaps it's something else? Maybe I need rdf quota specifically for n02-FAFMIP, which apparently I don't have in SAFE?



Change History (7)

comment:1 Changed 4 years ago by jonathan

The job did archive a small number of files to /nerc before crashing, but many of them remained in DATAW=/work/n02/n02/gregoryj/xiqri. I am puzzled. Jonathan

comment:2 Changed 4 years ago by grenville


There is no need for a FAFMIP quota on /nerc - and there is space available on /nerc, so the problem lies elsewhere. There is some strange activity on the serial machines at the moment resulting in jobs running very slowly - why that might cause an ssh failure is not apparent (perhaps not relevant). We are investigating.


comment:3 Changed 4 years ago by willie

Hi Jonathan,

There are some issues with archiving: we're working on this with ARCHER at the moment.

Your build job for xiqri has failed:

r: /work/n02/n02/gregoryj/xiqri/ummodel/tmp/lib__fcm__xiqri.a: No space left on device
fcm_internal load failed (256)
gmake: *** [xiqri.exe] Error 1
gmake -f /work/n02/n02/gregoryj/xiqri/ummodel/Makefile -j 1 -s all failed (2) at /fs2/y07/y07/umshared/software/fcm-2016.12.0/bin/../lib/FCM1/ line 611
Build failed on Tue Apr 18 15:57:37 2017.
->Make: 110 seconds
Model build: failed


comment:4 Changed 4 years ago by jonathan

Dear Willie

Thanks. I presume that I can't run my job until the archiving is fixed, because there is very little space left on /work. Do you expect it to be fixed soon?

I hadn't noticed the build had failed. The job which crashed ran later than that, using an executable I had already made successfully. Thanks for pointing this out. I had other failures too because I had filled my /work quota, which is 100 GiB, I see from SAFE. Is that the right place to look for quotas? The quota command says I don't have quotas.

Best wishes


comment:5 Changed 4 years ago by grenville


I have increased your /work quota to 1 TB (it may take a few hours for ARCHER to process the change).

Please use

lfs quota -uh <yourid> /work

to see your /work quota and usage.
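
If the quota is close to full, it can also help to see which directories under the job's data directory are using the space. The following is only a minimal sketch using standard du and sort. The function name largest_entries is hypothetical and not part of the UM scripts, and the path in the example is the DATAW directory mentioned earlier in this ticket.

```shell
#!/bin/sh
# Hypothetical helper (not part of the UM scripts): list the size of each
# entry directly under a directory, in KiB, largest first.
largest_entries() {
    dir=$1
    # du -sk prints "<KiB> <path>" for each argument;
    # sort -rn orders the lines by size, biggest first.
    du -sk "$dir"/* 2>/dev/null | sort -rn
}

# Example (path taken from this ticket):
#   largest_entries /work/n02/n02/gregoryj/xiqri | head
```

Note that du walks every file, so on a large directory tree it may take a while and adds load on the file system.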


comment:6 Changed 4 years ago by willie

  • Status changed from new to pending

comment:7 Changed 4 years ago by willie

  • Resolution set to fixed
  • Status changed from pending to closed