Opened 4 years ago
Closed 4 years ago
#2152 closed help (fixed)
HadGEM2 qsserver failure
Reported by: | jonathan | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | |
Cc: | | Platform: | |
UM Version: | <select version> | | |
Description
Dear helpdesk
My HadGEM2 job failed overnight with a qsserver failure
/home/n02/n02/gregoryj/um/umui_out/xiqri000.xiqri.d17109.t103411.leave
Do you know what happened? It says
qshector_arch: ERROR ssh to machine espp1 failed copying files to /nerc disk failed archiving abandoned
so I wonder if it's because of quota in /nerc. I see in SAFE that the n02 rdf quota is nearly full - it is marked in red. But perhaps it's something else? Maybe I need rdf quota specifically for n02-FAFMIP, which apparently I don't have in SAFE?
Thanks
Jonathan
Change History (7)
comment:1 Changed 4 years ago by jonathan
comment:2 Changed 4 years ago by grenville
Jonathan
There is no need for a FAFMIP quota on /nerc - and there is space available on /nerc, so the problem lies elsewhere. There is some strange activity on the serial machines at the moment resulting in jobs running very slowly - why that might cause an ssh failure is not apparent (perhaps not relevant). We are investigating.
Grenville
comment:3 Changed 4 years ago by willie
Hi Jonathan,
There are some issues with archiving: we're working on this with ARCHER at the moment.
Your build job for xiqri has failed:
r: /work/n02/n02/gregoryj/xiqri/ummodel/tmp/lib__fcm__xiqri.a: No space left on device
fcm_internal load failed (256)
gmake: *** [xiqri.exe] Error 1
gmake -f /work/n02/n02/gregoryj/xiqri/ummodel/Makefile -j 1 -s all failed (2) at /fs2/y07/y07/umshared/software/fcm-2016.12.0/bin/../lib/FCM1/Build.pm line 611
Build failed on Tue Apr 18 15:57:37 2017.
->Make: 110 seconds
Model build: failed
Regards
Willie
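The two failures quoted in this ticket (the archiving error and the build error) each leave a recognisable message in the job output. As a rough sketch only, a short Python script could scan a `.leave` or build log for those messages. The label names and the `classify_failures` helper are hypothetical; the patterns are taken from the text quoted above and may not cover other failure modes.

```python
import re
import sys

# Hypothetical signature table. The labels are illustrative; the patterns
# are copied from the messages quoted in this ticket.
SIGNATURES = {
    "disk_full": re.compile(r"No space left on device"),
    "ssh_failed": re.compile(r"ssh to machine \S+ failed"),
    "archive_abandoned": re.compile(r"archiving abandoned"),
    "build_failed": re.compile(r"Model build: failed"),
}


def classify_failures(log_text):
    """Return the sorted labels of every signature found in log_text."""
    return sorted(
        label for label, pattern in SIGNATURES.items() if pattern.search(log_text)
    )


if __name__ == "__main__":
    # Usage: python scan_log.py <logfile> [<logfile> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, classify_failures(handle.read()))
```

A match on `disk_full` points at quota or free space on the build filesystem, whereas `ssh_failed` together with `archive_abandoned` points at the archiving step instead.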
comment:4 Changed 4 years ago by jonathan
Dear Willie
Thanks. I presume that I can't run my job until the archiving is fixed, because there is very little space on /work. Do you expect it to be fixed soon?
I hadn't noticed the build had failed. The job which crashed ran later than that, using an executable I had already made successfully. Thanks for pointing this out. I had other failures too because I had filled my /work quota, which is 100 GiB, I see from SAFE. Is that the right place to look for quotas? The quota command says I don't have quotas.
Best wishes
Jonathan
comment:5 Changed 4 years ago by grenville
Jonathan
I have increased your /work quota to 1 TB (it may take a few hours for ARCHER to process the change).
Please use
lfs quota -uh <yourid> /work
to see your /work quota and usage.
Grenville
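The `lfs quota` command above gives the authoritative figures. To see which run directory is taking up the space, one option is to total the file sizes under a directory and compare them with the quota. The Python sketch below does that. The helper names are hypothetical, and it sums apparent file sizes (`st_size`), which may differ from the block usage that the filesystem counts against a quota.

```python
import os

GIB = 1024 ** 3  # bytes in one GiB


def tree_usage_bytes(root):
    """Sum the apparent sizes of all files below root (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def fraction_of_quota(used_bytes, quota_bytes):
    """Return usage as a fraction of the quota (1.0 means the quota is full)."""
    return used_bytes / quota_bytes


if __name__ == "__main__":
    import sys

    # Usage: python usage.py <directory> <quota in GiB>
    used = tree_usage_bytes(sys.argv[1])
    quota = float(sys.argv[2]) * GIB
    print(f"{used / GIB:.1f} GiB used, {100 * fraction_of_quota(used, quota):.0f}% of quota")
```

For example, it could be run on the job's DATAW directory with the 100 GiB quota mentioned earlier in the thread.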
comment:6 Changed 4 years ago by willie
- Status changed from new to pending
comment:7 Changed 4 years ago by willie
- Resolution set to fixed
- Status changed from pending to closed
The job did archive a small number of files to /nerc before crashing, but many of them remained in DATAW=/work/n02/n02/gregoryj/xiqri. I am puzzled.
Jonathan