Opened 5 weeks ago

Last modified 12 days ago

#2152 pending help

HadGEM2 qsserver failure

Reported by: jonathan
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: <select version>

Description

Dear helpdesk

My HadGEM2 job failed overnight with a qsserver failure
/home/n02/n02/gregoryj/um/umui_out/xiqri000.xiqri.d17109.t103411.leave
Do you know what happened? It says

qshector_arch: ERROR ssh to machine espp1 failed
               copying files to /nerc disk failed
               archiving abandoned

so I wonder if it's because of quota in /nerc. I see in SAFE that the n02 rdf quota is nearly full - it is marked in red. But perhaps it's something else? Maybe I need rdf quota specifically for n02-FAFMIP, which apparently I don't have in SAFE?

Thanks

Jonathan

Change History (6)

comment:1 Changed 5 weeks ago by jonathan

The job did archive a small number of files to /nerc before crashing, but many of them remained in DATAW=/work/n02/n02/gregoryj/xiqri. I am puzzled. Jonathan

comment:2 Changed 5 weeks ago by grenville

Jonathan

There is no need for a FAFMIP quota on /nerc - and there is space available on /nerc, so the problem lies elsewhere. There is some strange activity on the serial machines at the moment resulting in jobs running very slowly - why that might cause an ssh failure is not apparent (perhaps not relevant). We are investigating.

Grenville

comment:3 Changed 5 weeks ago by willie

Hi Jonathan,

There are some issues with archiving: we're working on this with ARCHER at the moment.

Your build job for xiqri has failed:

r: /work/n02/n02/gregoryj/xiqri/ummodel/tmp/lib__fcm__xiqri.a: No space left on device
fcm_internal load failed (256)
gmake: *** [xiqri.exe] Error 1
gmake -f /work/n02/n02/gregoryj/xiqri/ummodel/Makefile -j 1 -s all failed (2) at /fs2/y07/y07/umshared/software/fcm-2016.12.0/bin/../lib/FCM1/Build.pm line 611
Build failed on Tue Apr 18 15:57:37 2017.
->Make: 110 seconds
Model build: failed

Regards
Willie

comment:4 Changed 5 weeks ago by jonathan

Dear Willie

Thanks. I presume that I can't run my job until the archiving is fixed, because there is very little space on /work. Do you expect it to be fixed soon?

I hadn't noticed the build had failed. The job which crashed ran later than that, using an executable I had already made successfully. Thanks for pointing this out. I had other failures too because I had filled my /work quota, which is 100 GiB, I see from SAFE. Is that the right place to look for quotas? The quota command says I don't have quotas.

Best wishes

Jonathan

comment:5 Changed 5 weeks ago by grenville

Jonathan

I have increased your /work quota to 1 TB (it may take a few hours for ARCHER to process the change).

Please use

lfs quota -uh <yourid> /work

to see your /work quota and usage.

Grenville
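
The quota check in comment:5 can be combined with a look at how much the run directory itself occupies. The following is only a sketch: the function name check_work_space is hypothetical, it assumes the Lustre client tool lfs is on your PATH, and it takes the login name from `id -un` rather than a typed-in id.

```shell
# Sketch only: show the /work quota for the current user, then the size of
# one directory. check_work_space is a hypothetical helper name.
check_work_space() {
    dir="$1"
    # lfs is the Lustre client tool; skip the quota report if it is absent.
    if command -v lfs >/dev/null 2>&1; then
        lfs quota -h -u "$(id -un)" /work
    else
        echo "lfs not found: run this on a machine that mounts /work" >&2
    fi
    # Report the total size of the directory, or fail if it does not exist.
    if [ -d "$dir" ]; then
        du -sh "$dir"
    else
        echo "no such directory: $dir" >&2
        return 1
    fi
}
```

For the job in this ticket it could be called as `check_work_space /work/n02/n02/gregoryj/xiqri`, using the DATAW path given in comment:1.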

comment:6 Changed 12 days ago by willie

  • Status changed from new to pending