Opened 2 years ago

Closed 2 years ago

#2055 closed help (answered)

compiled jobs not running

Reported by: Emre
Owned by: willie
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.4

Description

Hello Helpdesk,
Over the last few weeks I have come across some strange behaviour.

1) Jobs that are copies of perfectly running jobs compile successfully but never run. I can't understand why this happens, because the ..comp.leave and …rcf.leave files seem OK. Here are some examples from the /home/output directory.

emre n02 13881421 Jan 16 18:17 xmufv000.xmufv.d17016.t170839.comp.leave
emre n02 311383 Jan 16 19:07 xmufv000.xmufv.d17016.t170839.rcf.leave

Note: the 2-year xmufv job above is a 21-month job that has never run, although it is almost identical to the 9-month job (xmufu), which runs fine.
2) Even stranger, and something I have seen many times, are jobs that run on some days and fail on others.

For instance, the job xncga was running well a few days ago.
On Jan 15 the compilation failed

(emre n02 3821 Jan 15 17:48 xncga000.xncga.d17015.t174206.comp.leave) with the complaint:

"BEGIN failed--compilation aborted at /fs2/y07/y07/umshared/software/fcm-2016.10.0/bin/../lib/FCM/CLI.pm line 26.

Compilation failed in require at /work/y07/y07/umshared/software/fcm/bin/fcm line 25"

This makes no sense: I am already under the umshared package, and the same job used to run well.

On Jan 16 the same job ran again without any complaints.


On the night of Jan 16 the same job failed again in the compilation stage.
emre n02 1440 Jan 17 01:15 xncga000.xncga.d17017.t004855.comp.leave
/home/n02/n02/emre/umui_runs/xncgb-017005338/umuisubmit_compile[42]: .: /work/y07/y07/umshared/bin/loadcomp: cannot open [Is a directory]

Can you help me resolve these issues?
Thank you


Change History (5)

comment:1 Changed 2 years ago by Emre

  • Platform set to ARCHER

comment:2 Changed 2 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Emre,

I've been having similar problems. I've put in a query to ARCHER.

Regards
Willie

comment:3 Changed 2 years ago by Emre

Thank you Willie
Did they tell you when they will fix it?
When should we expect to get an answer from them?
The same problems seem to continue. ARCHER refuses to run any "normal run" job that is longer than a year, and I don't understand why. A fresh example:
emre n02 13881435 Jan 18 11:57 xmufw000.xmufw.d17017.t142125.comp.leave
emre n02 311308 Jan 18 12:04 xmufw000.xmufw.d17017.t142125.rcf.leave
Compilation and reconfiguration succeed, but the computation does not start and the job quits without producing any .leave file.


comment:4 Changed 2 years ago by willie

Hi Emre,
ARCHER have posted the following:

You may currently be experiencing some problems with the ARCHER external login nodes and pre/post processing nodes. 

This recent instability has been identified as a Lustre filesystem deadlock condition, which occurs intermittently during certain bulk read operations. The nature of this problem has been identified and Engineering teams are working hard to have a Lustre client fix available by the end of the month. In parallel, we are investigating the possibility of a workaround.

We will continue to monitor this issue and it will be discussed at the next ARCHER Management Board Meeting. 

We will keep users informed and update you with further information as soon as it becomes available. 

We hope that there will be an improvement soon and we thank you for your patience.

You can sign up for these messages on the ARCHER SAFE web site.

Regards
Willie

comment:5 Changed 2 years ago by willie

  • Resolution set to answered
  • Status changed from accepted to closed