Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#1818 closed help (fixed)

Submitting job but it doesn't go into Q

Reported by: simon.tett
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 7.8

Description

Hi,

I have a job (xlwt#r) which when I submit does not go into the Q. It tells me it is extracting and puts a directory on archer (latest one is xlwtr-062105629). qstat shows no job and I can't see any output…

I can submit the compile script from that directory and then get another error: [FAIL] /work/n02/n02/stett2/um/xlwtr/umbase/cfg/bld.cfg: cannot locate config file, abort at /fs2/n02/n02/hum/software/fcm-2015.12.0/bin/../lib/FCM1/ConfigSystem.pm line 539

But I think that is a symptom of some other problem..

I think I am doing something dumb… Any suggestions for what I should do?

Simon

Change History (15)

comment:1 Changed 4 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Simon,

I think if you look in the UMUI job submission output window you will find an error message saying that the extract failed.

Looking in the extract output file on PUMA: /home/simon.tett/um/um_extracts/xlwtr/umbase/ext.out it failed to ssh to ARCHER.

You need to make sure you have ssh-agent set up correctly so that you can ssh from PUMA to ARCHER without the need to enter your passphrase/password.
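
One way to check the agent on PUMA before submitting is to look at the exit status of `ssh-add -l`, which OpenSSH documents as 0 on success, 1 when the command fails (for example, the agent holds no identities), and 2 when no agent can be contacted. The following is only a sketch: the helper name `describe_agent_status` is made up for illustration, and the ARCHER login host in the final comment is an assumption that does not come from this ticket.

```shell
#!/bin/sh
# describe_agent_status is a hypothetical helper: translate the exit
# status of `ssh-add -l` into a hint about what to do next.
describe_agent_status() {
    case "$1" in
        0) echo "agent running with keys loaded" ;;
        1) echo "agent running but no keys loaded: run ssh-add" ;;
        2) echo "no agent reachable: start ssh-agent, then run ssh-add" ;;
        *) echo "unexpected status $1: is ssh-add installed?" ;;
    esac
}

# Query the agent only if ssh-add is available on this machine.
if command -v ssh-add >/dev/null 2>&1; then
    status=0
    ssh-add -l >/dev/null 2>&1 || status=$?
    describe_agent_status "$status"
fi

# A non-interactive login test could then look like this (host name assumed):
#   ssh -o BatchMode=yes stett2@login.archer.ac.uk true && echo "passwordless ssh OK"
```

With `BatchMode=yes`, ssh fails instead of prompting, which is roughly the situation the extract step is in when it runs without a terminal.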

We also recommend you compile your job in /home rather than /work as it is much quicker.
In the UMUI panel FCM configuration → FCM extract and build directories set Target machine root directory (UM_ROUTDIR) to /home/n02/n02/stett2/um

Regards,
Ros.

comment:2 Changed 4 years ago by simon.tett

Hi Ros,

thanks. I thought it was something dumb!

I have to confess I don't really understand ssh; I treat it as a set of magic incantations.
I was running .ssh/setup when I tried this.
ssh-add didn't work, but some googling got me the following:
simon.tett@puma:~> eval ssh-agent -s
Agent pid 2403
simon.tett@puma:~> ssh-add
Identity added: /home/simon.tett/.ssh/id_rsa (/home/simon.tett/.ssh/id_rsa)
Enter passphrase for /home/simon.tett/.ssh/id_dsa:
Identity added: /home/simon.tett/.ssh/id_dsa (/home/simon.tett/.ssh/id_dsa)
But when I restarted the UMUI and tried the extract, I got a failure…
Simon

comment:3 Changed 4 years ago by ros

Hi Simon,

Ideally you should be calling .ssh/setup from your ~/.profile so that the ssh-agent is automatically initialised, as required, when you login to PUMA. Then you don't have to run it manually. I see that you don't appear to have a .profile at all. I would suggest copying our standard .profile which is at ~um/um-training/setup/.profile to $HOME/.profile

Then all you should need to run is ssh-add. The ssh agent should keep running even when you log out of PUMA, however you may need to restart it from time to time. Sometimes this step will fail with the following error:

Could not open a connection to your authentication agent

Instructions on recovering from this are in our FAQ: http://cms.ncas.ac.uk/wiki/FAQ

The job submission is failing now because you've set the Target machine root directory to /home/simon.tett/um rather than /home/n02/n02/stett2/um. You need to set the full path explicitly and not use local variables like $HOME, as these will be expanded on PUMA.
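
The point about expansion can be turned into a quick sanity check on a candidate UM_ROUTDIR value. This is a minimal sketch, not part of the UMUI: `is_explicit_path` is a hypothetical helper that accepts only an absolute path with no `$` or `~` in it, since either would be resolved on the submitting host rather than on ARCHER.

```shell
#!/bin/sh
# is_explicit_path is a hypothetical helper: succeed only for an absolute
# path that contains no unexpanded variable ($...) or tilde.
is_explicit_path() {
    case "$1" in
        *'$'*|*'~'*) return 1 ;;   # would be expanded on the submitting host
        /*)          return 0 ;;   # absolute and literal
        *)           return 1 ;;   # relative path
    esac
}

# Single quotes keep the candidates literal so the check sees them as typed.
for dir in '/home/n02/n02/stett2/um' '$HOME/um' '~/um'; do
    if is_explicit_path "$dir"; then
        echo "ok:     $dir"
    else
        echo "reject: $dir"
    fi
done
```

Only the first candidate passes. The check says nothing about whether the directory actually exists on ARCHER.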

Regards,
Ros.


comment:4 Changed 4 years ago by simon.tett

That fixed it! and thanks for the FAQ reminder.

Simon

comment:5 Changed 4 years ago by simon.tett

Probably should add a new ticket but..
The run is compiling but puts the exec and reconfig in /home/n02/n02/stett2/um/xlwtr/ummodel/bin and /home/n02/n02/stett2/um/xlwtr/umrecon/bin, and not in $DATAM/$RUNID.exec & $DATAM/$RUNID.exec, which is what I asked for.

Sure I'm missing something in my setup…

comment:6 Changed 4 years ago by ros

Hi Simon,

The execs and scripts are initially put in the respective bin directories in $HOME and then they are copied over to their final resting place. Execs usually live in $DATAW/bin. I'm wondering whether trying to put them somewhere else has confused it. There are 2 failed mkdir commands at the end of the comp.leave file. I notice all the UM scripts were indeed successfully copied over to $DATAW/bin.

Rather than sending off the compile again just to try and get it to move the execs, I would just copy the execs over manually to where you want them.
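
A manual copy along those lines might look like the sketch below. `copy_execs` is a hypothetical helper, not a UM script. It copies every executable regular file from a build bin directory, and the usage lines assume `DATAW` is set as it would be in the job's environment.

```shell
#!/bin/sh
# copy_execs is a hypothetical helper: copy every executable regular file
# from a build bin directory into a destination directory.
copy_execs() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*; do
        # Skip subdirectories and non-executable files.
        if [ -f "$f" ] && [ -x "$f" ]; then
            cp -p "$f" "$dest/" || return 1
        fi
    done
}

# Possible use on ARCHER, with the build directories from this ticket:
#   copy_execs /home/n02/n02/stett2/um/xlwtr/ummodel/bin "$DATAW/bin"
#   copy_execs /home/n02/n02/stett2/um/xlwtr/umrecon/bin "$DATAW/bin"
```

`cp -p` keeps the timestamps and permissions of the built files, so the copies stay executable.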

Cheers,
Ros.

comment:7 Changed 4 years ago by simon.tett

Hi Ros,

thanks. As I'm trying to set up a test bed for modifying the code I've set the directory for both binaries to $DATAW/bin. Hopefully, that will do the trick. Otherwise every time I change the model I'll need to move the binaries over by hand… I wonder if I should have created $DATAM by hand first…

Simon

comment:8 Changed 4 years ago by ros

Hi Simon,

Yes, do let me know if this doesn't fix it and I will investigate further. It should create the $DATAW/$DATAM directories automatically. So if you do end up having to do that there is something wrong.

Regards,
Ros.

comment:9 Changed 4 years ago by simon.tett

yes that worked. Now have .exec and .recon in bin directory. And run job sitting in standard q.

Simon

comment:10 Changed 4 years ago by simon.tett

Final question I hope.. Job is in standard q awaiting reconfiguration. Can I run reconfig in the serial Q in an attempt to reduce Q wait time?

ta
Simon

comment:11 Changed 4 years ago by simon.tett

The sad saga of my job continues. It has reconfigured but fails to submit the run job. I think I've asked the model to run for 10 days. And looking through the UMUI I think that is what I've done. I'm resetting the start time in the dump to the model basis (start?) time. There are some error messages in the reconfig output but qsexecute returns status 0 and there is an .astart file…

comment:12 Changed 4 years ago by ros

Hi Simon,

In answer to your first question, you can't run the recon in the serial queue, but if it's running on fewer than 8 nodes (192 cores) and takes less than 20 minutes to complete you could send it to the short (debug) queue. Change #PBS -q standard to #PBS -q short in umuisubmit_rcf.
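
That edit can be scripted rather than done by hand. The sketch below assumes the submit script contains a line beginning `#PBS -q `; `set_pbs_queue` is a hypothetical helper name, not something the UMUI provides.

```shell
#!/bin/sh
# set_pbs_queue is a hypothetical helper: rewrite the "#PBS -q" line of a
# submit script so that it names a different queue.
set_pbs_queue() {
    file=$1
    queue=$2
    tmp="$file.tmp.$$"
    # Write to a temporary file first so a failed sed leaves the original
    # intact, then copy the contents back to keep the file's permissions.
    sed "s/^#PBS -q .*/#PBS -q $queue/" "$file" > "$tmp" \
        && cat "$tmp" > "$file" \
        && rm -f "$tmp"
}

# Possible use in the job directory:
#   set_pbs_queue umuisubmit_rcf short
```

The queue name is substituted straight into the sed expression, so it should be a plain word such as `short` or `standard`.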

The run job has been submitted; it is sitting in the queue. It was submitted at 20:29 last night just after the recon finished.

ARCHER-xc30> qstat -u stett2     
sdb: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3548292.sdb     stett2   standard xlwtr_run     --    2  48    --  12:00 Q   --
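
In the listing above, the S column holds the job state: Q means queued and R means running. To pull out just the job ID and state, a small filter could be used. This is a sketch that assumes the column layout shown here, with the state as the second-to-last field; `job_states` is a hypothetical helper name.

```shell
#!/bin/sh
# job_states is a hypothetical helper: read `qstat -u USER` output on stdin
# and print "job-id state" for each job row. It assumes the state (S) is
# the second-to-last column, as in the listing in this ticket.
job_states() {
    # Job rows start with a numeric ID followed by a dot; headers do not.
    awk '$1 ~ /^[0-9]+\./ { print $1, $(NF - 1) }'
}

# Possible use on ARCHER:
#   qstat -u stett2 | job_states
```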

Cheers,
Ros.

comment:13 Changed 4 years ago by simon.tett

duh! Thanks… I'll modify whole job to run in short Q — I have a script edit for that.

comment:14 Changed 4 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

comment:15 Changed 4 years ago by ros

  • Platform changed from HECToR to ARCHER