Opened 7 years ago

Closed 7 years ago

#1081 closed help (fixed)

Unknown problem in .leave file when submitting job

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: HECToR
UM Version: 6.6.3

Description

Hi,

I'm trying to submit a job on HECToR that I copied from someone else; I *believe* I have changed all the relevant sections. However, although it submitted okay, it failed more or less straight away, producing the attached .leave file. As you can see, there are numerous errors, but I'm not sure which one(s) is causing the problem.

Many thanks,

Charlie Williams

Attachments (6)

xioga000.xioga.d13161.t111212.leave (11.0 KB) - added by charlie 7 years ago.
.leave file for job xioga
xioga000.xioga.d13164.t160952.leave (192.7 KB) - added by charlie 7 years ago.
Next .leave file generated 13/6/13
xioga000.xioga.d13166.t154506.leave (208.8 KB) - added by charlie 7 years ago.
xioga000.xioga.d13175.t164433.leave (161.0 KB) - added by charlie 7 years ago.
xioga000.xioga.d13178.t151502.leave (160.6 KB) - added by charlie 7 years ago.
xioga000.xioga.d13182.t112133.leave (208.2 KB) - added by charlie 7 years ago.


Change History (27)

Changed 7 years ago by charlie

.leave file for job xioga

comment:1 follow-up: Changed 7 years ago by grenville

Hi Charlie

Not sure what's going on yet, but I did notice that you have

TARGET_MC=pathscale

in your profile. This should be changed to

TARGET_MC=cce

Grenville
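For reference, a minimal sketch of making that change from the command line instead of in an editor. It assumes the setting appears in the profile as a plain `TARGET_MC=pathscale` assignment (with or without `export`); the helper name `set_target_mc` is made up for illustration and is not part of the UM tooling.

```sh
# Hypothetical helper: rewrite the TARGET_MC assignment in a profile file.
# Assumes a line of the form "TARGET_MC=..." or "export TARGET_MC=...".
set_target_mc() {
    profile=$1
    value=$2
    # Keep a backup of the original file before touching it.
    cp -- "$profile" "$profile.bak" || return 1
    # Replace everything after "TARGET_MC=" up to the end of that line.
    sed "s/\(TARGET_MC=\).*/\1$value/" "$profile.bak" > "$profile"
}

# Usage on the real file (not run here):
#   set_target_mc "$HOME/.profile" cce
#   grep TARGET_MC "$HOME/.profile"
```

The new value is only picked up by shells started after the edit, so log out and back in (or source the file) before resubmitting.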

comment:2 in reply to: ↑ 1 Changed 7 years ago by charlie

Dear Grenville,

I'm not sure if I'm meant to reply like this, so apologies if not!

I have changed my .profile as you suggested. Is this likely to have caused
the error?

Thanks,

Charlie

comment:3 Changed 7 years ago by grenville

Charlie

You have not yet created an executable - please navigate to Model Selection→Compilation and Modifications→Compile options for the model and select Compile and build the executable named below, then run. Please do the same for the reconfiguration - Model Selection→Compilation and Modifications→Compile options for the reconfiguration and select Compile and build the executable named below. Once you have executables for the model and reconfiguration, you can revert to your current settings.

Grenville

comment:4 Changed 7 years ago by charlie

Dear Grenville,

Many thanks, and apologies for not getting back to you straightaway - as you know, Hector was down most of yesterday.

Sorry for not spotting that error. I have now done as you suggested. However, after submitting my job again this morning, it appears to have failed and hasn't even created any output this time - the last available .leave file is the one I sent you before.

Sorry about this,

Charlie

comment:5 Changed 7 years ago by ros

Hi Charlie,

In UMUI window Sub model independent → FCM configuration → FCM Configuration variables you need to change "Target machine root extract directory (UM_ROUTDIR)" to be /home/n02/n02/cjrw09/um

Cheers,
Ros.

Changed 7 years ago by charlie

Next .leave file generated 13/6/13

comment:6 Changed 7 years ago by charlie

Dear Ros,

Very many thanks, and apologies for not spotting that.

I've now changed it as you suggested and have resubmitted my job, but it has again failed, generating the attached .leave file. I can see a number of errors, but I'm not sure which one is causing it to fail or how to resolve it.

Apologies,

Charlie

comment:7 Changed 7 years ago by ros

Hi Charlie,

You need to build the reconfiguration executable. Go to UMUI window submodel independent → Compilations and modifications → modifications of the reconfiguration and switch to "compile and build the executable named below"

Cheers,
Ros.

Changed 7 years ago by charlie

comment:8 Changed 7 years ago by charlie

Dear Ros,

Very many thanks, and apologies for the delay in getting back to you.

I've now done as you suggested, and have resubmitted my job once again. It failed again, and I now have the attached output file - as before, I think I can see the error (to do with the stashmaster file, I think) but not sure how to resolve it?

Apologies again,

Charlie

comment:9 Changed 7 years ago by grenville

Charlie

The start file you are using contains things that the model is not expecting (PSTAR AFTER TIMSTEP for example). You can create a userstash file 01001_ignore to instruct the model to ignore these items. I started to create one in /home/grenville/USERSTASH/01001_ignore; however, the start file is also missing many fields that the model does expect (items 101,102,103,104…). It might be better if you could find a start file which more closely matched what the model required. Is that possible?

Grenville

comment:10 Changed 7 years ago by charlie

Dear Grenville,

Many thanks, and apologies for the delay.

I have now obtained another start dump, which I have been assured matches the job I'm trying to run. I've also resolved a couple of stash master errors (I think). However, upon submitting the job again, it again falls over and produces the attached output. I can see an error, saying it can't find a particular item in my stash (which I have checked and indeed it's not there) but not sure if this error is the relevant one. Would you mind taking another look at the attached output?

Thanks,

C

Changed 7 years ago by charlie

comment:11 Changed 7 years ago by grenville

Charlie

The compute nodes on HECToR cannot see /home, so cannot open the atmosphere start file. Please move that file to somewhere on /work. I did this and the atmosphere reconfiguration worked. However, the ocean start file /work/n02/n02/dh023729/result/xhgzgo.daa4410 does not exist on HECToR - please check with Liang where this file now resides.

Grenville
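A quick way to catch this before submitting is to check that every input path used by the job starts with /work. A minimal sketch; `on_work` is a hypothetical helper name, the example file names are made up, and the check is purely textual (it does not test that the file exists or resolve symlinks).

```sh
# Hypothetical helper: succeed only if the path is an absolute path under /work.
on_work() {
    case $1 in
        /work/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example file names are invented for illustration.
for f in /home/n02/n02/cjrw09/atmos.start /work/n02/n02/cjrw09/dumps/atmos.start; do
    if on_work "$f"; then
        echo "ok:      $f"
    else
        echo "move me: $f (compute nodes cannot see it)"
    fi
done
```

Running `ls -l` on each path on HECToR itself is still worthwhile afterwards, since a path under /work can be correct in form and the file still be missing.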

comment:12 Changed 7 years ago by charlie

Dear Grenville,

Many thanks. I have now done as you suggested: moved my atmosphere start dump to /work/n02/n02/cjrw09/dumps, copied the equivalent (i.e. same date) ocean start dump to the same directory, changed the relevant pathnames within my job and submitted it again. Please see attached for the latest output - I note the same error is still there, but given that this wasn't the problem you mentioned last time, I am now not sure what the new problem is.

C

Changed 7 years ago by charlie

comment:13 Changed 7 years ago by grenville

Charlie

Somehow, use of user stashmaster files has been switched off. Please navigate to Model Selection→Atmosphere→STASH→User-STASHmaster… and check "Using user STASHmaster files for the Atmosphere"

Grenville

comment:14 Changed 7 years ago by charlie

Dear Grenville,

Right, that was actually me! The reason I turned them off is because, with them turned on, 7 individual "Broken code" errors are generated when trying to close the window. The errors are listed below. They all seem to be connected with ~umui/hadgem2/userstash/epflux606 but I have absolutely no idea what they mean. I asked Liang (whose job it is copied from) and he has never seen these errors. I asked him whether these could all be turned off without causing further problems, and he said yes.

Charlie

----

User-STASHmaster file <~umui/hadgem2/userstash/epflux606> includes items with a broken grid code: 14. This is not supported without first providing a fix to the UM. See record RESIDUAL MN MERID. CIRC. VSTARBAR)

User-STASHmaster file <~umui/hadgem2/userstash/epflux606> includes items with a broken grid code: 14. This is not supported without first providing a fix to the UM. See record RESIDUAL MN MERID. CIRC. WSTARBAR)

User-STASHmaster file <~umui/hadgem2/userstash/epflux606> includes items with a broken grid code: 14. This is not supported without first providing a fix to the UM. See record ELIASSEN-PALM FLUX (MERID. COMPONENT)

User-STASHmaster file <~umui/hadgem2/userstash/epflux606> includes items with a broken grid code: 14. This is not supported without first providing a fix to the UM. See record ELIASSEN-PALM FLUX (VERT. COMPONENT)

User-STASHmaster file <~umui/hadgem2/userstash/epflux606> includes items with a broken grid code: 14. This is not supported without first providing a fix to the UM. See record DIVERGENCE OF ELIASSEN-PALM FLUX

User-STASHmaster file <~umui/hadgem2/userstash/epflux606> includes items with a broken grid code: 14. This is not supported without first providing a fix to the UM. See record MERIDIONAL HEAT FLUX

User-STASHmaster file <~umui/hadgem2/userstash/epflux606> includes items with a broken grid code: 14. This is not supported without first providing a fix to the UM. See record MERIDIONAL MOMENTUM FLUX

comment:15 Changed 7 years ago by grenville

Charlie

I just clicked OK on each of those errors and pressed on - the model runs OK. You could try removing ~umui/hadgem2/userstash/epflux606 from the list of user stash files (I haven't tried running the model this way). Doing that prevents the error messages from appearing when closing the user stash window.

Grenville

comment:16 Changed 7 years ago by charlie

Dear Grenville,

Apologies for the delay. I have now done as you suggested, clicking ok on each of those errors and resubmitting my job. It has again fallen over, this time producing the attached output. What have I done wrong this time?

Thanks again,

Charlie

Changed 7 years ago by charlie

comment:17 Changed 7 years ago by grenville

Charlie

The error message in the leave file says it can't find the results directory. I'm not sure why it hasn't been created; please just run mkdir /work/n02/n02/cjrw09/result on HECToR and try running again.

/work/n02/n02/cjrw09/xioga/bin/qsexecute[741]: cd: /work/n02/n02/cjrw09/result: [No such file or directory]
47 /work/n02/n02/cjrw09/xioga/bin/qsexecute : cd to /work/n02/n02/cjrw09/result has failed

Grenville
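If the parent directories might also be missing, `mkdir -p` creates the whole path and does nothing when the directory is already there. A small sketch; `ensure_dir` is a hypothetical wrapper, not part of the UM scripts.

```sh
# Hypothetical wrapper: create the directory (and any missing parents),
# then confirm it exists and is writable by the current user.
ensure_dir() {
    mkdir -p -- "$1" && [ -d "$1" ] && [ -w "$1" ]
}

# On HECToR this would be called as (not run here):
#   ensure_dir /work/n02/n02/cjrw09/result && echo "result directory ready"
```

Because it is safe to run repeatedly, the same call could be made before every submission without harm.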

comment:18 Changed 7 years ago by charlie

Dear Grenville,

Apologies for the delay on this. I have now made that directory as you suggested, and submitted my job 2 days ago, but nothing happened. Upon submitting it again today, I find that it is not even getting as far as Hector this time, hence no model output.

When I submit my job, I get the following:

FCM_MAIN: Calling Extract…
Base extract: failed
See extract output file
/home/charlie/um/um_extracts/xioga/umbase/ext.out
FCM_MAIN: Extract failed
Tidying up directories …

and then the whole thing freezes.

One thing which is different from before is that it is now asking me each time for my passphrase. It didn't ask for this before this week. I have entered my passphrase, or at least what I think is my passphrase, but perhaps it's wrong? But why is it asking for it in the first place? All I have done since successfully submitting a job (last week) is create that directory on Hector - nothing else has changed.

Many thanks,

Charlie
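When a submission stops like this, the extract log named in the message is a reasonable first place to look, and the same approach works on a long .leave file. A minimal sketch that lists the lines mentioning an error or failure together with their line numbers; `show_errors` is a hypothetical helper and the search pattern is only a guess at what is useful, so genuine problems worded differently will be missed.

```sh
# Hypothetical helper: print "line-number:text" for every line that mentions
# an error or a failure, ignoring case. Returns non-zero if nothing matches.
show_errors() {
    grep -n -i -E 'error|fail' -- "$1"
}

# Usage on PUMA (not run here):
#   show_errors /home/charlie/um/um_extracts/xioga/umbase/ext.out
```

The first match is often more informative than the last, because later errors tend to be consequences of the earlier one.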

comment:19 Changed 7 years ago by grenville

Charlie

It looks like your PUMA-HECToR communication has become corrupted. Please go to your .ssh directory on PUMA and delete the file called "environment.puma". Log out and log in again, type ssh-add, and enter your passphrase. If you can ssh to HECToR from the PUMA command line, your UMUI submission should work again.

Grenville
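Those steps can be sketched as a short command sequence. Only the file removal is wrapped in a function here; `reset_agent_env` is a hypothetical name, and the remaining steps are shown as comments because they are interactive and need a real PUMA login. The HECToR host name is deliberately left as a placeholder.

```sh
# Hypothetical helper: remove the environment.puma file from the given
# .ssh directory (default: ~/.ssh). rm -f succeeds even if it is absent.
reset_agent_env() {
    ssh_dir=${1:-$HOME/.ssh}
    rm -f -- "$ssh_dir/environment.puma"
}

# On PUMA (not run here):
#   reset_agent_env      # delete ~/.ssh/environment.puma
#   exit                 # log out, then log back in
#   ssh-add              # enter the passphrase when prompted
#   ssh-add -l           # list the keys the agent now holds
#   ssh <your HECToR login>   # should connect without a passphrase prompt
```

If `ssh-add -l` reports that the agent has no identities, the passphrase step did not take and is worth repeating before resubmitting from the UMUI.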

comment:20 Changed 7 years ago by charlie

Dear Grenville,

Sorry for the delay once again on this.

I've now (or rather a couple of days ago) resubmitted my job, and I think it has worked. At least, I now have several files in my work/result directory:

xiogaa.dac94l0
xiogaa.dac9510
xiogaa.dac95b0

As well as exactly the same for the ocean output.

I started my job on 11/4/1929 and wanted it to run for a month (just to check everything works). It appears to have done this. My only slight surprise is that it has given me 3 output files every 10 days, with only one value (presumably 10 day average?) in each file. I was expecting to have daily, and sometimes hourly data, for at least some of my fields. In my stash, for example, total precipitation rate is being output every 3 hours, every daily maximum, every day and every dump. But I can't see these periods in my output.

Have I done something silly?

Charlie

comment:21 Changed 7 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed

Closing this ticket as it was dealt with offline.

Charlie just had to change the STASH settings to get the output he was expecting.

Annette
