Opened 5 years ago
Closed 5 years ago
#1789 closed help (fixed)
UKCA release job exceeding job time limit after changing STASH codes
Reported by: | dilshadshawki | Owned by: | annette |
---|---|---|---|
Component: | UM Model | Keywords: | UKCA, UMUI, time, archiving |
Cc: | | Platform: | MONSooN |
UM Version: | 8.4 |
Description
Dear Helpdesk,
The job xlzyj exceeds the job time limit with very little information in the .leave file:
/home/ds5912/ouput/xlzyj000.xlzyj.d16013.t174138.leave
I simply made a copy of a job that I know works (xlytb), removed its STASH diagnostics and added my own, but the job only managed to output one monthly file:
/projects/ukca-imp/dshawk/xlzyj/xlzyja.pm1999dec
and then it crashed.
There is no pe_output folder, which is where the errors were found in previous tickets!
Can anyone help me with this issue?
Many thanks,
Dill
Change History (17)
comment:1 Changed 5 years ago by dilshadshawki
comment:2 Changed 5 years ago by annette
- Owner changed from um_support to annette
- Status changed from new to assigned
Hi Dill,
I think the reason that the pe_output files are not being written is that the variable $DATAW is set incorrectly in your job.
In the UMUI, go to Input/output control and resources → Time convention and SCRIPT environment variables.
It looks like a typo: "dhsawk" instead of "dshawk".
Annette
comment:3 Changed 5 years ago by annette
- Status changed from assigned to pending
comment:4 Changed 5 years ago by dilshadshawki
Hi Annette,
Thanks for pointing this out. I tried running it again after correcting the typo, but I still get the same issue where it exceeds the time limit:
/home/dshawk/output/xlzyj000.xlzyj.d16018.t172623.leave
But this time I do get the pe_output folder.
I must admit, though, that the penny hasn't dropped as to which file I should search to find an informative error message!
/projects/ukca-imp/dshawk/xlzyj/pe_output
Please help!
Many thanks,
Dill
comment:6 Changed 5 years ago by annette
- Status changed from pending to assigned
Dill,
From the pe_output files, it looks like the model has finished running but the scripts aren't finishing. This looks the same as ticket:1748#comment:8.
I'm not sure what is causing the problem here, but will look into it.
Annette
comment:7 Changed 5 years ago by dilshadshawki
Hi Annette,
Many thanks for looking into this. I look forward to hearing from you.
Dill
comment:8 Changed 5 years ago by annette
Hi Dill,
I think the problem is that you have switched on post-processing but are missing the required branch:
fcm:um-br/dev/ros/vn8.4_MetoCray_arch/src
See here for full instructions:
http://collab.metoffice.gov.uk/twiki/bin/view/Support/CrayUMInstall#Archiving
If you have compilation switched off, you can re-build the scripts by selecting the option "Enable build UM scripts" in Compilation and Run options → UM Scripts Build.
I have tested this on Sunil's job (from ticket #1748, referenced above), which I think shows the same error.
Do let me know how you get on.
Annette
comment:9 Changed 5 years ago by annette
- Resolution set to fixed
- Status changed from assigned to closed
comment:10 Changed 5 years ago by dilshadshawki
Hi Annette,
Apologies, shortly after our last message I went abroad. I've added the branch as you mentioned and I will let you know how I get on.
Cheers,
Dill
comment:11 Changed 5 years ago by dilshadshawki
Hi Annette,
I made the changes you suggested; however, I now get an error in the dump reconfiguration:
/home/dshawk/output/xlzyj000.xlzyj.d16036.t120303.rcf.leave
Any other ideas?
Cheers,
Dill
comment:12 Changed 5 years ago by annette
Hi Dill,
As you have probably seen, there's nothing really helpful in the leave file about why it failed. And it looks as though, since this failure (dated 5 Feb), you have successfully run the recon for this job (on 8 Feb). So it would be helpful if you could explain what you are trying to do and what you want help with.
From the 8 Feb run, it looks like the model recon and run have completed OK, but the archiving has failed as it seems to have run /before/ the model run. I can look into this if you like?
Best regards,
Annette
comment:13 Changed 5 years ago by dilshadshawki
Hi Annette,
You are correct that I did run it again on 8th Feb; I am amazed that I forgot I had done this and still sent this ticket off to you. Please forgive me, I will get my head screwed back on!
Yes, please could you look into why it didn't archive? I have set it up to archive to MASS via MOOSE in the UM setup (Post-Processing → Main Switch + General Questions), and the previous issue you helped me with was to do with the archiving.
Could I start the CRUN in the meantime?
Cheers,
Dill
comment:14 Changed 5 years ago by annette
- Resolution fixed deleted
- Status changed from closed to reopened
Hi Dill,
I took a copy of your job, and the archiving seems to work OK for me. However I have noticed that my scripts differ from the versions you are using, so I think you may need to re-build the scripts (see #comment:8).
Let me know if this works.
Annette
comment:15 Changed 5 years ago by annette
- Status changed from reopened to pending
comment:16 Changed 5 years ago by dilshadshawki
Hi Annette,
It seems that the archiving is working after all! I have just checked MASS/MOOSE and the files are being archived successfully!
Many thanks for your help on this!
Let's get this ticket closed, shall we?
Best wishes,
Dill
comment:17 Changed 5 years ago by annette
Great, glad it's working.
Annette
comment:18 Changed 5 years ago by annette
- Keywords time, archiving added; time removed
- Resolution set to fixed
- Status changed from pending to closed
Sorry, the first directory above should read:
/home/dshawk/output/xlzyj000.xlzyj.d16013.t174138.leave