Opened 8 years ago

Closed 8 years ago

#967 closed help (fixed)

In-model archiving for HadGEM2-CC

Reported by: swr07dmm
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: MONSooN
UM Version: 6.6.3

Description

I am currently trying to set up archiving for HadGEM2-CC using the 'post-processing' table of the umui. I have followed the online instructions on how to do this, and added the relevant FCM command for HadGEM2 (which I obtained from Jeff Coles). However, when my job goes to archive the first month it crashes (job xhjjc). I think the problem may be that I am a member of multiple groups on MONSooN (CLPREDIC, SOLAR, SRSIMS), and when it comes to the 'mkset' command it does not seem to recognise my group, and so crashes. I may be wrong on this of course, but the crash is definitely caused by something in the archiving.

thanks,
Dann
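
When a user belongs to several groups, it can help to confirm which group names the system actually reports for the account before comparing them with the group the archiving scripts use. The following is a minimal sketch in Python using only the standard library; the function name `current_group_names` is hypothetical, and it does not reproduce how 'mkset' or the UM scripts select a project group.

```python
import grp
import os


def current_group_names():
    """Return the names of all Unix groups the current process belongs to."""
    names = []
    # Combine the supplementary groups with the primary group id.
    for gid in sorted(set(os.getgroups()) | {os.getgid()}):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The gid has no entry in the group database; keep the number.
            names.append(str(gid))
    return names


if __name__ == "__main__":
    print(current_group_names())
```

The printed list can then be compared by eye with the project group the job is expected to archive under.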

Change History (9)

comment:1 Changed 8 years ago by ros

Hi Dann,

That's weird as the variable PROJECTGROUP is being set to "srsims" correctly in PPCNTL.

I've taken a copy of your job and will hopefully be able to see where it's getting lost.

Regards,
Ros.

comment:2 Changed 8 years ago by ros

Hi Dann,

I've tracked down the problem and have put the required fix into Jeff's archiving branch.

Hopefully just doing a UM scripts build will correctly pick up the changes. Go to subindep → Compilation and modifications → UM Scripts build and switch on "enable build of UM scripts".

Regards,
Ros.

comment:3 Changed 8 years ago by swr07dmm

Thanks Ros,

At what stage should I do the UM scripts build? i.e. during the compilation stage, reconfiguration or running? Presumably I should just do it for one of these stages then switch off "enable build of UM scripts" ? Or should I do it somewhere else then perform all 3 of the above stages with the button switched off?

thanks,
Dann

comment:4 Changed 8 years ago by ros

Hi Dann,

You had the job set up as an NRUN with the executable already compiled, so I assumed that you simply wanted to run and not recompile. If that's the case, just switch on "enable UM scripts build" and resubmit the NRUN. It'll rebuild the scripts before it does anything else, be that reconfiguration or running the model. Then switch it off again for subsequent runs.

If you need to recompile, then don't do anything: the recompilation of the model executable automatically includes a build of the UM scripts.

Hope that makes sense.
Regards,
Ros.

comment:5 Changed 8 years ago by swr07dmm

Hi Ros,

The corrections you put in seem to work fine for the CLPREDIC group, but will not allow me to do the same on the SOLAR group: the job crashes again after attempting to archive. Unlike the problem before (which was to do with the mkset command), this problem seems to be due to the 'moo put' command. I was attempting to use it on my job id xhjjd, and the output seems to show a moo fail (/home/dmitch/output/xhjjd000.xhjjd.d12328.t170259.leave). One of the strange things is that if I look at the permissions of the solar directory on MASS (moo ls -l moose:/crum/sol*), it seems to belong to Sarah Ineson at the Met Office. I do not know if that has anything to do with the trouble!

thanks,
Dann

comment:6 Changed 8 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Dann,

I've spoken to AJ Watling at the Met Office and he has passed this problem on to the Storage team there. They should contact you directly; however, if I hear anything I'll update this ticket.

Regards,
Ros.

comment:7 Changed 8 years ago by swr07dmm

Hi Ros, there still seems to be something amiss here. My jobs which are using in-model archiving are crashing regularly, and the .leave etc. files are not that helpful (see run xhjjd, and to a lesser extent xhjjc). The model seems to crash after about 1 model year at the moment, with the error seeming to be '/projects/solar/dmitch/um/xhjjd/bin/qsresubmit: Error job not resubmitted because of server failure'

However, I think it might still be related to the archiving, because it always seems to leave 3 or 4 months unarchived before stopping.

Is it a feature of the model that if it fails to archive a certain number of months it stops?

thanks,

Dann

comment:8 Changed 8 years ago by ros

Hi Dann,

This looks like another MASS problem so I have forwarded your query to the Met Office.

Regards,
Ros.

comment:9 Changed 8 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Response received from the Met Office:

I have been investigating this problem for you. We believe this was caused by networking issues on that day. Please can you resubmit your job and attempt to overwrite the file.

qsserver: Fri Dec  7 06:19:36 GMT 2012:  xhjjda.pf83dec ARCHIVE PPNOCHART
qsmoose: arguments passed are:
  xhjjda.pf83dec dmitch /home/dmitch 6.6.3 CRUN crum hadmass@hc0800 /home/dmitch/umui_runs/xhjjd-341105132 solar
Checked that set moose:crum/xhjjd exists
The command to archive file is:
"moo put -f -vv -c=umpp /projects/solar/dmitch/um/xhjjd/xhjjda.pf83dec moose:crum/xhjjd/apf.pp/xhjjda.pf83dec.pp"
OK, return code = 0
xhjjda.pf83dec was added to the apf.pp
qscasedisp: return code after calling qsmoose RCARC=0
MOOSE: Successfull xhjjda.pf83dec time taken 5 seconds.
qsserver: Fri Dec  7 06:19:41 GMT 2012:  xhjjda.pf83dec DELETE
xhjjda.pf83dec deleted


Many Thanks,
Ian
Ian Randall BSc Hons, MBCS, Storage Analyst
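
The qsserver excerpt above pairs each ARCHIVE request with an `RCARC=` return code, which suggests a simple way to spot files that were not archived. The sketch below is a rough illustration in Python, assuming every log follows the same line layout as this single excerpt; the function names `archive_status` and `failed_files` are hypothetical and not part of the UM scripts or MOOSE.

```python
import re

# Shortened copy of the excerpt quoted in this ticket.
SAMPLE_LOG = """\
qsserver: Fri Dec  7 06:19:36 GMT 2012:  xhjjda.pf83dec ARCHIVE PPNOCHART
OK, return code = 0
qscasedisp: return code after calling qsmoose RCARC=0
qsserver: Fri Dec  7 06:19:41 GMT 2012:  xhjjda.pf83dec DELETE
"""

# Assumed layout: "qsserver: <date>:  <file> ARCHIVE ..."
ARCHIVE_RE = re.compile(r"^qsserver: .*:\s+(\S+) ARCHIVE\b")
RCARC_RE = re.compile(r"RCARC=(-?\d+)")


def archive_status(log_text):
    """Map each file named in an ARCHIVE request to its RCARC code.

    The value is None when no RCARC line follows the request.
    """
    status = {}
    current = None
    for line in log_text.splitlines():
        request = ARCHIVE_RE.search(line)
        if request:
            current = request.group(1)
            status[current] = None
            continue
        code = RCARC_RE.search(line)
        if code and current is not None:
            status[current] = int(code.group(1))
    return status


def failed_files(status):
    """Return files whose return code is missing or non-zero."""
    return [name for name, code in status.items() if code != 0]


if __name__ == "__main__":
    print(failed_files(archive_status(SAMPLE_LOG)))
```

Running this over a full output file would list any months whose archive request never reported a zero return code, which may help narrow down where a run stopped archiving.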
