Opened 4 years ago

Closed 4 years ago

#1544 closed help (answered)

UMv7.7 crun fail on Monsoon

Reported by: sara123fenech Owned by: ros
Component: UM Model Keywords: nrun/crun
Cc: Platform: MONSooN
UM Version: 7.7

Description

Hi

I am trying to run a job having id xlhca which is a copy of a working one. I have run the nrun successfully for 1 month resulting in xlhca.fort6.pe1 and similar outputs.

However, when I switched the crun hand-edit back to Y and changed the Compilation and Modifications settings to run from the existing executable, my crun failed and I can no longer find the nrun output:

xlhca000.xlhca.d15104.t164708.leave

It seems as though all outputs of the nrun are being overwritten and not found once the crun is executed.

Any suggestions as to what could be the reason?

Thanks
Sara

Change History (8)

comment:1 Changed 4 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Sara,

You have moose archiving switched on which has failed as you don't have a moose account set up, so the CRUN has failed.

I would suggest switching off archiving for now by going to UMUI window Post-processing → Main switch & General Questions and selecting No archiving system. Also in this window you will see options to delete superseded files - you may or may not want these switched on. These are useful if you want dumps or PP-files that have been superseded (i.e. they are out of date because another has been written) to be deleted.

Also switch off the UM scripts build in window Compilation & Modifications → UM Scripts build by deselecting Enable build of UM scripts.

Resubmit the job as an NRUN and then retry switching on the CRUN hand-edit.

Regards,
Ros.

comment:2 Changed 4 years ago by sara123fenech

Dear Ros

Thanks for your reply.

I have changed the options you suggested to switch off archiving; however, now not even the nrun is running properly. The nrun is taking very long to terminate (it should only take a maximum of 3 hrs). Any idea why this is the case?

The 2 leave files which are produced are:

2105280 Apr 22 13:35 xlhca000.xlhca.d15112.t104456.leave
2314 Apr 22 13:37 xlhca000.xlhca.d15112.t110733.leave

Could it be that I also have to switch off some hand edits?
Also how can I create a moose account please?

Regards
Sara

comment:3 Changed 4 years ago by ros

Hi Sara,

I'm a little confused: the .leave files you list above indicate that the job ran fine through to completion in less than 3 hours.

xlhca000.xlhca.d15112.t104456.leave - Ran for 30 days and completed in ~2.5 hours
xlhca000.xlhca.d15112.t110733.leave - Ran for 31 days and completed in ~2.5 hours

If you need to archive to MASS you will need to contact the MONSooN team to request that an account be set up for you. Please email monsoon@…

Regards,
Ros.

comment:4 Changed 4 years ago by sara123fenech

Hi Ros

Sorry for not explaining clearly what's happening.

So I switched off archiving as suggested and also switched off the crun handedit. I also set the compilation and modifications options to compile and run.

When I submit this nrun, I get all these .leave files

-rw-r--r--. 1 sfenec users 4917197 Apr 22 11:06 xlhca000.xlhca.d15112.t104456.comp.leave
-rw-r--r--. 1 sfenec users 2105280 Apr 22 13:35 xlhca000.xlhca.d15112.t104456.leave
-rw-r--r--. 1 sfenec users 1725429 Apr 22 16:12 xlhca000.xlhca.d15112.t110733.leave
-rw-r--r--. 1 sfenec users 1647036 Apr 22 19:15 xlhca000.xlhca.d15112.t133747.leave
-rw-r--r--. 1 sfenec users 1673641 Apr 22 21:51 xlhca000.xlhca.d15112.t164359.leave
-rw-r--r--. 1 sfenec users 1733189 Apr 23 00:28 xlhca000.xlhca.d15112.t191759.leave
-rw-r--r--. 1 sfenec users 1799432 Apr 23 03:00 xlhca000.xlhca.d15112.t215358.leave
-rw-r--r--. 1 sfenec users 1761339 Apr 23 05:34 xlhca000.xlhca.d15113.t003024.leave
-rw-r--r--. 1 sfenec users 1661931 Apr 23 08:06 xlhca000.xlhca.d15113.t030229.leave
-rw-r--r--. 1 sfenec users 3955 Apr 23 09:29 xlhca000.xlhca.d15113.t053459.leave
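
When comparing many .leave files like these, it can help to turn the stamp in each file name into a real date and time. The sketch below is only an illustration and rests on an assumption not stated in this ticket: that the dYYDDD part is a two-digit year followed by the day of the year, and the tHHMMSS part is a time of day. That reading is at least consistent with the listing, where the d15112 files are dated Apr 22 (day 112 of 2015). The function name leave_timestamp is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout of the stamp inside a .leave file name:
#   .dYYDDD.tHHMMSS.   (two-digit year, day of year, time of day)
STAMP = re.compile(r"\.d(\d{5})\.t(\d{6})\.")


def leave_timestamp(filename):
    """Return the date and time encoded in a .leave file name."""
    match = STAMP.search(filename)
    if match is None:
        raise ValueError("no dYYDDD.tHHMMSS stamp in " + filename)
    day_part, time_part = match.groups()
    # %y = two-digit year, %j = day of year (001-366)
    return datetime.strptime(day_part + " " + time_part, "%y%j %H%M%S")


if __name__ == "__main__":
    for name in (
        "xlhca000.xlhca.d15112.t104456.comp.leave",
        "xlhca000.xlhca.d15112.t104456.leave",
        "xlhca000.xlhca.d15113.t003024.leave",
    ):
        print(name, leave_timestamp(name).isoformat(sep=" "))
```

Sorting the file names by this value, rather than alphabetically, keeps them in stamp order across a change of day.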

The job was still running but I stopped it, as I assumed that the nrun should only run for a month as indicated in the UMUI (i.e. for just 2.30 hrs, without resubmitting).

What is confusing me is why are so many .leave files produced for the nrun and why am I not getting any .pm output.

Hope it is more clear now.

Thanks
Sara

comment:5 Changed 4 years ago by ros

Hi Sara,

Ah ok - I'm with you now! :-) Looking at the job, I think this is due to the hand-edit
~hadwr/jobfiles/xjqgo/hand_edits/setup_Glob_resub.ed. It looks like this is doing the resubmission. Please try switching this off and resubmitting the NRUN. Hopefully that will fix it and the job will stop after the first month.

The reason you are not seeing a build up of .pm files is because you have Delete superseded restart dumps, PP files and Climate means files switched on in the Post-processing → Main switch & General Questions window.

Regards,
Ros.

comment:6 Changed 4 years ago by sara123fenech

Hi Ros

Thanks a lot for your help. The suggested changes have worked, however I now encountered another problem. The crun has been running and stopped after 3 years of simulation time. The job duration was set to 6 years and 1 month.

I have tried to look at the last .leave file however couldn't figure out what went wrong.
The file is:
xlhca035.xlhca.d15124.t095140.leave

Thanks again
Sara

comment:7 Changed 4 years ago by grenville

Sara

Your job is running with a 365 day calendar and all the ancillary files are on the 360 day calendar. This inconsistency may be the source of the problem. You can change the model to use the 360 day calendar (Input Output Control..→General Config and control). You will need to reconfigure the start file to set the header information correctly. Has this job been run successfully by the MO?
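
To illustrate why the two calendars do not line up, here is a minimal sketch that lists the dates which exist in only one of them. It is not taken from the UM. It assumes that the 360 day calendar has twelve 30-day months and that the 365 day option follows Gregorian month lengths; the year 2015 is an arbitrary example, and all function names are hypothetical.

```python
import calendar


def valid_in_360_day(month, day):
    """True if (month, day) exists when every month has 30 days."""
    return 1 <= month <= 12 and 1 <= day <= 30


def valid_in_gregorian(year, month, day):
    """True if (year, month, day) exists in the Gregorian calendar."""
    if not 1 <= month <= 12:
        return False
    return 1 <= day <= calendar.monthrange(year, month)[1]


def mismatched_dates(year):
    """Return (month, day) pairs that exist in exactly one calendar."""
    pairs = []
    for month in range(1, 13):
        for day in range(1, 32):
            in_360 = valid_in_360_day(month, day)
            in_greg = valid_in_gregorian(year, month, day)
            if in_360 != in_greg:
                pairs.append((month, day))
    return pairs


if __name__ == "__main__":
    for month, day in mismatched_dates(2015):
        print("month %02d day %02d exists in only one calendar" % (month, day))
```

Under these assumptions, the 31st of any month has no counterpart in the 360 day calendar, while 30 February exists only there, so data stamped on one calendar cannot always be matched to a date on the other.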

Grenville

comment:8 Changed 4 years ago by ros

  • Resolution set to answered
  • Status changed from accepted to closed

No further comment has been made on this ticket in the last few weeks, so it is now being closed.