Opened 3 weeks ago

Last modified 2 weeks ago

#3505 new help

Archive Integrity Issue: Dataset incomplete - holes present

Reported by: jmw240 Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version: 11.5

Description

Hi,

I'm running a vn11.5 PI AMIP job on Monsoon (u-cb739) with 3-month cycling. At the end of the first 3-month cycle, atmos_main, postproc and housekeeping all succeeded, but archive_integrity failed with the error message below, even after I retriggered it several times.

[WARN] file:atmospp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:archer_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:pptransfer
[WARN] Collection ap4.pp - Unexpected files in the archive:

cb739a.p41850apr.pp
cb739a.p41850aug.pp
cb739a.p41850jul.pp
cb739a.p41850jun.pp
cb739a.p41850mar.pp
cb739a.p41850may.pp
cb739a.p41850nov.pp
cb739a.p41850oct.pp
cb739a.p41850sep.pp

[WARN] Collection ap5.pp - Unexpected files in the archive:

cb739a.p51850apr.pp
cb739a.p51850aug.pp
cb739a.p51850jul.pp
cb739a.p51850jun.pp
cb739a.p51850mar.pp
cb739a.p51850may.pp
cb739a.p51850nov.pp
cb739a.p51850oct.pp
cb739a.p51850sep.pp

[WARN] Collection ap6.pp - Unexpected files in the archive:

cb739a.p618500321.pp
cb739a.p618500401.pp
cb739a.p618500411.pp
cb739a.p618500421.pp
cb739a.p618500501.pp
cb739a.p618500511.pp
cb739a.p618500521.pp
cb739a.p618500601.pp
cb739a.p618500611.pp
cb739a.p618500621.pp
cb739a.p618500701.pp
cb739a.p618500711.pp
cb739a.p618500721.pp
cb739a.p618500801.pp
cb739a.p618500811.pp
cb739a.p618500821.pp
cb739a.p618500901.pp
cb739a.p618500911.pp
cb739a.p618500921.pp
cb739a.p618501001.pp
cb739a.p618501011.pp
cb739a.p618501021.pp
cb739a.p618501101.pp
cb739a.p618501111.pp
cb739a.p618501121.pp
cb739a.p618501201.pp
cb739a.p618501211.pp

[WARN] Collection ap7.pp - Unexpected files in the archive:

cb739a.p718500321.pp
cb739a.p718500401.pp
cb739a.p718500411.pp
cb739a.p718500421.pp
cb739a.p718500501.pp
cb739a.p718500511.pp
cb739a.p718500521.pp
cb739a.p718500601.pp
cb739a.p718500611.pp
cb739a.p718500621.pp
cb739a.p718500701.pp
cb739a.p718500711.pp
cb739a.p718500721.pp
cb739a.p718500801.pp
cb739a.p718500811.pp
cb739a.p718500821.pp
cb739a.p718500901.pp
cb739a.p718500911.pp
cb739a.p718500921.pp
cb739a.p718501001.pp
cb739a.p718501011.pp
cb739a.p718501021.pp
cb739a.p718501101.pp
cb739a.p718501111.pp
cb739a.p718501121.pp
cb739a.p718501201.pp
cb739a.p718501211.pp

[WARN] Collection ap8.pp - Unexpected files in the archive:

cb739a.p818500321.pp
cb739a.p818500401.pp
cb739a.p818500411.pp
cb739a.p818500421.pp
cb739a.p818500501.pp
cb739a.p818500511.pp
cb739a.p818500521.pp
cb739a.p818500601.pp
cb739a.p818500611.pp
cb739a.p818500621.pp
cb739a.p818500701.pp
cb739a.p818500711.pp
cb739a.p818500721.pp
cb739a.p818500801.pp
cb739a.p818500811.pp
cb739a.p818500821.pp
cb739a.p818500901.pp
cb739a.p818500911.pp
cb739a.p818500921.pp
cb739a.p818501001.pp
cb739a.p818501011.pp
cb739a.p818501021.pp
cb739a.p818501101.pp
cb739a.p818501111.pp
cb739a.p818501121.pp
cb739a.p818501201.pp
cb739a.p818501211.pp

[WARN] Collection apa.pp - Unexpected files in the archive:

cb739a.pa1850apr.pp
cb739a.pa1850aug.pp
cb739a.pa1850jul.pp
cb739a.pa1850jun.pp
cb739a.pa1850mar.pp
cb739a.pa1850may.pp
cb739a.pa1850nov.pp
cb739a.pa1850oct.pp
cb739a.pa1850sep.pp

[WARN] Collection apb.pp is unexpectedly present in the archive.
[WARN] Collection apc.pp is unexpectedly present in the archive.
[WARN] Collection apd.pp - Unexpected files in the archive:

cb739a.pd1850apr.pp
cb739a.pd1850aug.pp
cb739a.pd1850jul.pp
cb739a.pd1850jun.pp
cb739a.pd1850mar.pp
cb739a.pd1850may.pp
cb739a.pd1850nov.pp
cb739a.pd1850oct.pp
cb739a.pd1850sep.pp

[WARN] Collection ape.pp - Unexpected files in the archive:

cb739a.pe1850apr.pp
cb739a.pe1850aug.pp
cb739a.pe1850jul.pp
cb739a.pe1850jun.pp
cb739a.pe1850mar.pp
cb739a.pe1850may.pp
cb739a.pe1850nov.pp
cb739a.pe1850oct.pp
cb739a.pe1850sep.pp

[WARN] Collection apf.pp is unexpectedly present in the archive.
[WARN] Collection apg.pp is unexpectedly present in the archive.
[WARN] Collection api.pp is unexpectedly present in the archive.
[WARN] Collection apj.pp is unexpectedly present in the archive.
[WARN] Collection apk.pp - Unexpected files in the archive:

cb739a.pk1850apr.pp
cb739a.pk1850aug.pp
cb739a.pk1850jul.pp
cb739a.pk1850jun.pp
cb739a.pk1850mar.pp
cb739a.pk1850may.pp
cb739a.pk1850nov.pp
cb739a.pk1850oct.pp
cb739a.pk1850sep.pp

[WARN] Collection apm.pp - Unexpected files in the archive:

cb739a.pm1850apr.pp
cb739a.pm1850aug.pp
cb739a.pm1850jul.pp
cb739a.pm1850jun.pp
cb739a.pm1850mar.pp
cb739a.pm1850may.pp
cb739a.pm1850nov.pp
cb739a.pm1850oct.pp
cb739a.pm1850sep.pp

[WARN] Collection aps.pp is unexpectedly present in the archive.
[WARN] Collection apu.pp - Unexpected files in the archive:

cb739a.pu1850apr.pp
cb739a.pu1850aug.pp
cb739a.pu1850jul.pp
cb739a.pu1850jun.pp
cb739a.pu1850mar.pp
cb739a.pu1850may.pp
cb739a.pu1850nov.pp
cb739a.pu1850oct.pp
cb739a.pu1850sep.pp

[WARN] Collection ap9.pp is missing from the archive.
[FAIL] Dataset incomplete - holes present in moose:crum/u-cb739
[FAIL] Terminating PostProc
[FAIL] archive_integrity.py # return-code=1
2021-03-30T21:14:23Z CRITICAL - failed/EXIT
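
Because the log above is long, it can help to condense it into a per-collection summary before comparing it with what the suite is expected to archive. The following is a minimal sketch, assuming the log has been saved as plain text and that the warning lines always have the three shapes seen above; `summarise` and the sample text are illustrative names, not part of the postproc app.

```python
import re
from collections import defaultdict

# The three warning shapes that appear in the archive_integrity log above.
UNEXPECTED_FILES = re.compile(
    r"^\[WARN\] Collection (\S+) - Unexpected files in the archive:")
UNEXPECTED_COLLECTION = re.compile(
    r"^\[WARN\] Collection (\S+) is unexpectedly present in the archive")
MISSING_COLLECTION = re.compile(
    r"^\[WARN\] Collection (\S+) is missing from the archive")


def summarise(log_text):
    """Group archive_integrity warnings by collection (hypothetical helper)."""
    summary = {
        "unexpected_files": defaultdict(list),
        "unexpected_collections": [],
        "missing_collections": [],
    }
    current = None  # collection whose file list is currently being read
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate headers from file lists
        match = UNEXPECTED_FILES.match(line)
        if match:
            current = match.group(1)
            continue
        match = UNEXPECTED_COLLECTION.match(line)
        if match:
            summary["unexpected_collections"].append(match.group(1))
            current = None
            continue
        match = MISSING_COLLECTION.match(line)
        if match:
            summary["missing_collections"].append(match.group(1))
            current = None
            continue
        if line.startswith("["):
            current = None  # any other tagged line ends a file list
            continue
        if current is not None and line.endswith(".pp"):
            summary["unexpected_files"][current].append(line)
    return summary


# Shortened excerpt of the log above, used only to exercise the parser.
sample = """\
[WARN] Collection ap4.pp - Unexpected files in the archive:

cb739a.p41850apr.pp
cb739a.p41850aug.pp

[WARN] Collection apb.pp is unexpectedly present in the archive.
[WARN] Collection ap9.pp is missing from the archive.
[FAIL] Dataset incomplete - holes present in moose:crum/u-cb739
"""

result = summarise(sample)
for name, files in result["unexpected_files"].items():
    print(name, len(files), "unexpected files")
print("unexpected collections:", result["unexpected_collections"])
print("missing collections:", result["missing_collections"])
```

Separating "unexpected" from "missing" makes it easier to see that only one collection (ap9.pp) is reported as absent, while the rest are streams the verification step was not expecting.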

Even with archive_integrity marked as failed, the model continued running successfully until it reached the end of the year, where it did not fail but simply stopped (it was supposed to run for several decades). Eventually I set archive_integrity to 'succeeded' as a test, and this allowed the run to continue from the end of the year.

This issue looks similar to #3062, but I'm a bit confused: I do intend to output files on UP4, UP5, etc., and looking at the contents of MASS, there appears to be data in these streams.

Does the fact that atmos_main and postproc appear to be working fine mean I can just let this run continue, or is this a more serious problem?

Thanks for your help,

James

Change History (5)

comment:1 Changed 2 weeks ago by jmw240

Hi,

Just an update: the model now stops after every 3-month cycle because postproc and the next period's atmos_main get stuck on "submit-retrying". Retriggering them gets things going again, but it's not ideal as I have to keep an eye on the run all the time. Could this be related to the issue above?

Thanks,

James

comment:2 Changed 2 weeks ago by grenville

James

We suggest switching off archive integrity and setting all failed archive_integrity tasks to succeeded. No guarantees, though.

The submit-retrying problem should go away if you change

[[[remote]]]

host = $(rose host-select xcs-c)

to

[[[remote]]]

host = localhost

host = localhost
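
If the same setting appears in more than one place, the substitution can be checked on a copy of the file before editing by hand. This is only a sketch: it assumes the host line has exactly the `$(rose host-select ...)` form quoted above, and `use_localhost` is a hypothetical helper name.

```python
import re

# Matches a "host = $(rose host-select <group>)" line, keeping its indentation.
HOST_SELECT = re.compile(
    r"^([ \t]*host[ \t]*=[ \t]*)\$\(rose host-select [^)]*\)[ \t]*$",
    re.MULTILINE,
)


def use_localhost(rc_text):
    """Return rc_text with every rose host-select host replaced by localhost."""
    return HOST_SELECT.sub(r"\g<1>localhost", rc_text)


# Illustrative fragment only; the real file contains more settings.
before = "[[[remote]]]\n    host = $(rose host-select xcs-c)\n"
after = use_localhost(before)
print(after)
```

Lines that set `host` to anything else are left untouched, so the function is safe to run over the whole file text.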

Grenville

comment:3 Changed 2 weeks ago by jmw240

Thanks, Grenville. After changing $(rose host-select xcs-c) to localhost in the monsoon.rc file, how should I restart the run? Would it be rose suite-run --restart?

Best,

James

comment:4 Changed 2 weeks ago by grenville

Hi James

I'd set the status of the retrying task to failed, then run rose suite-run --reload and retrigger the failed task (that's probably not the only way).
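
As a rough illustration of that sequence, the sketch below only builds and prints the commands; it does not run anything. It assumes the suite is managed by Cylc 7 (where `cylc reset --state=...` and `cylc trigger` are available) and that `rose suite-run` is normally issued from the suite's source directory. The task ID shown is a made-up example, not taken from this suite.

```python
import shlex


def recovery_commands(suite, task_id):
    """Return the three steps as argument lists; nothing is executed."""
    return [
        # 1. Mark the task stuck in submit-retrying as failed.
        ["cylc", "reset", "--state=failed", suite, task_id],
        # 2. Reload the suite so the edited host setting is picked up.
        ["rose", "suite-run", "--reload", "--name=" + suite],
        # 3. Retrigger the task that was marked as failed.
        ["cylc", "trigger", suite, task_id],
    ]


# Hypothetical task ID in NAME.CYCLE_POINT form.
commands = recovery_commands("u-cb739", "postproc.18500401T0000Z")
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands first gives a chance to check the suite name and task ID before typing them into a shell.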

Grenville

comment:5 Changed 2 weeks ago by jmw240

Thanks, Grenville.
