Opened 9 months ago

Closed 7 months ago

Last modified 7 months ago

#3323 closed help (fixed)

Problems with post-processing on Monsoon

Reported by: aschurer
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 11.0

Description (last modified by ros)

Hi,
I'm running a version of the UKESM on Monsoon2.
exp u-bv174
It seems to have run OK for a few cycles for about 1 year but has failed on the post-processing part:

postproc_atmos
/home/d05/aschurer/cylc-run/u-bv174/log/job/08500101T0000Z/postproc_atmos/04/job.err

[ERROR]  Validity time mismatch in file /home/d05/aschurer/cylc-run/u-bv174/share/data/History_Data/bv174a.p40850feb to be archived
[FAIL]  Command Terminated
[FAIL] Terminating PostProc...
[FAIL] main_pp.py atmos # return-code=1
2020-07-14T07:31:49Z CRITICAL - failed/EXIT

postproc_cice
/home/d05/aschurer/cylc-run/u-bv174/log/job/08500101T0000Z/postproc_cice/04/job.err

[ERROR]  concat_daily_means: Cannot create month of daily means as only got 1 files:
['bv174i.10d_24h.0850-02-01-00000.nc']
[FAIL]  Command Terminated
[FAIL] Terminating PostProc...
[FAIL] main_pp.py cice # return-code=1
2020-07-14T07:37:25Z CRITICAL - failed/EXIT

I can't see any obvious problem as the data all appear to be there.
I'd be extremely grateful if you could tell me what the problem could be.
Is it likely to be a problem with the STASH? Or with the post-processing setup?

Many thanks,
Andrew

Change History (18)

comment:1 Changed 9 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Andrew,

I'm a little confused about what's going on with cice: the postproc is expecting three 10d files to create the mean from, but there is one for every day, hence the error message. Have you changed anything with cice or the post-processing from the original suite? If not, did you make sure the original suite ran OK before you made your changes?

I'm trying to run u-bv174; could you please give me read access to: /projects/ukesm/aschurer/ancils/OZONE/mmro3_monthly_CMIP6_849-1050_N96_edited.

Cheers,
Ros.

comment:2 Changed 9 months ago by ros

  • Description modified (diff)

comment:3 Changed 9 months ago by aschurer

Hi Ros,
Thanks for looking into this.
I've now hopefully given read permissions on all my ancillaries. Please let me know if you can't access anything else.
I changed some of the STASH calls and changed the forcings by changing ancillary files and some of the namelists.
As far as I know I did not change anything to do with cice or the post-processing commands (except changing sci-tools to umtools).
I didn't actually run the original suite before modifying it, which was probably a mistake, but I copied and ran it a couple of years ago, so I'm fairly sure it works fine. That is why I'm unsure what could have gone wrong.

Thanks,
Andrew

comment:4 Changed 9 months ago by ros

Hi Andrew,

I can see what's going on with postproc_cice - it doesn't like the 3-digit year. Rather than 0850 it's pattern matching on 850 and thus failing to find any of the files. I will take a look at the newer postproc version and see if it's fixed there; otherwise I will need to create a branch for you. I haven't looked yet, but I suspect the postproc_atmos failure will be a similar year issue.
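As a minimal sketch of this kind of mismatch (not the actual postproc code, which is not shown in this ticket), the following compares a pattern built from an unpadded year with one built from a zero-padded year. The two helper functions and the exact regular expression are hypothetical; only the file name is taken from the error log above.

```python
import re


def year_pattern_unpadded(year):
    # Hypothetical buggy form: the year loses its leading zero (850, not 0850).
    return r"^bv174i\.10d_24h\.%d-02-\d{2}-\d{5}\.nc$" % year


def year_pattern_padded(year):
    # Hypothetical fixed form: the year is always written with four digits.
    return r"^bv174i\.10d_24h\.%04d-02-\d{2}-\d{5}\.nc$" % year


# File name as reported in the postproc_cice job.err.
filename = "bv174i.10d_24h.0850-02-01-00000.nc"

unpadded_match = re.match(year_pattern_unpadded(850), filename) is not None
padded_match = re.match(year_pattern_padded(850), filename) is not None

print("unpadded pattern matches:", unpadded_match)
print("padded pattern matches:  ", padded_match)
```

With the unpadded year the pattern does not match the file on disk, which is consistent with the task reporting that it could not find the files it needed.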

Cheers,
Ros.

comment:5 Changed 9 months ago by ros

Hi Andrew,

I've managed to make a fix for you. Please include the branch branches/dev/rosalynhatcher/postproc_2.2_three_digit_year_fix@3649 in fcm_make_pp → configuration → pp_sources and then reload or restart the suite. Then retrigger the fcm_make_pp and fcm_make2_pp tasks to rebuild the postprocessing scripts.

I've run a couple of cycles and the postprocs appear to work fine now.

Regards,
Ros.

comment:6 Changed 9 months ago by aschurer

Hi Ros,
Thanks for creating this branch. I included it as suggested above.
I got into a bit of a mess re-triggering the job, so I decided to delete the work directory and start from the beginning (hope that was OK).
The job now stops at the archive_integrity task with the following error:


[SUBPROCESS]: Error = 255:
	Traceback (most recent call last):
  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/common/fcm/rose-2019.01.3/lib/python/rose/date.py", line 413, in <module>
    main()
  File "/common/fcm/rose-2019.01.3/lib/python/rose/date.py", line 351, in main
    _print_duration(date_time_oper, opts, args)
  File "/common/fcm/rose-2019.01.3/lib/python/rose/date.py", line 388, in _print_duration
    duration, sign = date_time_oper.date_diff(time_point_1, time_point_2)
  File "/common/fcm/rose-2019.01.3/lib/python/rose/date.py", line 225, in date_diff
    if time_point_2 < time_point_1:
  File "/common/fcm/rose-2019.01.3/lib/python/isodatetime/data.py", line 1341, in __cmp__
    other_date = other.get_calendar_date()
  File "/common/fcm/rose-2019.01.3/lib/python/isodatetime/data.py", line 884, in get_calendar_date
    self.day_of_year)
  File "/common/fcm/rose-2019.01.3/lib/python/isodatetime/data.py", line 1805, in get_calendar_date_from_ordinal_date
    raise ValueError("Bad ordinal date: %s-%03d" % (year, day_of_year))
ValueError: Bad ordinal date: 8500-401

This looks like it could be linked to the three-digit year. Could you advise how to fix this?

Many thanks,
Andrew


comment:7 Changed 9 months ago by ros

Hi Andrew,

Yes, it's definitely passing the rose command the wrong date. :-( I'll take a look and get back to you with a fix.
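One plausible reading of "Bad ordinal date: 8500-401" (an assumption, not something confirmed in this ticket) is that the date 08500401 lost its leading zero, and the remaining seven digits were then read as an ISO 8601 ordinal date (YYYYDDD) rather than a calendar date (YYYYMMDD). The toy classifier below is hypothetical and is not the isodatetime parser; it only illustrates how the digit count changes the interpretation.

```python
def interpret_basic_date(text):
    """Classify an all-digit ISO 8601 basic date string by its length."""
    if len(text) == 8:
        # YYYYMMDD: calendar date.
        return "calendar %s-%s-%s" % (text[:4], text[4:6], text[6:])
    if len(text) == 7:
        # YYYYDDD: ordinal date (year plus day of year).
        year, day_of_year = int(text[:4]), int(text[4:])
        if not 1 <= day_of_year <= 366:
            raise ValueError("Bad ordinal date: %04d-%03d" % (year, day_of_year))
        return "ordinal %04d-%03d" % (year, day_of_year)
    raise ValueError("Unrecognised date: %s" % text)


padded = "%04d%02d%02d" % (850, 4, 1)    # eight digits
unpadded = "%d%02d%02d" % (850, 4, 1)    # seven digits: leading zero lost

print(interpret_basic_date(padded))

try:
    interpret_basic_date(unpadded)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Under that assumption, formatting the year with `%04d` wherever the date string is built would avoid the misreading.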

Cheers,
Ros.

comment:8 Changed 9 months ago by ros

Hi Andrew,

I've updated the branch. You'll just need to change the revision number to 3669 in the fcm_make_pp → configuration → pp_sources panel.

Then do rose suite-run --reload

Retrigger the fcm_make_pp and fcm_make2_pp tasks, and finally the archive_integrity task.

Cheers,
Ros.

comment:9 Changed 9 months ago by aschurer

Hi Ros,

It's unfortunately still failing (although with a different error).
It still looks linked to the problem of having a three-digit year though (see below for the contents of the .err file for archive_integrity).

Out of interest, how do you re-trigger a task which has become clear with a dashed line (i.e. the completed fcm_make tasks)? I could not work out how to do this as you suggested, so I started from the beginning again.

Many thanks,
Andrew

[WARN] file:atmospp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:pptransfer
[WARN]  Collection oni.nc.file - Unexpected files in the archive:
	bv174o_trajectory_icebergs_08500101-08500401.nc
[WARN]  Collection ind.nc.file is missing from the archive.
[WARN]  Collection inm.nc.file is missing from the archive.
[WARN]  Collection oni.nc.file - Files missing from the archive:
	bv174o_trajectory_icebergs_8500101-8500401.nc
[WARN]  Collection onm.nc.file is missing from the archive.
[FAIL]  Dataset incomplete - holes present in moose:crum/u-bv174
[FAIL] Terminating PostProc...
[FAIL] archive_integrity.py # return-code=1
2020-07-28T15:54:09Z CRITICAL - failed/EXIT

comment:10 Changed 9 months ago by ros

Hi Andrew,

Sorry, I missed that one in amongst my debug statements. :-( Branch updated with revision 3670.

Reload and retrigger the fcm_make(2)_pp tasks.

I don't know what state the task was in in your cylc GUI - it should be grey (succeeded). If, when you right-click on the fcm_make_pp task, "Trigger (run now)" is greyed out, try changing the task state to failed and then retrigger.

The archive_integrity task does still fail with a couple of collections missing. It doesn't look date related. You might want to check that it has archived what you are expecting. It may be that running the archive_integrity app is not appropriate - at postproc_2.2 it is only experimental. If you need me to take a look at why it is expecting to find the ind, inm & onm collections, let me know.
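As a rough illustration of the kind of check being described (a sketch only, not the archive_integrity implementation), the expected and archived file names can be compared as sets. The function name `compare_archive` is hypothetical; the two file names are the ones from the oni.nc.file warnings in the previous comment.

```python
def compare_archive(expected, archived):
    """Return (missing, unexpected) file names as sorted lists."""
    expected, archived = set(expected), set(archived)
    missing = sorted(expected - archived)       # expected but not in the archive
    unexpected = sorted(archived - expected)    # in the archive but not expected
    return missing, unexpected


# Name the check was looking for (year written without its leading zero).
expected = ["bv174o_trajectory_icebergs_8500101-8500401.nc"]
# Name actually present in the archive.
archived = ["bv174o_trajectory_icebergs_08500101-08500401.nc"]

missing, unexpected = compare_archive(expected, archived)
print("Files missing from the archive:", missing)
print("Unexpected files in the archive:", unexpected)
```

A single file whose expected name is built differently from its archived name shows up twice, once as missing and once as unexpected, which is the pattern seen in the log.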

Cheers,
Ros.

comment:11 Changed 9 months ago by aschurer

Hi Ros,
Thanks again for your help with this.

A previous job I set up based on the same experiment (but with a 4-digit year) archived these so I'm fairly sure it should be doing this and should have the files it needs.

I've had a look through the log files
e.g. /home/d05/aschurer/cylc-run/u-bv174/log/job/08500101T0000Z/postproc_cice/NN/job.out
and it seems to me to be a problem with the function "create_means" for cice as it is not producing the output that a previous job did.
e.g. compare to /home/d05/aschurer/cylc-run/u-bf095/log/job/19501001T0000Z/postproc_cice/NN/job.out

Could it be another problem with a function failing to find the model files to process due to a three figure year?

Thanks,
Andrew

comment:12 Changed 9 months ago by ros

Hi Andrew,

Ah yes, I found a problem with the renaming of some of the cice files which I've just fixed. I've now taken a copy of your suite and will run it overnight and compare to your other suite output to see if there's anything more outstanding and check if archive_integrity works. I'm not going to hold my breath. ;-)

Cheers,
Ros.

comment:13 Changed 9 months ago by ros

Hi Andrew,

Good news. I've run the first 2 cycles successfully and archive_integrity is happy that everything that should be in MASS is there.

Revision number for the postproc branch is now: 3677

Cheers,
Ros.

comment:14 Changed 8 months ago by aschurer

Hi Ros,

The job managed to run successfully through several cycles (~10 years).

Unfortunately, however, the archive_integrity task has failed, and I can't see an issue with the files in the archive.

The error message looks strange as it seems to be looking for files that can't exist, so again it looks like it may be an issue with the dates:
/home/d05/aschurer/cylc-run/u-bv174/log/job/08600101T0000Z/archive_integrity/NN/job.err

[WARN] Collection oni.nc.file - Unexpected files in the archive:

bv174o_trajectory_icebergs_08501001-08510101.nc
bv174o_trajectory_icebergs_08511001-08520101.nc
bv174o_trajectory_icebergs_08521001-08530101.nc
bv174o_trajectory_icebergs_08531001-08540101.nc
bv174o_trajectory_icebergs_08541001-08550101.nc
bv174o_trajectory_icebergs_08551001-08560101.nc
bv174o_trajectory_icebergs_08561001-08570101.nc
bv174o_trajectory_icebergs_08571001-08580101.nc
bv174o_trajectory_icebergs_08581001-08590101.nc
bv174o_trajectory_icebergs_08591001-08600101.nc

[WARN] Collection apy.pp is missing from the archive.
[WARN] Collection oni.nc.file - Files missing from the archive:

bv174o_trajectory_icebergs_08501001-08500101.nc
bv174o_trajectory_icebergs_08511001-08510101.nc
bv174o_trajectory_icebergs_08521001-08520101.nc
bv174o_trajectory_icebergs_08531001-08530101.nc
bv174o_trajectory_icebergs_08541001-08540101.nc
bv174o_trajectory_icebergs_08551001-08550101.nc
bv174o_trajectory_icebergs_08561001-08560101.nc
bv174o_trajectory_icebergs_08571001-08570101.nc
bv174o_trajectory_icebergs_08581001-08580101.nc
bv174o_trajectory_icebergs_08591001-08590101.nc

[FAIL] Dataset incomplete - holes present in moose:crum/u-bv174

Thanks again for all your help,
Andrew

comment:15 Changed 8 months ago by ros

Hi Andrew,

Yes, it looks like another date issue with the yearly/decadals. It's not easy to catch them all without running the whole simulation. I'm trying to avoid having to run it for 10 years but suspect I may have to. I'll take a look on Monday.

Cheers,
Ros.

comment:16 Changed 8 months ago by ros

Hi Andrew,

So thankfully I didn't have to run out a decade and could use your MASS set.

So we had 2 problems here:

  1. The trajectory_icebergs date was a typo on my part, telling it to use the current year, not the next year! I've fixed that in the branch at rev 3693, so for future new runs please use that revision. As it's only a one-line change, rather than rebuilding the scripts just copy my file into your suite's cylc-run directory:
    cp ~rhatcher/expected_content.py ~/cylc-run/u-bv174/share/fcm_make_pp/build/bin
    

Retrigger the failed archive_integrity task. That will get rid of the oni.nc.file warnings.
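The year slip in problem 1 can be sketched as follows. This is not the code in expected_content.py (the one-line change itself is not shown in this ticket); the function `iceberg_name`, its `bug` flag and the 3-month cycle length are assumptions made for illustration, with the cycle length inferred from the file names in the log.

```python
def iceberg_name(prefix, year, month, cycle_months=3, bug=False):
    """Build a trajectory_icebergs file name for one cycle (hypothetical helper)."""
    end_month = month + cycle_months
    end_year = year
    if end_month > 12:
        end_month -= 12
        if not bug:
            # The cycle crosses the year boundary, so the end date is in the
            # next year. Skipping this carry reproduces the reported slip.
            end_year += 1
    return "%so_trajectory_icebergs_%04d%02d01-%04d%02d01.nc" % (
        prefix, year, month, end_year, end_month)


correct = iceberg_name("bv174", 850, 10)
buggy = iceberg_name("bv174", 850, 10, bug=True)

print("archived name:", correct)
print("expected name with the slip:", buggy)
```

Only the October cycles cross a year boundary, which would explain why the first two cycles passed the check and the problem only appeared later in the run.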

  2. The missing apy.pp set is more complicated. The seasonal and yearly means are created by a separate script contained within your suite, not within postproc. So basically you have not created any yearly means, as the create_means script in every cycle didn't think it had anything to do, courtesy of yet another 3-digit year issue. :-(

Assuming you want the seasonal & yearly means, you're going to have to manually re-insert the create_means task into every October cycle up to the current point and retrigger them.

The following worked on my test suite:

Copy my modified create_means.py script to your suite and its corresponding cylc-run directory:

cp ~rhatcher/create_means.py ~/roses/u-bv174/bin
cp ~rhatcher/create_means.py ~/cylc-run/u-bv174/bin

On the command line reinsert the create_means task:

cylc insert u-bv174 create_means.08501001T0000Z

The old cycle should reappear at the top with the create_means task in the waiting state (there can be a short delay before it appears). Right-click on it and select "Trigger (run now)".

Check the job.out log file before proceeding to make sure it has created the seasonal means for the first year and put them into MASS (moo ls moose:crum/u-bv174/aps.pp).

Repeat the cylc insert command for each year up to your current cycle. Note the task only needs to be run in the October cycles. You should find it will create the apy.pp collection in the 08511001T0000Z cycle.
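To avoid typing the cycle points by hand, the commands can be generated with a short script. This is only a sketch: the function name `create_means_inserts` is made up, and the year range 850 to 859 is an assumption based on the failed 08600101T0000Z cycle, so adjust it to your own current cycle. It prints the commands rather than running them, so each one can still be run and checked in turn as described above.

```python
def create_means_inserts(suite, first_year, last_year):
    """List one `cylc insert` command per October cycle, oldest first."""
    return [
        # %04d keeps the leading zero in three-digit years such as 0850.
        "cylc insert %s create_means.%04d1001T0000Z" % (suite, year)
        for year in range(first_year, last_year + 1)
    ]


commands = create_means_inserts("u-bv174", 850, 859)
for command in commands:
    print(command)
```

The first line printed is the same command as the one given above for the 08501001T0000Z cycle.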

Cheers,
Ros.

comment:17 Changed 7 months ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Hi Andrew,

I assume this is all working now and so will close this ticket.

Regards,
Ros.

comment:18 Changed 7 months ago by aschurer

Hi Ros,
Yes this is now working.
Many thanks for all your help,
Andrew
