#3289 closed help (fixed)

Multiple failures since NEXCS return

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: NEXCS
UM Version: 10.7

Description

Hi,

Sorry to bother you, but ever since the NEXCS failure over the last couple of days and its return this morning, some of my tasks have not been submitting/running and have been giving me errors I haven't seen before. My suite is br871 and has been stable for over 300 model years so far, so I can't believe this is a scientific problem.

Currently, my coupled stage of year 2217 has failed, giving me the following error:

BUFFIN: Read Failed: Inappropriate ioctl for device

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 22
?  Error from routine: io:buffin
?  Error message: Error in buffin errorCode= 0.00 len=17408/28160
?  Error from processor: 0
?  Error number: 13
????????????????????????????????????????????????????????????????????????????????

Likewise, my postproc_nemo task of the preceding year has also failed, giving me the following error:

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
Traceback (most recent call last):
  File "/home/d05/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/main_pp.py", line 123, in <module>
    main()
  File "/home/d05/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/main_pp.py", line 116, in main
    run_postproc()
  File "/home/d05/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/main_pp.py", line 80, in run_postproc
    getattr(model, meth)()
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/modeltemplate.py", line 757, in create_means
    fn_full)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/modeltemplate.py", line 1156, in fix_mean_time
    do_bounds=self.naml.processing.correct_time_bounds_variables)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/netcdf_utils.py", line 134, in fix_times
    mean_time_var[:] = correct_time(meanset, time_var, units, calendar)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/netcdf_utils.py", line 109, in correct_time
    dates_in = [time_var_to_date(fname, time_var) for fname in meanset]
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/netcdf_utils.py", line 103, in time_var_to_date
    date = num2date(time[:], time_units, time_cal)
  File "utils.pyx", line 252, in netCDF4.num2date (netCDF4.c:5968)
  File "/opt/python/gnu/2.7.9/lib/python2.7/site-packages/netCDF4-1.1.5-py2.7-linux-x86_64.egg/netcdftime/netcdftime.py", line 833, in num2date
    date = _DateFrom360Day(jd)
  File "/opt/python/gnu/2.7.9/lib/python2.7/site-packages/netCDF4-1.1.5-py2.7-linux-x86_64.egg/netcdftime/netcdftime.py", line 477, in _DateFrom360Day
    (F, Z) = math.modf(JD)
TypeError: only length-1 arrays can be converted to Python scalars
[FAIL] main_pp.py nemo # return-code=1
2020-06-05T09:58:08Z CRITICAL - failed/EXIT

What do either of these mean?

Thanks,

Charlie

Change History (11)

comment:1 Changed 10 months ago by grenville

Charlie

It appears that a partial sum file has gone wrong.
I can only suggest that you rerun from 22160101T0000Z. You will need to organise start files and history file as described in https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral (Restarting Failing Suites) — I think you have done that before. Then

rose suite-run --warm 22160101T0000Z

If this doesn't work, I can only suggest starting a new suite from a suitable point and sorting out climate means manually.

Grenville

comment:2 Changed 10 months ago by charlie

Hi Grenville,

Sorry, do you mean the "Restarting from archived restarts" section, rather than the "Restarting Failing Suites" section? I have done the former many many times, but I have not done the latter ever before, and it doesn't seem relevant to my particular error. Or am I not understanding something?

Charlie

comment:3 Changed 10 months ago by grenville

Hi Charlie

Sorry for the slow reply.

I did mean Restarting Failing Suites, but as you can no doubt tell, I am not certain that the suite will restart - when partial sum files go out of sync, it's hard to tell.

Starting a new suite from a suitable point in the past might be the better alternative. You should be able to judge at what point in time to start from by looking at what diagnostics have been written successfully. The suite is configured to create decadal means - you might need to create those from monthly or seasonal means rather than backing up too far.

Grenville

comment:4 Changed 10 months ago by charlie

Hi Grenville,

Very many thanks, that's what I will do, I think.

Given that I am not particularly interested in decadal means, and even if I were, I could easily recreate these manually at a later date, can I not just go back to the previous year and restart from there? In other words, the model failed at year 2216, so can I not go back to 2215 (saving the original version of this, so I don't get a gap) and then run from there? I appreciate this might mean that this particular decadal mean is not calculated correctly, but as I said that doesn't really matter to me, as long as it runs from the technical side.

Charlie

comment:5 Changed 10 months ago by grenville

Hi Charlie

Yes, that's probably OK. Bear in mind seasonal and yearly means too.

(thanks for the new key)

Grenville

comment:6 Changed 10 months ago by charlie

Hi Grenville,

Sorry for the delay in getting back to you. I have now done as you suggested, rewinding to 2215 (the year before the suite failed) and restarting from this. It successfully completed 2215, but has yet again failed at 2216 (where it failed before) giving me the following error:

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 03680] [c5-1c0s8n0] [Mon Jun 15 23:09:58 2020] PE RANK 1010 exit signal Aborted
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
[NID 03680] 2020-06-15 23:09:58 Apid 108351778: initiated application termination
[FAIL] run_model # return-code=137
2020-06-15T23:10:01Z CRITICAL - failed/EXIT

Again, I can't believe this is a science problem, because as I said this suite has run successfully for over 200 years without any problems whatsoever. So clearly, based on what you said above, something technical is still going wrong here.

Interestingly, this appears to be specific to this suite. I know this because I started another suite yesterday (bv241), from exactly the same starting point in 2215, and it is running fine and has got well beyond 2216.

Do you have any other ideas as to how I can get br871 running again?

Many thanks,

Charlie
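
For reference, the return-code=137 reported by run_model in the log above can be read with the common convention that a code above 128 means "terminated by signal (code - 128)". The sketch below is only an illustration of that convention, assuming the job wrapper follows it; the helper name describe_return_code is hypothetical and not part of Rose, Cylc or the UM.

```python
import signal


def describe_return_code(code):
    """Describe a return code using the common 128 + N signal convention.

    Hypothetical helper for illustration only.
    """
    if code > 128:
        try:
            # Codes above 128 usually mean "killed by signal (code - 128)".
            return "terminated by " + signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number exists on this platform.
            return "exit status %d (no matching signal)" % code
    return "exit status %d" % code


print(describe_return_code(137))
```

On Linux, 137 corresponds to signal 9 (SIGKILL), which is consistent with the application being terminated from outside after one rank aborted, rather than the model exiting with its own error code.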

comment:7 Changed 10 months ago by charlie

Hi Grenville,

Now that NEXCS has (finally) returned, would you be able to take another quick look at the above problem?

Many thanks,

Charlie

comment:8 Changed 10 months ago by grenville

Charlie

It's not the same problem - it's a problem with the NEMO namelist_cfg file. It's out of sync. In general you should remove the work directories dated after the cycle you are starting at. The NEMO drivers look in the latest work directory (or did); they are not cycle aware (or weren't) - not sure what to advise just yet.

Grenville

comment:9 Changed 10 months ago by grenville

Charlie

It might have been simpler to have started a whole new suite for 2215 rather than starting br871 the way it was - please try deleting /home/d05/cwilliams/cylc-run/u-br871/work/22170101T0000Z and retrigger the 2216 coupled task.

Grenville
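
The general advice above (remove work directories dated after the cycle being restarted) can be checked before anything is deleted. The following is a minimal sketch, assuming the suite's work directory contains one subdirectory per cycle point named like 22170101T0000Z; the function name later_work_dirs is hypothetical, and the sketch only lists candidates, it does not delete anything.

```python
import re
from pathlib import Path

# Cycle point directory names of the form seen in this ticket, e.g. 22170101T0000Z.
CYCLE_RE = re.compile(r"^\d{8}T\d{4}Z$")


def later_work_dirs(work_root, restart_cycle):
    """Return cycle work directories dated after restart_cycle, oldest first.

    Hypothetical helper for illustration only; nothing is removed.
    """
    if not CYCLE_RE.match(restart_cycle):
        raise ValueError("unexpected cycle point format: %r" % restart_cycle)
    later = [
        path
        for path in Path(work_root).iterdir()
        # Fixed-width cycle names compare correctly as plain strings.
        if path.is_dir() and CYCLE_RE.match(path.name) and path.name > restart_cycle
    ]
    return sorted(later, key=lambda path: path.name)
```

Calling later_work_dirs on the suite's work directory with the cycle being retriggered (22160101T0000Z in this ticket) would list any later directories, such as 22170101T0000Z, so they can be reviewed before being removed by hand.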

comment:10 Changed 10 months ago by charlie

Thanks Grenville, and I apologise for the delay - I wasn't able to check this until Jasmin returned properly earlier on. Anyway, that appears to have worked, and my suites are now running again.

I will close the ticket.

Many thanks,

Charlie

comment:11 Changed 10 months ago by charlie

  • Resolution set to fixed
  • Status changed from new to closed