Opened 11 months ago
Closed 10 months ago
#3289 closed help (fixed)
Multiple failures since NEXCS return
Reported by: | charlie | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | |
Cc: | | Platform: | NEXCS |
UM Version: | 10.7 | | |
Description
Hi,
Sorry to bother you, but ever since the NEXCS failure over the last couple of days and its return this morning, some of my tasks have been failing to submit or run, giving me errors I haven't seen before. My suite is br871 and has been stable for over 300 model years so far, so I can't believe this is a scientific problem.
Currently, my coupled stage of year 2217 has failed, giving me the following error:
BUFFIN: Read Failed: Inappropriate ioctl for device
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 22
? Error from routine: io:buffin
? Error message: Error in buffin errorCode= 0.00 len=17408/28160
? Error from processor: 0
? Error number: 13
????????????????????????????????????????????????????????????????????????????????
Likewise, my postproc_nemo task of the preceding year has also failed, giving me the following error:
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
Traceback (most recent call last):
  File "/home/d05/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/main_pp.py", line 123, in <module>
    main()
  File "/home/d05/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/main_pp.py", line 116, in main
    run_postproc()
  File "/home/d05/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/main_pp.py", line 80, in run_postproc
    getattr(model, meth)()
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/modeltemplate.py", line 757, in create_means
    fn_full)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/modeltemplate.py", line 1156, in fix_mean_time
    do_bounds=self.naml.processing.correct_time_bounds_variables)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/netcdf_utils.py", line 134, in fix_times
    mean_time_var[:] = correct_time(meanset, time_var, units, calendar)
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/netcdf_utils.py", line 109, in correct_time
    dates_in = [time_var_to_date(fname, time_var) for fname in meanset]
  File "/projects/nexcs-n02/cwilliams/cylc-run/u-br871/share/fcm_make_pp/build/bin/netcdf_utils.py", line 103, in time_var_to_date
    date = num2date(time[:], time_units, time_cal)
  File "utils.pyx", line 252, in netCDF4.num2date (netCDF4.c:5968)
  File "/opt/python/gnu/2.7.9/lib/python2.7/site-packages/netCDF4-1.1.5-py2.7-linux-x86_64.egg/netcdftime/netcdftime.py", line 833, in num2date
    date = _DateFrom360Day(jd)
  File "/opt/python/gnu/2.7.9/lib/python2.7/site-packages/netCDF4-1.1.5-py2.7-linux-x86_64.egg/netcdftime/netcdftime.py", line 477, in _DateFrom360Day
    (F, Z) = math.modf(JD)
TypeError: only length-1 arrays can be converted to Python scalars
[FAIL] main_pp.py nemo # return-code=1
2020-06-05T09:58:08Z CRITICAL - failed/EXIT
What does either of these mean?
Thanks,
Charlie
Change History (11)
comment:1 Changed 10 months ago by grenville
comment:2 Changed 10 months ago by charlie
Hi Grenville,
Sorry, do you mean the "Restarting from archived restarts" section, rather than the "Restarting Failing Suites" section? I have done the former many many times, but I have not done the latter ever before, and it doesn't seem relevant to my particular error. Or am I not understanding something?
Charlie
comment:3 Changed 10 months ago by grenville
Hi Charlie
Sorry for the slow reply.
I did mean Restarting Failing Suites, but as you can no doubt tell, I am not certain that the suite will restart - when partial sum files go out of sync, it's hard to tell.
Starting a new suite from a suitable point in the past might be the better alternative. You should be able to judge which point in time to start from by looking at which diagnostics have been written successfully. The suite is configured to create decadal means - you might need to create those from monthly or seasonal means rather than backing up too far.
Grenville
comment:4 Changed 10 months ago by charlie
Hi Grenville,
Very many thanks, that's what I will do, I think.
Given that I am not particularly interested in decadal means, and even if I were, I could easily recreate these manually at a later date, can I not just go back to the previous year and restart from there? In other words, the model failed at year 2216, so can I not go back to 2215 (saving the original version of this, so I don't get a gap) and then run from there? I appreciate this might mean that this particular decadal mean is not calculated correctly, but as I said, that doesn't really matter to me, as long as it runs from the technical side.
Charlie
comment:5 Changed 10 months ago by grenville
Hi Charlie
Yes, that's probably OK. Bear in mind seasonal and yearly means too.
(thanks for the new key)
Grenville
comment:6 Changed 10 months ago by charlie
Hi Grenville,
Sorry for the delay in getting back to you. I have now done as you suggested, rewinding to 2215 (the year before the suite failed) and restarting from this. It successfully completed 2215, but has yet again failed at 2216 (where it failed before) giving me the following error:
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 03680] [c5-1c0s8n0] [Mon Jun 15 23:09:58 2020] PE RANK 1010 exit signal Aborted
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
[NID 03680] 2020-06-15 23:09:58 Apid 108351778: initiated application termination
[FAIL] run_model # return-code=137
2020-06-15T23:10:01Z CRITICAL - failed/EXIT
Again, I can't believe this is a science problem, because as I said this suite has run successfully for over 200 years without any problems whatsoever. So clearly, based on what you said above, something technical is still going wrong here.
Interestingly, this appears to be specific to this suite. I know this because I started another suite yesterday (bv241), from exactly the same starting point in 2215, and it is running fine and has got well beyond 2216.
Do you have any other ideas as to how I can get br871 running again?
Many thanks,
Charlie
comment:7 Changed 10 months ago by charlie
Hi Grenville,
Now that nexcs has (finally) returned, would you be able to take another quick look at the above problem?
Many thanks,
Charlie
comment:8 Changed 10 months ago by grenville
Charlie
It's not the same problem - it's a problem with the NEMO namelist_cfg file, which is out of sync. In general you should remove the work directories dated after the cycle you are starting at. The NEMO drivers look in the latest work directory (or did); they are not cycle aware (or weren't) - not sure what to advise just yet.
Grenville
comment:9 Changed 10 months ago by grenville
Charlie
It might have been simpler to have started a whole new suite for 2215 rather than starting br871 the way it was - please try deleting /home/d05/cwilliams/cylc-run/u-br871/work/22170101T0000Z and retrigger the 2216 coupled task.
Grenville
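The advice in comments 8 and 9 (remove work directories dated after the cycle being restarted) can be checked with a small dry-run script. This is only a sketch, not part of the suite or of Cylc/Rose: the function name later_cycle_dirs is hypothetical, and it assumes that work directories are named with fixed-width cycle points such as 22170101T0000Z, so that plain string comparison orders them correctly. It only lists candidates and deletes nothing.

```python
import re
from pathlib import Path

# Assumption: cycle directories are named like 22170101T0000Z (fixed width).
CYCLE_RE = re.compile(r"^\d{8}T\d{4}Z$")


def later_cycle_dirs(work_root, restart_cycle):
    """Return work directories dated after restart_cycle, oldest first.

    Hypothetical helper for illustration; it does not delete anything.
    """
    root = Path(work_root)
    if not root.is_dir():
        return []
    later = [
        entry
        for entry in root.iterdir()
        # Fixed-width cycle names compare correctly as plain strings.
        if entry.is_dir()
        and CYCLE_RE.match(entry.name)
        and entry.name > restart_cycle
    ]
    return sorted(later, key=lambda p: p.name)


if __name__ == "__main__":
    # Work directory path taken from the ticket; restart cycle is 2216.
    work = "/home/d05/cwilliams/cylc-run/u-br871/work"
    for path in later_cycle_dirs(work, "22160101T0000Z"):
        print("candidate for removal:", path)
```

Review the printed list by hand before removing anything, since other suites or tasks may need different handling.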
comment:10 Changed 10 months ago by charlie
Thanks Grenville, and I apologise for the delay - I wasn't able to check this until Jasmin returned properly earlier on. Anyway, that appears to have worked, and my suites are now running again.
I will close the ticket.
Many thanks,
Charlie
comment:11 Changed 10 months ago by charlie
- Resolution set to fixed
- Status changed from new to closed
Charlie
It appears that a partial sum file has gone wrong.
I can only suggest that you rerun from 22160101T0000Z. You will need to organise start files and history file as described in https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral (Restarting Failing Suites) — I think you have done that before. Then
rose suite-run --warm 22160101T0000Z
If this doesn't work, I can only suggest starting a new suite from a suitable point and sorting out climate means manually.
Grenville