Opened 4 years ago
Closed 4 years ago
#1732 closed help (answered)
xconv still segfaulting
Reported by: | iwi | Owned by: | jeff |
---|---|---|---|
Component: | UM Tools | Keywords: | |
Cc: | Platform: | Other | |
UM Version: | <select version> |
Description
Although upgrading xconv to 1.93 seemed to fix the segfaults when opening some PP files from Met Office NWP, I have a report from a user of some cases where it still segfaults with 1.93. The files are large, but a test case with a much smaller, subsetted file produces similar issues.
Please see ~iwi/for-jeff/test_xconv.pp on oak, which contains the first 37 PP records from one of the problematic files. It opens okay into cf-python (on another system - I couldn't find cf-python on oak):
>>> cf.read("test-xconv.pp")
[<CF Field: eastward_wind(model_level_number(25), latitude(1152), longitude(1536)) m s-1>,
 <CF Field: eastward_wind(model_level_number(2), time(3), latitude(1152), longitude(1536)) m s-1>,
 <CF Field: northward_wind(time(2), model_level_number(2), latitude(1153), longitude(1536)) m s-1>,
 <CF Field: UM_m01s01i230_vn805(latitude(1152), longitude(1536)) >,
 <CF Field: UM_m01s01i231_vn805(latitude(1152), longitude(1536)) >]
In xconv, it segfaults - and not only that, but it does not behave consistently. On oak, it sometimes segfaults with a stack trace and sometimes without one. On another system (on JASMIN at CEDA) it varies whether it segfaults on opening or on exit, and whether there is a stack trace; sometimes it opens but the information in the field list is corrupted (see screenshots). Something nasty seems to be happening with uninitialised values.
Please can you take a look.
Thanks,
Alan
Attachments (2)
Change History (6)
Changed 4 years ago by iwi
Changed 4 years ago by iwi
comment:1 Changed 4 years ago by iwi
comment:2 Changed 4 years ago by jeff
- Owner changed from um_support to jeff
- Status changed from new to accepted
Hi Alan
This problem occurs because of the way the fields are arranged in the pp file. test_xconv.pp has u-wind at 3 times: 2015/01/12:04.00, 2015/01/12:05.00, 2015/01/12:05.00. For the first 2 times the file has u-wind on 2 levels (model level 1 and model level 11), and for the 3rd time it has u-wind on 27 model levels. xconv first reads the pp file to work out the dimensions, but it assumes that all fields of a particular type have the same number of levels. In this case it thinks u-wind has 2 levels and 3 times, so when it tries to read the 27 levels it reads past the array bounds, and then random bad things happen.
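To illustrate the mismatch, here is a minimal Python sketch. It is not xconv's code: the (time, level) tuples, the integer time indices and the variable names are assumptions made purely for illustration, and it only counts array slots rather than reading a PP file.

```python
from collections import Counter

# Hypothetical (time index, model level) pairs for the u-wind records:
# two early times on model levels 1 and 11, then one time on 27 levels.
records = [(0, 1), (0, 11), (1, 1), (1, 11)]
records += [(2, level) for level in range(1, 28)]

times = sorted({t for t, _ in records})
levels_per_time = Counter(t for t, _ in records)

# Assumption being illustrated: every time has as many levels as the first.
assumed_levels = levels_per_time[times[0]]
assumed_slots = assumed_levels * len(times)

print("slots allocated under the assumption:", assumed_slots)
print("records actually present:", len(records))
print("records with nowhere to go:", len(records) - assumed_slots)
```

With this layout the assumption leaves room for 6 fields while 31 are present, which is consistent with reads running past the end of the array.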
Firstly I should ask: is this the way the file is meant to be? If so, then I would suggest using cf-python to convert the files to netCDF, as any major xconv development is unlikely to happen now. Alternatively, the first 2 times could be removed from the pp file if they aren't needed; xconv should then work ok.
Jeff.
comment:3 Changed 4 years ago by iwi
Jeff,
Thank you. That is very useful to know. I will discuss options with the user.
The test file I supplied is heavily truncated compared to the whole file, but I believe that the same issue would be at play. The situation can arise in the UM where the same diagnostic is set up to be written on many levels with one time frequency, and also on fewer levels but more frequently, so I suspect this is part of such a time series. David has obviously gone to some effort to ensure that cf-python can handle this, and when I wrote the C code to help optimise it, we made sure that this would still be the case. If you later come to look into supporting it in xconv, it is not overly difficult to detect that such a set of PP records needs to be expressed as more than one variable: sort the records by time and then by level, and then test whether they form a 2d grid in Z,T space. If they do not, it is trivial to break them up into many variables that will work (one per time, or one per level). It is rather harder to work out the *minimum* set of variables needed, which is what David's fancy Python code does.
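A minimal Python sketch of the grid test and the trivial split described above, assuming each PP record has already been reduced to a (time, level) pair. The function names and the integer time indices are illustrative only; they are not taken from xconv or cf-python, and this does not attempt the harder minimum-set problem.

```python
from itertools import product


def forms_grid(records):
    """Return True if the (time, level) pairs fill a complete Z,T rectangle."""
    pairs = sorted(records)  # sort by time, then by level
    times = sorted({t for t, _ in pairs})
    levels = sorted({z for _, z in pairs})
    # A full grid has exactly one record for every (time, level) combination.
    return pairs == sorted(product(times, levels))


def split_by_time(records):
    """Trivial fallback: one variable per time, each with its own level list."""
    groups = {}
    for t, z in sorted(records):
        groups.setdefault(t, []).append(z)
    return groups


# Hypothetical layout matching the test file: 2 levels at two times,
# then 27 levels at a third time.
records = [(0, 1), (0, 11), (1, 1), (1, 11)]
records += [(2, level) for level in range(1, 28)]

print(forms_grid(records))  # the combined set is not a rectangle
for t, levels in split_by_time(records).items():
    print(t, len(levels), forms_grid([(t, z) for z in levels]))
```

Splitting by time gives three variables here, each trivially a grid. The cf-python output in the description manages with two eastward_wind fields, which shows the gap between the trivial split and the minimum set.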
Anyway, for now, I'll treat it as a limitation in xconv, and explain this to the user.
Regards,
Alan
comment:4 Changed 4 years ago by jeff
- Resolution set to answered
- Status changed from accepted to closed
Thanks Alan, I'll close this ticket now.
Jeff.
Example with stack trace:
Example without stack trace:
And looking at the core dump: