I have generated zonal statistics using feature zone data on a value raster. This shapefile contains about 20000 polygons and the analysis went without a hitch.
While testing quality, I ran the same operation on 10 polygons taken from the same data. The values in zonal max changed this time. I compared the values by overlaying the polygons on the value raster and found that the pixel values belong in the polygon, but just not the maximum pixel.
I am aware that an internal raster conversion takes place for computing stats. I want to know what exactly happens during the process of zonal statistics and whether sample size affects the output.
I have ArcGIS 10 with an Arcview license and spatial analyst.
Answer
Zonal stats ultimately is a raster-on-raster operation. When the polygons are provided in vector format, they will automatically be converted to raster format. I would hope that this would be the same cellsize and registered with the value raster, but I'm not sure of that. (It would take some difficult reverse engineering to check. Possibly the polygons are converted using whatever spatial analysis properties are in effect and then later are resampled for the zonal stats operation. That would be silly, but it's not hard to see how the software might be engineered that way.)
(The conversion to raster format explains why, by the way, you cannot directly perform zonal stats with overlapping polygons. Each raster cell can represent at most one polygon at a time.)
Historically, Spatial Analyst has had a hard time with zonal stats. Some relatively recent versions (as recently as 9.3, I recall) raised a lot of questions on the ESRI forums concerning wrong answers, missing values, and outright failures when more than about 2000 polygons were involved. Some of these (if not all) I suspect may have been due to misunderstandings about how to use zonal stats and what it does. In particular, the conversion from vector polygon to raster assigns a cell to a polygon if and only if the cell's center is contained in the polygon. (How the decision is made when the center lies on the polygon's boundary is another undocumented mystery requiring some tedious reverse engineering.) When a tortuous polygon manages to thread its way around all the nearby cell centers, that polygon simply disappears from the results altogether!
There's even more going on under the hood and it, too, has been changing from version to version of the software. When on-the-fly projection is involved, the polygons might or might not be reprojected before conversion to raster. So might the raster itself, for that matter. Precisely what happens may depend on the environment: whether this is in the Raster Calculator, the command line, a Python script, a tool, or whatever.
Being aware of all this is more than half the battle, because we can protect ourselves from the software. The best way is to create a raster of the polygons that has the same cellsize, with registered cells, as the value raster, and has the same projection. Disable any automatic resampling or on-the-fly projection. In this way you minimize the invisible "help" the software is giving you and have a reasonable chance of knowing precisely what the inputs to the calculation are. If you do that and still find discrepancies, then (IMHO) you have ample basis for a formal bug report.
(By following these procedures religiously, I have had no problems with zonal stats in any version of Spatial Analyst (1.0 through 9.3), even with very large grids and large numbers of polygons. For example, I obtained correct values for a set of Census blocks throughout a large US state at a resolution of 10 meters. (That would have been over 10^5 polygons, I recall.) However, I sometimes take extraordinary steps to preclude problems: I accomplished that Census block calculation by automatically breaking the state into its 50 or so counties, performing the calculation for each county, and then reassembling the results. Among other things, this allowed parallelization of the computation. It also allowed me to handle the tiny polygons, which weren't captured during the rasterization process, in a separate step whereby the value grid was queried at the polygon centroids. These values served as surrogates for the zonal means being computed.)
Incidentally, cell size ("sample size"?) affects some output in important ways. When you resample the value raster,
The number of value cells within a polygon will change.
Individual values are interpolated, causing a slight change in their overall statistical distribution. This can include changes in maxima and minima, which tend to shift towards the mean.
Statistics related to commonality and frequency (such as a mode) can change entirely.
Some geometric summaries, especially perimeter, can change arbitrarily.
Many fancier statistical calculations, such as standard errors of estimates or regressions, can be entirely misleading due to the arbitrary changes in numbers of cells (see #1).
No comments:
Post a Comment