Friday 8 April 2016

algorithm - Explaining different Standard Deviation results from same data in ArcGIS Desktop and MS Excel?


I have done some interpolation in Geostatistical Analyst in ArcGIS and I got the Standard Deviation (SD) a bit different from the SD that I calculated in Excel Spreadsheet.


What algorithm does each package use? Why are the results different?



Answer



Introduction


Because this issue (of discrepancies in standard deviations, variances, or other statistical summaries) comes up periodically, especially when a thoughtful and careful GIS analyst checks their work, I thought it would be good to share the "forensic analysis" of the discrepancy so that readers can carry out similar checks in their own applications.



Analysis


Excel provides two kinds of standard deviation, STDEV and STDEVP. The first is the square root of an unbiased estimator of the variance (VAR); it is intended for making inferences about a population from a random sample thereof. The second computes the standard deviation of a set of numbers, the square root of the variance (VARP).


In both cases, the standard deviations are root mean squares about a mean. They differ in how the averaging is performed in finding the root mean square. Normally, when we average N numbers, we sum them and divide by N. That indeed gives the root mean square, where the numbers involved are the squares of the data expressed as displacements from their mean (their "residuals"). The unbiased estimator instead divides by N-1. (It uses the same mean to compute the residuals, though.)


It is now straightforward to work out the relationship between STDEV and STDEVP: multiplying VAR by N-1 and multiplying VARP by N both give the sums of squares of residuals. Therefore VAR = VARP * N / (N-1), whence


STDEV = STDEVP * SQRT(N / (N-1))

Evidently STDEV is always greater than STDEVP.


Notice that we can solve for N given both STDEV and STDEVP:


(STDEV / STDEVP)^2 = N / (N-1) = 1 + 1 / (N-1);
N = 1 + 1 / ((STDEV / STDEVP)^2 - 1).


Solution


Look now at the numbers in the question: the ArcGIS value is 0.336858 and the Excel value is 0.338572: it is the larger of the two. Perhaps ArcGIS is using STDEVP? Plugging these into the preceding formula gives


N ?=? 1 + 1 / ((0.338572 / 0.336858)^2 - 1) = 99.01726

That's off a little bit, but then again the two SDs, although computed to double precision, are reported only to six significant figures, so let's check the ArcGIS result by correcting it with a multiplication by Sqrt(N / (N-1)):


0.336858 * Sqrt(99 / 98) = 0.338572,

exactly correct to the precision given.


Conclusions



The ArcGIS SD (in this part of ArcGIS: the software is inconsistent in which SD it uses) likely is the same as Excel's STDEVP value, which is the "uncorrected" or "raw" or "population" standard deviation.


Advice


The values of STDEV and STDEVP are relatively close when N is large and get ever closer as N increases. Even for N = 10 they are only about 5% apart. Therefore the distinction between the two statistics is important (a) only for small datasets, (b) when they are being used as ingredients in complex formulas (which might amplify any differences), and (c) when your work is being checked, either by yourself or an independent party (such as a peer reviewer for a paper or an opposing expert in litigation). Therefore, except in these kinds of cases, you needn't be concerned about which formula is used.




For more on problems with SD calculations in ArcGIS please see Spatial Analyst Cell Statistics seems to give the wrong answer in ArcGIS 10.0. More discrepancies and bugs are documented on the old ESRI forums. Briefly, ESRI software tends to use STDEVP wherever it reports a "standard deviation," but not always; the exceptions appear to vary from one release to another; with certain large datasets (such as rasters) it may use only a portion of the data to compute statistics; and some portions of ArcGIS, such as using standard deviations for setting class breaks in a legend, appear (at least in some releases) to have mysterious errors.


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...