Thursday 19 April 2018

spatial analyst - RMSE between two rasters step-by-step


Can anyone show, step by step, how to calculate the RMSE (root mean square error) between the following two rasters, and discuss the minimum and maximum values of the result and how to interpret them?


 First raster (original, 2 by 2):
1 2
3 4


Second raster (obtained, 2 by 2):
2 2
4 1

Answer



Calculation




  1. Subtract one raster from the other. (The direction of subtraction does not matter.)


    -1 0
    -1 3





  2. Square the result.


    1 0
    1 9




  3. Average the values.


    (1 + 0 + 1 + 9)/(1 + 1 + 1 + 1) = 11/4.


    (I wrote this in a suggestive way to show how missing-data cells can be handled if your GIS does not have this capability: create an indicator grid with 1's where you have data and 0's elsewhere, then divide the sum of your grid by the sum of the indicator grid. In Spatial Analyst you can get the sums as zonal sums.)





  4. Take the square root.


    Sqrt(11/4) = 1.66
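
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the grids are held as lists of rows, the function name `rmse` and the `nodata` marker are hypothetical, and the `count` variable plays the role of the sum of the indicator grid described in step 3.

```python
import math

def rmse(a, b, nodata=None):
    """Root mean square difference between two equally shaped grids (lists of rows)."""
    sum_sq = 0.0  # running sum of squared differences (steps 1 and 2)
    count = 0     # sum of the indicator grid: 1 where both cells have data
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            if x == nodata or y == nodata:
                continue  # skip missing-data cells
            sum_sq += (x - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("no cells with data in both grids")
    return math.sqrt(sum_sq / count)  # steps 3 and 4

original = [[1, 2], [3, 4]]
obtained = [[2, 2], [4, 1]]
print(round(rmse(original, obtained), 2))  # 1.66
```

In a real GIS workflow the same arithmetic would be done with the raster tools themselves; the loop here only makes the order of operations explicit.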




Interpretation


This number is a measure of the typical cell-by-cell difference between the two grids. When the grids have hundreds of values or more (as most do), do not exhibit huge extremes or outlying values, and have an average difference of zero, the standard rule of thumb for interpreting the rmse is:





  • About 2/3 of all the cells will differ by less than the rmse.




  • About 95% of all the cells will differ by less than twice the rmse.




  • It will be unusual to see differences more than three times the rmse.




In a large grid (e.g., a million cells), "unusual" can still mean a few thousand cells: a fraction of one percent of all of them.



In the example--which is trivially small--knowing there are 4 cells and the rmse is 1.66, we would think "about 2/3--say 2 or 3--of the cells agree to within 1.66. Probably all of them agree to within 2*1.66 = 3.32." The actual state of affairs, as we can see from the result of step (1), is that 3/4 of the cells agree to within 1.66 and all of them indeed agree to within 3.32 (the largest difference is 3).
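
The check described in this paragraph can be written out as a short sketch. The helper name `fraction_within` is hypothetical, and the differences are simply the result of step (1) flattened into a list.

```python
import math

def fraction_within(diffs, k, rmse):
    """Share of cells whose absolute difference is at most k times the rmse."""
    limit = k * rmse
    return sum(1 for d in diffs if abs(d) <= limit) / len(diffs)

diffs = [-1, 0, -1, 3]  # result of step (1), flattened
rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))

print(fraction_within(diffs, 1, rmse))  # 0.75
print(fraction_within(diffs, 2, rmse))  # 1.0
```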


When the grids vary wildly and exhibit huge ranges of values, you might mistrust the rules of thumb. From Chebyshev's inequality you still know that




  • No more than 1/4 of the cells differ by more than twice the rmse.




  • No more than 1/9 of the cells differ by more than three times the rmse.





  • In general, pick any number k equal to 2 or greater. No more than 1/k^2 of the cells differ by more than k times the rmse.




This is a universal rule, valid for any pair of grids, whereas the previous rule of thumb assumes the distribution of cell differences is roughly "bell shaped" without many extreme outliers.
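
As a rough illustration of the Chebyshev bound, the sketch below uses made-up differences that are deliberately far from bell shaped (99 exact matches and one large miss) and compares the share of cells beyond k times the rmse with 1/k^2. The helper name `fraction_beyond` and the data are assumptions chosen only for this example.

```python
import math

def fraction_beyond(diffs, k):
    """Share of cells whose absolute difference exceeds k times the rmse."""
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return sum(1 for d in diffs if abs(d) > k * rmse) / len(diffs)

# Made-up differences: 99 cells agree exactly, one cell is off by 100.
diffs = [0] * 99 + [100]

for k in (2, 3, 5):
    share = fraction_beyond(diffs, k)
    # The last column reports whether the bound 1/k^2 holds for this k.
    print(k, share, share <= 1 / k**2)
```

Here the rmse is 10, so the single outlying cell lies beyond 2, 3 and 5 times the rmse, yet it is only 1% of the cells, well under each bound.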


Edit


The preceding interpretations assume you are comparing two grids intended to represent the same thing, up to measurement error, so that their average difference is zero (or near enough to it). When the average difference is appreciable (compared to the rmse), these interpretations are incorrect--but then it also rarely makes sense to use the rmse. Instead, one would (a) report the average difference and (b) subtract its square from the result of step (3). This gives the mean square residual rather than the mean square difference. Its square root is the typical size of variations between the two grids relative to their average difference. With this caveat, the interpretation can use the same rules of thumb as before.
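
A minimal sketch of this adjustment, applied to the differences from step (1), might look as follows. The function name `difference_summary` is hypothetical; the arithmetic is just the mean difference, the mean square from step (3), and their combination as described above.

```python
import math

def difference_summary(diffs):
    """Return (average difference, square root of the mean square residual)."""
    n = len(diffs)
    mean_diff = sum(diffs) / n                    # (a) the average difference
    mean_square = sum(d * d for d in diffs) / n   # result of step (3)
    # (b) subtract the squared average; clamp at 0 to guard against rounding.
    mean_square_residual = max(mean_square - mean_diff ** 2, 0.0)
    return mean_diff, math.sqrt(mean_square_residual)

mean_diff, residual_rms = difference_summary([-1, 0, -1, 3])
print(mean_diff)               # 0.25
print(round(residual_rms, 2))  # 1.64
```

For this tiny example the average difference (0.25) is small compared to the rmse (1.66), so the adjusted value barely changes.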

