Monday 20 January 2020

arcgis desktop - Choosing Shapefile attribute data type to use?


I am trying to figure out when to use each of the data types offered in the shapefile properties: Text, Short Integer, Long Integer, Float, and Double.


I am using ArcView 9.3. I know what to use when the values are all text, all dates, or all numbers.


But which type do you use when the value is "2N" or "7.5" or "7%"?



Answer



When your purpose is to store what was written down, text is good. The other formats provide additional capabilities for data processing and analysis, while restricting or enhancing what can be stored and what can be done with the values. This has many important implications that any competent user of a GIS or database must know:





  • Restricting the format, such as to short integer, makes it impossible to store certain kinds of typographical mistakes. You cannot store "7N" or "7.5" or "312792001" as short integers. This is a good feature to have when subsequent work relies on error-free data.




  • Different data formats support different kinds of maps and graphics. You cannot display text values using a graduated color scale, for instance. Integer fields can be treated as separate categories, with a different graphical symbol automatically assigned to each. Floats (which include doubles) require binning into intervals for certain kinds of maps, such as choropleth maps.




  • Database queries work in subtly different ways. Floating point imprecision can cause quiet failures in queries like "SELECT 'Value' >= 5.2", because 5.2 cannot be represented exactly as a float or double. Thus you might think a particular instance of the field 'Value' equals 5.2, and it might display as equal to 5.2, but in a query the computer might decide it is less than 5.2.





  • Values in different formats sort differently, which affects reports and legends. For instance, the numbers 4, 9, 11 (which are in increasing order by value) would sort in the order "11", "4", "9" as text. This can mystify some GIS users.




  • The meaning of some operators changes according to format. In many databases and GISes, "11" + "4" will result in "114" (string concatenation) whereas 11 + 4 results in 15 (addition of numerical values). More subtly, integer division differs from floating point division: 11/4 as an integer division results in 2, whereas its value as a division of floats results in 2.75.




  • The precision of storage varies by format. If you try to store financial data in doubles, for instance, you will not be able to represent decimal fractions like $12.34 exactly. This can cause rounding problems in summaries, leading to imbalances in accounting systems (not good!). Decimal encoding was invented to overcome this problem. (An encoded decimal version of $12.34 might be stored internally either as 1234, with an implicit decimal point, or as the text string "12.34", where no precision is lost.)




  • Some formats can overflow or underflow. A text field can record a value like "10^2345", but such a value would overflow any numerical field, resulting in undefined behavior (perhaps giving a null value, or perhaps giving a surprising value such as 0, which is what would happen if this value were computed as a short or long integer). The result of 20000 + 15000, for signed short (16 bit) integers, is -30536, which could be a real surprise but might not be flagged as a problem by the software.





  • Conversion to other formats depends on the starting format. For example, in some systems an integral 0 will likely convert to False as a Boolean (logical) value, but a text "0" very well could convert to True (because it is nonempty).




  • Certain formats, especially dates and times, provide specialized capabilities for converting to and from text representations, for computation, display, mapping, and reporting.




  • Different formats require different amounts of storage in a dataset. A double needs 8 bytes (and an ArcGIS long integer 4 bytes), for instance, whereas the text representation of a double might require over 20 bytes. Integers can often be compressed natively on-the-fly (as in ESRI's integer grid storage type), sometimes providing 99+% compression automatically. These requirements, in turn, can restrict how many fields and how many records a dataset is capable of storing. The shapefile format and the older MS Access formats were limited by total number of bytes on disk, so using efficient formats for data could make the difference between successfully storing all available data or not.
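Several of the pitfalls in this list can be reproduced in a few lines. The following is a minimal Python sketch, not ArcGIS code: it assumes that a Float field behaves like a 32-bit IEEE float and that a short integer wraps around on overflow, which real software may or may not do. The helper wrap_int16 is hypothetical and exists only for illustration.

```python
import struct
from decimal import Decimal

# Query pitfall: 5.2 stored as a 32-bit float is slightly below the double 5.2.
stored = struct.unpack("<f", struct.pack("<f", 5.2))[0]
query_matches = stored >= 5.2           # False, although both display as 5.2

# Sorting: the same values order differently as text and as numbers.
text_order = sorted(["4", "9", "11"])   # ['11', '4', '9']
number_order = sorted([4, 9, 11])       # [4, 9, 11]

# Operators: "+" concatenates text but adds numbers.
concatenated = "11" + "4"               # '114'
added = 11 + 4                          # 15
int_quotient = 11 // 4                  # 2 (integer division)
float_quotient = 11 / 4                 # 2.75 (floating point division)

# Precision: binary doubles cannot hold most decimal fractions exactly,
# so sums drift; a decimal type keeps them exact.
double_total = sum([0.1] * 10)          # not exactly 1.0
decimal_total = sum([Decimal("0.1")] * 10)


def wrap_int16(value):
    """Hypothetical helper: keep only what a wrapping signed 16-bit field would."""
    return (value + 32768) % 65536 - 32768


# Overflow: results that do not fit are silently replaced by other values.
short_sum = wrap_int16(20000 + 15000)   # -30536
huge_as_short = wrap_int16(10 ** 2345)  # 0
huge_as_double = float("1e2345")        # inf

# Conversion: the outcome depends on the starting format.
zero_int_as_bool = bool(0)              # False
zero_text_as_bool = bool("0")           # True, because the string is nonempty

# Storage: a double is 8 bytes; its text form can be much longer.
double_bytes = struct.calcsize("<d")
text_bytes = len(repr(-1.2345678901234567e-100))
```

None of these operations raises an error or warning, which is why such problems tend to surface late, in a query result or a report total rather than at data entry.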





Selecting the data type for any field requires considering all these issues and compromising among their effects: there is no universal "best" solution. It can be a painful, costly surprise to choose wrongly at the outset and spend resources building a database, only to discover that it cannot support required operations and has to be rebuilt from scratch. Even after the design decisions have been made, correct use of the data requires constant awareness of the pitfalls listed here so that the data are processed without corruption and correctly displayed, sorted, graphed, mapped, and analyzed.
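For the specific values in the question, one rough way to frame the choice is to ask which is the narrowest type that stores every sample unchanged. The sketch below is only a heuristic under stated assumptions: the function suggest_field_type and its return labels are hypothetical, and the 16-bit and 32-bit signed ranges are assumed for short and long integers. It ignores the display, query, and storage trade-offs described above, which still have to be weighed by hand.

```python
def suggest_field_type(samples):
    """Hypothetical helper: narrowest type that stores every sample unchanged."""

    def all_parse(convert):
        # True when every sample converts without a ValueError.
        try:
            for sample in samples:
                convert(sample)
        except ValueError:
            return False
        return True

    if all_parse(int):
        values = [int(sample) for sample in samples]
        if all(-2**15 <= v < 2**15 for v in values):
            return "SHORT"
        if all(-2**31 <= v < 2**31 for v in values):
            return "LONG"
        return "TEXT"  # too large for the assumed integer ranges
    if all_parse(float):
        return "DOUBLE"
    return "TEXT"


print(suggest_field_type(["2N"]))         # TEXT
print(suggest_field_type(["7.5"]))        # DOUBLE
print(suggest_field_type(["7%"]))         # TEXT
print(suggest_field_type(["312792001"]))  # LONG
```

A value such as "7%" lands in a text field only if it must be stored exactly as written. If it needs to be mapped or summed, one option is to store the number 7 (or 0.07) in a numeric field and record the unit elsewhere, for example in the field name.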

