Thursday, 2 January 2020

"Oddities" in the Shapefile technical specification


I've been writing a shapefile parsing library, and have encountered a couple of design decisions in the specification that I don't immediately understand. I'm hoping there's a wizened old ESRI developer around here who can tell me why these things are the way they are.





  1. The main record file (.shp) is of mixed endianness. Specifically, parts of the header use big-endian byte ordering, but the records are all little-endian. I typically work at a higher level than bytes and bits, but everything I've read so far about endianness marks this as unusual. Why isn't the file specified to be of uniform endianness?




  2. The "File Length" field, like other length and position fields, is recorded in units of 16-bit words instead of the more standard (from my limited perspective) 8-bit bytes. How was this decision reached?




I posted a similar question on Stack Overflow, but didn't get any response. If this seems too off topic to other people, I could support closing it.



Answer



The development of shapefiles was concurrent with the development of ArcView, which was specifically designed to be platform independent. (In fact, that turned out to be its downfall: by relying on an interface developed in a platform independent GUI called "Neuron Data," it could not take advantage of many Windows capabilities. It ended up reflecting the worst of all the systems it was marketed for.) Although the shapefile specification was weird from the beginning, it made a loopy sort of sense within this design framework: because shapefiles were intended for many platforms, their specification should not favor any one of them and therefore should be equally obnoxious to programmers of all persuasions.



The second question appears to be based on an assumption that is not true. For instance, the "File Length" field appears at byte offset 24 in the main header and is a (signed) four-byte (32 bit) integer, as it must be in order to represent a length of up to 2^31-1. It is preceded by a four-byte "File Code" and five more four-byte fields reserved for future use: when you're reserving such space, of course you want to make the fields as large as reasonably possible, which at the time was 32 bits, in order to maintain the greatest possible flexibility. It helps, too, to align numeric fields in a file on word boundaries: machine-level code to parse them is a little easier to write and it can avoid potential (subtle) problems with upper-level compilers that might automatically pad their STRUCTs to align with words or doublewords.
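To make the layout concrete, here is a minimal Python sketch of reading the 100-byte main file header. The function name `parse_shp_header` and the dictionary keys are hypothetical, chosen only for illustration. The sketch assumes the layout in the published specification: seven big-endian 32-bit integers (File Code, five unused fields, File Length), followed by little-endian values (version, shape type, and eight doubles for the bounding box). Note that the width of the File Length field (32 bits) is separate from the unit of the value it stores, which the specification gives in 16-bit words, as the question observes.

```python
import struct

HEADER_SIZE = 100  # the main file header is a fixed 100 bytes


def parse_shp_header(header: bytes) -> dict:
    """Parse a .shp main file header (illustrative sketch, not a full reader)."""
    if len(header) < HEADER_SIZE:
        raise ValueError("expected at least 100 bytes of header")

    # Bytes 0-27: seven big-endian 32-bit signed integers:
    # File Code, five unused fields, then File Length at byte offset 24.
    file_code, *_unused, length_words = struct.unpack(">7i", header[:28])

    # Bytes 28-35: two little-endian 32-bit signed integers.
    version, shape_type = struct.unpack("<2i", header[28:36])

    # Bytes 36-99: eight little-endian doubles (bounding box and Z/M ranges).
    bbox = struct.unpack("<8d", header[36:100])

    return {
        "file_code": file_code,
        "file_length_words": length_words,      # value as stored, in 16-bit words
        "file_length_bytes": length_words * 2,  # converted to bytes
        "version": version,
        "shape_type": shape_type,
        "bbox": bbox,
    }
```

A caller would typically read the first 100 bytes of the file in binary mode and pass them in, for example `parse_shp_header(open(path, "rb").read(100))`, where `path` is whatever .shp file is being inspected. Using explicit `>` and `<` prefixes in the `struct` format strings keeps the result independent of the byte order of the machine running the code.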

