Sunday 18 August 2019

vrt - Gdal won't read geotiff in parallel, but will read it in serial, and will read virtual raster in parallel. Why?


My algorithm takes a global dataset at 30m resolution, breaks it up into pieces, analyzes the pieces somehow, and writes the analyzed piece to disk. As all the pieces are independent, this is embarrassingly parallel.


I am using the joblib package to parallelize tasks. I have two global datasets: vrt.vrt (virtual raster) and tiff.tiff (geotiff). When I run a parallelized task that reads from vrt.vrt, I have no problems. However, when I run a parallelized task that reads from tiff.tiff, I get the following error message:



ERROR 1: LZWDecode:Wrong length of decoded string: data probably corrupted at scanline 368132
ERROR 1: TIFFReadEncodedStrip() failed.
ERROR 1: /path/landmasses.tif, band 1: IReadBlock failed at X offset 0, Y offset 368132
ERROR 1: GetBlockRef failed at X block offset 0, Y block offset 368132

The read command is the same for vrt.vrt and tiff.tiff:


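from osgeo import gdal

tiff = gdal.Open('tiffpath')  # opened once, before the parallel section
# ... start parallel (joblib); each parallel task then runs:
tiledata = tiff.ReadAsArray(
    xoff=int(xmin - npad),
    yoff=int(ymin - npad),
    xsize=int(ncol + 2 * npad),
    ysize=int(nrow + 2 * npad),
)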

However, when I run the same task on tiff.tiff in serial, there are no errors and the output files look correct.


So does anyone know why GDAL can't read from a GeoTIFF using multiple threads, but can read from a virtual raster?


EDIT:


In an effort to make progress, I built a virtual raster from the single GeoTIFF file, and parallelization works fine. The question remains, however.



Answer




I think the most likely reason for the different behavior is related to when and how the data is accessed.


In the case of the TIF, there is essentially a single TIF-access pointer that is shared between processes/threads. A read from thread A interferes with thread B, which is trying to read data through the same in-memory interface. This would explain the error about incorrect data length: thread A expects a certain amount and format of data to come in, but thread B has just moved where the data is coming from. Thread A receives data that is inconsistent with what it expects and concludes that the file is corrupted.


When using a VRT, on the other hand, the threads share the same information about where to find data (what files), but they aren't sharing the same TIF-level accessor. It isn't until data is actually requested that an accessor is instantiated. This means each thread is getting its own accessor. The test you performed where the single TIF is put inside a VRT is further evidence for this.
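If that explanation is right, one way to read the TIF directly in parallel would be to give every task its own accessor by opening the file inside the task rather than before the parallel section. The following is only a minimal sketch under that assumption, not a tested fix for the dataset in the question. The helper names (padded_window, read_tile, read_all_tiles) are hypothetical; the variable names follow the question. The imports sit inside the functions only so that the window arithmetic can be checked without GDAL or joblib installed.

```python
def padded_window(xmin, ymin, ncol, nrow, npad):
    """Build the ReadAsArray keyword arguments for one padded tile.

    Same arithmetic as in the question; windows that extend past the
    raster edge are not handled here.
    """
    return {
        "xoff": int(xmin - npad),
        "yoff": int(ymin - npad),
        "xsize": int(ncol + 2 * npad),
        "ysize": int(nrow + 2 * npad),
    }


def read_tile(path, window):
    """Open the raster inside the task so no handle is shared between tasks."""
    from osgeo import gdal  # imported here so each worker loads GDAL itself

    dataset = gdal.Open(path)  # a fresh dataset handle for this task only
    if dataset is None:
        raise RuntimeError("could not open " + path)
    data = dataset.ReadAsArray(**window)
    dataset = None  # drop the reference so GDAL can close the file
    return data


def read_all_tiles(path, windows, n_jobs=4):
    """Read every window in parallel; only the path is passed to the tasks."""
    from joblib import Parallel, delayed

    return Parallel(n_jobs=n_jobs)(
        delayed(read_tile)(path, window) for window in windows
    )
```

Calling read_all_tiles('tiffpath', [padded_window(xmin, ymin, ncol, nrow, npad), ...]) would then pass only the path and plain integers to the workers. Reopening the file for every tile adds some overhead, so in practice one might open it once per worker instead of once per tile.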

