Wednesday, 6 September 2017

raster - How to efficiently access files with GDAL from an S3 bucket using VSIS3?


So, GDAL has recently added a new feature that allows random reading of files in an S3 bucket. I am looking to crop portions of an image from multiple tiles without having to download the whole file. I've only seen very sparse documentation on how to configure and access an S3 bucket through GDAL, and I'm a little confused about how to begin. Would someone be kind enough to provide an extremely short example/tutorial on how one would go about setting up the virtual filesystem for GDAL in order to accomplish this goal? Bonus pts if your solution allows it to be scripted via Python!


To clarify: We already have it done in Python. The issue with Python is that you have to download the whole image to operate on it. The newest version of GDAL has support for mounting the S3 bucket, so that if we need to, say, crop a small portion of the image, we can operate directly on that smaller portion. Alas, as the feature was only released on the stable branch in January, I haven't found any documentation on it. So the solution should use the /vsis3 system in the newest release of GDAL, or otherwise smartly use the system to prevent the user from needing to download the entire image to an EBS drive to operate on it.


That is to say, the bounty will be awarded to an answer that uses the VSI APIs found in the newest versions of GDAL, so that the whole file does not need to be read into memory or onto disk. Also, the buckets we use are not always public, so many of the HTTP tricks being posted won't work in many of our situations.



Answer



I've found when something isn't particularly well documented in GDAL, that looking through their tests can be useful.


The /vsis3 test module has some simple examples, though it doesn't have any examples of actually reading chunks.


I've cobbled together the code below based on the test module, but I'm unable to test it, as /vsis3 requires AWS credentials and I don't have an AWS account.



"""This should read from the Sentinal-2 public dataset
More info - http://sentinel-pds.s3-website.eu-central-1.amazonaws.com"""

from osgeo import gdal
import numpy as np

# These only need to be set if they're not already in the environment,
# ~/.aws/config, or you're running on an EC2 instance with an IAM role.
gdal.SetConfigOption('AWS_REGION', 'eu-central-1')
gdal.SetConfigOption('AWS_ACCESS_KEY_ID', 'MY_AWS_ACCESS_KEY_ID')
gdal.SetConfigOption('AWS_SECRET_ACCESS_KEY', 'MY_AWS_SECRET_ACCESS_KEY')
gdal.SetConfigOption('AWS_SESSION_TOKEN', 'MY_AWS_SESSION_TOKEN')

# 'sentinel-pds' is the S3 bucket name
path = '/vsis3/sentinel-pds/tiles/10/S/DG/2015/12/7/0/B01.jp2'
ds = gdal.Open(path)

band = ds.GetRasterBand(1)

xoff, yoff, xcount, ycount = (0, 0, 10, 10)

np_array = band.ReadAsArray(xoff, yoff, xcount, ycount)
