Updated: Apr 6
Imagine having a dataset of over 50 TB of compressed geospatial point data stored in flat files, and you want to efficiently filter data in a few zip codes for further processing. You can’t even open a dataset that large on a single machine using tools like pandas, so what is the best way to accomplish the filtering? This is exactly the problem one of our clients recently faced.
We addressed this challenge by spatially sorting the data, storing it in a partitionable binary file format, and parallelizing spatial filtering of the data all while using only open source tools within the PyData ecosystem on a commercial cloud platform. This white paper documents the potential solutions that we considered to address our client’s needs.
Figure 1: Filtering spatial data typically requires a large database to find only a few results.
Credits: NASA Earth Observatory images by Joshua Stevens, using Suomi NPP VIIRS data from Miguel Román, NASA's Goddard Space Flight Center
Our client was a small startup who needed to avoid large up-front infrastructure costs. This constrained our approaches to those possible via cloud providers like AWS, Azure, or GCP. We considered five potential approaches to meet the client’s needs, one of which employs a relational database, and four that use the PyData stack.
Relational Database with Geospatial Extension
Perhaps the first approach that comes to mind is the use of a relational database such as PostgreSQL and an extension like PostGIS which allows the use of spatial data types and queries on AWS Relational Database Service (RDS). The advantage of this approach is that it’s well established, but the strong disadvantage is cost.
Databases hosted on AWS RDS have a variety of costs, but in this case, with such a large amount of data, database storage costs dominate. Using RDS requires using more expensive database storage rather than using S3 storage. Table 1 compares the costs of RDS Database Storage and S3 Storage at the time of writing.