Jim Bednar joined us in October of last year to discuss PyViz back on Open Source Directions and returned recently to discuss Datashader, which is one of the newer projects under the PyViz umbrella. Jim spent a decade as a researcher at the University of Edinburgh and now leads the PyViz development team at Anaconda, Inc. in Austin, Texas. He has been an active member of the open source data science community for many years with contributions to a wide range of libraries, including working as the lead for the Datashader project.
Datashader works as a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly. At the time of our interview in December 2018, it had 1,464 stars on GitHub and about 2,300 downloads/month across PyPI and conda.
Having too much data has been an issue ever since people began keeping records. Large datasets such as the United States Census records or earthquake epicenter logs should provide valuable insight to the scientists who study them. Unfortunately, with conventional charts and plots, there is a limit to how much information can be presented before the value begins to diminish. Overlapping data points can be misleading, and extreme variance between data values makes it very difficult to get a true picture of millions or billions of data points. Without visualizations, the human mind struggles to make sense of such data once it reaches a certain scale, yet traditional forms of visualization get worse as you try to display more data. Datashader solves this problem by working from the opposite extreme: it works best when more data is available.
Specifically, Datashader is designed to handle a theoretical case where you have infinite data, and where the set of data points approaches the underlying continuous distribution that the data would form if it were limitless. Under this assumption, as long as Dask can perform the computations on real hardware, Datashader can take any dataset and turn it into an image (an array of pixels, which is essentially a 2D histogram). This image can then be embedded in a plot that accurately represents the data's distribution, without requiring trial-and-error tweaking of plotting parameters. The end result is something that looks like a plot, but for data that would make ordinary plotting programs crash or generate unusable output (e.g. plots completely covered in data points). Without requiring any user-determined parameters, Datashader will, by default, render the data to the screen as faithfully as possible given the available resolution.
The potential implications are only limited by user creativity, but some current use cases are:
representations of geographic data like billions of GPS coordinates,
hundreds of thousands of shipping routes,
millions of airplane flights,
millions of grocery store shopping trips, and
someone has even used it to plot every object in the known universe.
Typical users include corporations and government organizations that produce their own large private datasets, as well as everyday people who like to experiment with public datasets to see what the results will look like, since Datashader can produce some visually stunning images.
Currently, the PyViz team at Anaconda, Inc. maintains this project, with funding from a variety of client projects that use Datashader. Anyone with big data can benefit from the further development of this project, and thus contributions from the community will be invaluable to creating a final product that caters to the needs of all potential end users. Any time, money, or feedback would be appreciated, and spreading the word about this amazing tool will help as well. For the demo provided by Jim, see the webinar on YouTube here.