Dask

A Brief Description of Dask Including Its Origins, What Makes It Unique, and Its Strengths

Summary

Dask is a Python package designed to provide a native Python-friendly approach for processing vast volumes of data on heavy-duty computing hardware. The project enables converting a data scientist’s straightforward Python solution that runs on a laptop to a scalable production-ready solution with minimal effort.

Python has a well-developed set of open-source libraries (e.g., NumPy, Pandas, Scikit-Learn). These libraries, while powerful, are designed for data analysis on small datasets, ones that fit within the memory & storage limits of a single PC. Many organizations & applications generate terabytes of data daily, well beyond the limits of a laptop or similar commodity hardware. Analyzing such “big data” meaningfully requires clusters of computers with many processors and distributed storage systems; otherwise, computations can take weeks, months, or even years on a data scientist’s laptop.

Unfortunately, it is not as simple as putting the data in the cloud or on a cluster and running the code again. Traditionally, programmers have had to learn the fundamentals of parallel computing and a host of complicated programming frameworks just to scratch the surface. These frameworks are often challenging to learn and to deploy (and are frequently tailored to bespoke hardware configurations). Porting an existing Python data analysis pipeline to a scalable solution for industrial-scale datasets typically requires tremendous amounts of software and data engineering expertise (with commensurate budgets).

Dask Design & Structure

Matthew Rocklin initially released Dask in 2014 while working at Continuum Analytics (now Anaconda). Since then, Dask has attracted a large community of users and open-source developers, and it is a serious contender among Python-friendly big-data technologies like PySpark. Rocklin recently founded a company, Coiled Computing, built around Dask that provides support, services, and education.

There are three overarching components to Dask: a high-level API, a low-level API, and a collection of schedulers. The high-level API comprises convenient data structures like Dask dataframes, Dask arrays, and Dask bags. The low-level API reveals the inner workings of the package through Dask Futures and Delayed objects that enable fine-grained control over the execution of tasks. Finally, the choice of Dask scheduler provides a convenient interface for sequencing the execution of tasks in a computation. Those who understand parallel computing can fine-tune the choice of scheduler; for everyone else, the default scheduler is sufficient for prototyping models before moving to production.
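To make this concrete, here is a minimal sketch of the low-level API and of switching schedulers; the load/clean/summarize functions, the toy data, and the file name are illustrative placeholders, not part of Dask itself:

    import dask

    # Low-level API: wrap ordinary Python functions as lazy tasks.
    @dask.delayed
    def load(path):
        return [1, 2, 3]          # placeholder for real I/O

    @dask.delayed
    def clean(records):
        return [r * 2 for r in records]

    @dask.delayed
    def summarize(records):
        return sum(records)

    # Calling delayed functions builds a task graph; nothing runs yet.
    result = summarize(clean(load("data.csv")))

    # The default scheduler is fine for prototyping...
    print(result.compute())

    # ...or a scheduler can be chosen explicitly for finer control.
    print(result.compute(scheduler="threads"))
    print(result.compute(scheduler="processes"))

The same task graph can later be handed to a distributed scheduler without changing the wrapped functions themselves.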

Most data scientists learn Dask through its high-level API. This API includes two very convenient data structures: Dask dataframes & Dask arrays. The fundamental design of these data structures closely mimics their counterparts from the standard PyData stack (Pandas dataframes & NumPy arrays respectively). This design significantly eases the transition to working with production-ready distributed computing systems for big data. Someone conversant with NumPy & Pandas alone can quickly start using big-data systems without having to struggle through the theory of parallel programming. As a consequence, they can easily convert existing data analysis pipelines prototyped on a laptop into scalable production-ready systems.
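For example, a Dask array or dataframe can be manipulated with essentially the same calls as its NumPy or Pandas counterpart; in this sketch, the CSV file pattern and column names are hypothetical stand-ins for a real dataset:

    import dask.array as da
    import dask.dataframe as dd

    # Dask array: a NumPy-like interface over many smaller chunks.
    x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
    col_means = x.mean(axis=0)           # lazy; nothing is computed yet
    print(col_means.compute()[:5])       # triggers the computation

    # Dask dataframe: a Pandas-like interface over many partitions.
    df = dd.read_csv("events-*.csv")     # hypothetical files larger than memory
    daily_totals = df.groupby("date")["amount"].sum()
    print(daily_totals.compute().head())

Aside from the explicit calls to .compute(), the code reads exactly like the NumPy & Pandas it replaces.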

Strengths of Dask

Dask has a number of strengths that have played a large role in its rapid adoption. As mentioned already, Dask provides simple ways to port NumPy & Pandas code to run on medium-to-large datasets. Moreover, through dask-ml, Scikit-Learn workflows can similarly be extended to medium-to-large datasets. Code can be prototyped with Dask on a single machine (e.g., a laptop), and the same code can then be applied to medium or large datasets on a cluster or in the cloud by configuring the scheduler appropriately. Most Python objects and non-standard algorithms can be parallelized using the low-level Dask API. The Dask scheduler provides convenient interfaces to cluster resource managers like Mesos, YARN, & Kubernetes. At the same time, configuring the task scheduler & deploying Dask across multiple machines involves relatively little maintenance overhead and configuration. Finally, Dask is implemented entirely in Python and can be installed quickly & easily with, e.g., conda or pip.
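A rough sketch of that laptop-to-cluster workflow follows; the scheduler address, file pattern, and column names are hypothetical, and Dask itself can be installed with, e.g., pip install "dask[complete]" dask-ml. The same script runs on a local cluster or a remote one simply by changing how the Client is created:

    import dask.dataframe as dd
    from dask.distributed import Client
    from dask_ml.linear_model import LogisticRegression

    # Prototype on a laptop: Client() starts a local cluster in-process.
    client = Client()
    # In production, point the same code at a remote scheduler instead, e.g.:
    # client = Client("tcp://scheduler-address:8786")

    # Hypothetical training data spread across many CSV files.
    df = dd.read_csv("features-*.csv")
    X = df[["x1", "x2"]].to_dask_array(lengths=True)
    y = df["label"].to_dask_array(lengths=True)

    # dask-ml mirrors the Scikit-Learn estimator API on Dask collections.
    model = LogisticRegression()
    model.fit(X, y)
    print(model.score(X, y))

The cluster behind the Client can in turn be provisioned through resource managers such as YARN or Kubernetes without touching the analysis code.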

For more help with Dask for your business, connect with us and we can give you a full systems evaluation to find out if you are getting the most out of your data science software.

Connect with us to talk about a technology evaluation today

