James Bourbeau was first introduced to Dask in graduate school, where he worked with large amounts of tabular data. He found that Dask could scale his existing Pandas workflow by changing just a few lines of code. James is finishing grad school in the spring of 2019 and has recently joined Quansight. He began contributing back to Dask by fixing documentation typos and small bugs, and was immediately welcomed by the Dask community. He has steadily grown his contributions to the project since then. While James is still a Dask user, most of his recent efforts are directed towards maintaining the project. To help get the word out about Dask, James gave a talk at the Madison Python Meetup on February 2nd, 2019 in Madison, Wisconsin.
Of the topics he covered at the meetup, two were particularly exciting to him: the Dask JupyterLab extension and Dask-ML. The JupyterLab extension lets users view the distributed scheduler's real-time diagnostics right alongside the notebook running their code. Of this, James added, “I've found this to be a really great user experience.” The other project, Dask-ML, is a Python library for scalable machine learning. It works with Dask and other existing libraries in the Python ecosystem to scale machine learning computations to larger datasets and problems. One benefit of Dask-ML is that, because it implements the widely used scikit-learn API, it looks and feels familiar to anyone who has used scikit-learn. By sharing the work being done in this space, the hope is that people will gain interest and either use Dask themselves or join in as part of the community.
Since this was the first talk about Dask at this meetup, James began with a general overview of what Dask is and why someone might want to use it. He then covered both the high-level (Dask Array and Dask DataFrame) and low-level (Dask Delayed) interfaces for constructing Dask task graphs. Next, he discussed how those task graphs are executed by either the local or the distributed scheduler, and the general performance characteristics of each. He then briefly showed off the real-time diagnostics and the concurrent.futures interface that the distributed scheduler offers. Finally, he ended with a demo in which he used Dask-ML to train a logistic regression model on the distributed scheduler. The materials James presented at the meetup are available on GitHub here. Anyone with comments or suggestions about the material should feel free to open an issue in that repository.
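The low-level interface and the scheduler choice can be sketched in a few lines (assuming Dask is installed; `inc` and `add` are toy functions standing in for real work):

```python
from dask import delayed

# Toy functions standing in for real work.
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Wrapping calls in delayed records a task graph instead of executing.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# Nothing has run yet; compute() hands the graph to a scheduler.
# Here the local threaded scheduler runs it; "processes" or the
# distributed scheduler could be swapped in without changing the graph.
result = total.compute(scheduler="threads")
print(result)  # 5
```

Separating graph construction from execution is what lets the same code run on a laptop's threads or a distributed cluster.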
What makes these meetups so great is that they give people the opportunity to learn from one another and broaden everyone’s horizons. James spoke to this point: “When the meetup topic is something I'm not familiar with, it's really interesting to see how Python is used in other domains. Even if I'm generally familiar with what's being presented, I always come away with some cool feature I didn't know about or a new, better way to do things.” If you would like to attend, details about all past and upcoming events for the Madison Python meetup are available here. Visit the meetup website to search for local Python or PyData groups that may already be in your area!