Episode #37: PyJanitor

Updated: Apr 3, 2020

Featuring Developer: Eric Ma

Episode #37

Air Date 06 March 2020

@12 PM Eastern

We will be joined by Eric Ma, developer on the PyJanitor project, who will tell us about the future of PyJanitor. pyjanitor is a project that extends Pandas with a verb-based API, providing convenient data cleaning routines for repetitive tasks. Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.

Data preprocessing usually consists of a series of steps that involve transforming raw data into an understandable/usable format. These series of steps need to be run in a certain sequence to achieve success. We take a base data file as the starting point, and perform actions on it, such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows and others. More formally, these steps along with their relationships and dependencies are commonly referred to as a Directed Acyclic Graph (DAG).

The pandas API has been invaluable for the Python data science ecosystem, and implements method chaining of a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more, are accomplished via the appropriate pd.DataFrame method calls.

Inspired by the ease-of-use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users. To accomplish this, actions for which we would need to invoke imperative-style statements, can be replaced with method chains that allow one to read off the logical order of actions taken.