Pandas

A Brief Description of Pandas, Including Its Origins, What Makes It Unique, and Its Strengths and Weaknesses

Summary

Pandas is an open-source Python library whose development began in 2008 with support from AQR Capital Management. In 2015 it gained sponsorship from NumFOCUS. It is primarily used for data manipulation through its DataFrame object, and it provides integrated tools for reading and writing formats such as CSV, Excel, SQL, and HDF5. It loads data from these sources and organizes it with automatic data alignment and handling of missing data. It can also merge data sets, and its performance-critical code paths are written in Cython or C. Its users come from many professional fields, including medicine, finance, and education.
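
As a rough illustration of the alignment, missing-data, and merging behavior described above, here is a minimal sketch. The names and numbers are made up for the example and do not come from any real data set.

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical illustration data).
q1 = pd.Series({"alice": 10.0, "bob": 20.0, "carol": 30.0})
q2 = pd.Series({"bob": 5.0, "carol": 7.0, "dave": 9.0})

# Arithmetic aligns on index labels; a label missing from either side yields NaN.
total = q1 + q2

# A fill value can stand in for the missing side instead of producing NaN.
total_filled = q1.add(q2, fill_value=0.0)

# Merging two DataFrames on a shared key column (a left join keeps every person).
people = pd.DataFrame({"name": ["alice", "bob", "carol"], "team": ["a", "b", "a"]})
scores = pd.DataFrame({"name": ["bob", "carol", "dave"], "score": [5, 7, 9]})
merged = people.merge(scores, on="name", how="left")

print(total)
print(total_filled)
print(merged)
```

In this sketch, "alice" and "dave" appear in only one of the two Series, so their entries in `total` are missing, while the left join leaves the score for "alice" empty rather than dropping the row.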

Origin and Versions

Wes McKinney created Pandas while working at AQR and, before leaving the firm, convinced it to release the project as open-source software. Drawing on his experience with Python data software and his work on this project, Wes also published a book titled Python for Data Analysis.

Community development of the Pandas project is more active than ever. Pandas version 1.0 was officially released on January 29, 2020, and the project continues to see regular updates. The developers maintain an extensive roadmap of planned features and are always looking for contributors to help make them a reality. The current list of planned work is on the roadmap page of the Pandas website.

Pandas Design, Philosophy, and Culture

Pandas actively encourages and promotes diversity and inclusion on its development team. The project's internal governance, headed by Wes, aims to create a welcoming environment for new contributors, particularly those who bring fresh perspectives from under-represented demographics.

Usability is of paramount importance to the Pandas project. Its website includes a guide, "10 minutes to pandas," intended to get new users working with the library in about ten minutes. Extensive guides and tutorials mean that anyone looking to start organizing spreadsheets and databases can begin working without prior knowledge or outside help.

The community is also guided by a strong code of conduct and two committees. The members of the team, the code of conduct committee, and the NumFOCUS committee are all listed on the Pandas website. Currently, there are over 1500 volunteer contributors, and that number continues to grow.

Structure and Features

The mission of the Pandas project is stated as follows: "pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language". To this end, the developers have built an open-source project that is easy to integrate into other projects. As a result, Pandas is substantial in its own right and has also been adopted in ways its creators might not have imagined.

Perhaps Pandas' greatest feature is the flexibility with which custom DataFrames can be built. These objects can standardize data from various source formats so that the data can be used across platforms. Cross-compatibility between software and file formats has long been an issue for data scientists, especially in the early days, when data had to be translated manually and checked for correctness. The benefits of this capability have been expanded through integrations with other open-source projects such as Spyder, where users can edit the contents of a DataFrame much like a spreadsheet, with copying, pasting, sorting, and so on.
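
To make the idea of standardizing different source formats concrete, the following sketch loads the same kind of data from CSV text and from a SQL table, then combines the two. The table name, column names, and values are assumptions invented for this example, and an in-memory SQLite database stands in for a real one.

```python
import io
import sqlite3

import pandas as pd

# Source 1: CSV text (an in-memory stand-in for a file on disk).
# The second row deliberately has no temperature value.
csv_text = "city,temp_c\nOslo,4.5\nCairo,\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Source 2: a table in an in-memory SQLite database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("Lima", 19.0), ("Pune", 27.5)],
)
from_sql = pd.read_sql_query(
    "SELECT city, temp_c FROM readings ORDER BY city", conn
)
conn.close()

# Both sources are now DataFrames with the same columns, so they stack directly.
combined = pd.concat([from_csv, from_sql], ignore_index=True)
print(combined)
```

Once both sources are DataFrames, the rest of the analysis no longer depends on where the data came from, which is the cross-compatibility benefit described above.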

Pandas has become such an important tool that an ecosystem has grown up around it. Many newer open-source projects use Pandas as part of their toolkit to offer more advanced data science capabilities. These dependent projects cover statistics, machine learning, visualization, IDEs, and APIs, among other areas.

Strengths and Weaknesses

There are many benefits to using Pandas. First and foremost, it offers an alternative to R for users who would rather work in Python, and it provides a simple way to do data analysis. Pandas code is also easy to read compared with equivalent code in Java or C. Despite this simplicity, it can process large quantitative datasets in a relatively short amount of time.
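
A small sketch of what this simplicity can look like in practice: whole-column arithmetic followed by a grouped total, with no explicit loops. The `sales` table and its column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for this example.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units": [3, 5, 2, 4, 1],
    "unit_price": [10.0, 8.0, 10.0, 8.0, 12.0],
})

# Column arithmetic is applied to every row at once, without an explicit loop.
sales["revenue"] = sales["units"] * sales["unit_price"]

# Split-apply-combine: total revenue per region.
revenue_by_region = sales.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

The same two lines of analysis apply unchanged whether the table has five rows or several million, although actual running time depends on the data and the machine.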

Pandas is limited in the kinds of inputs it can handle, but because the project is under active development, compatibility may well broaden in the future. At present, however, a number of other projects are better suited to operations such as statistical modeling or working with n-dimensional arrays.

For more help with Python for your business, connect with us and we can give you a full systems evaluation to find out if you are getting the most out of your data science software.

© QUANSIGHT 2020
