A spark of Python genius

According to the TIOBE Index, between January 2018 and January 2019 Python rose to become the third most popular programming language, overtaking C++ (fourth) and trailing only C (second) and Java (first). The same index named Python its programming language of the year because of its rapid growth across many industries and applications. However, there are also some great tools designed for Java that could make life much easier for Python users who deal with massive datasets. The desire to bring big data capabilities to Python users led to the creation of a new API designed to expose the Spark programming model to Python. This API is called PySpark.

The orange and black PySpark logo

Holden Karau focuses most of her effort at Google these days on Apache Spark. While she did not start the project, it has continued to grow and develop through her efforts. PySpark was started at a time when high-quality big data tools for Python were scarce. Since then, its scope has widened to let Python users access JVM libraries. The opposite is also possible with PySpark: Python code can now be accessed from JVM libraries and from Spark. This sort of interoperability comes in handy in a workplace that is heavy on data science and data engineering, where you may want to use a variety of tools in your workflow and need a consistent way of gluing them together.

When thinking about open source projects, people often forget about those who maintain them. PySpark is an Apache Software Foundation project. The benefit of this is that no single company supports PySpark. Rather, through Apache, many companies and organizations help support it. Holden, a primary contributor and an employee of Google, represents the extensive cooperation behind the project. Spark itself is a very large project, and the Python portion is a small subset of that broader initiative. Part of the reason fewer people work on this portion is that it requires knowledge of Java, Scala, and Python! In addition to Holden, there are currently only three people working on it as regular, paid contributors: Davies Liu, Hyukjin Kwon, and Bryan Cutler. The full list of committers and Project Management Committee (PMC) members is published by the Apache Spark project.

The PySpark data science cheat sheet in the form of a table with different headers corresponding to different kinds of operations

With everything that is planned for this project, there is more need than ever for people to jump in and help support it. Currently, the contributors come largely from the JVM side, so it would be helpful for more Python-oriented programmers to help with some of the wishlist items being dreamed up. One such item is expanded multi-language support, which matters because it could mean that less code needs to be ported between languages. The PySpark contributors would also like the shared memory buffer to be more reliable, which would make multi-language pipelines more useful. In addition, the SQL interface should be made more efficient and usable as a multi-language pipeline, for example by implementing Types. There is also a need for better streaming support, though the best way to provide it has not been determined. One promising option for a new streaming engine is a low-guarantee system that is very fast but does not hold as high a standard for keeping track of your data. Another option is a middle ground that tracks your data better while still attempting to maintain decent speeds.

There is currently a lot of work being done on the future of machine learning with PySpark. These developments are too extensive to cover in detail here, so if you wish to stay up to date, head over to the PySpark news page. If you would like to contribute, there are many ways to help out, including sorting through the hundreds of open pull requests. Find out how you can help most by visiting the project's GitHub page or the Apache Spark community page. Additional code review videos are also available.
