Open Source Natural Language Processing Libraries To Get You Started

Updated: Mar 6


Fatma Tarlaci compares features of three natural language processing tools NLTK, spaCy, and SparkNLP

The relatively recent advancements in open source Artificial Intelligence (AI) technologies is revolutionary and demonstrable of producing unrivaled value to computing. Chief among the subsets of AI is Natural Language Processing (NLP), which constitutes my primary focus and area of expertise. I am also an ardent advocate of Open Source Software (OSS), so it follows that I believe AI ought to be both transparent and inclusive. NLP not only fits this belief, it has arguably been one of the domains of AI that has benefited from the openness and ubiquity of OSS the most.


In this post, I will explain the subtle nuances among popular OSS NLP libraries, namely Natural Language Toolkit (NLTK), SpaCy, and Spark NLP, and comment on their relative performance and specific applications. My preference for discussing these three libraries is informed by the following observations: how intuitive the library seems to be, their general performance, their overall strengths, and how well-maintained their open source repositories are.


The goal with this post is to assist those who are just getting started with NLP, whether your motivation is for your own recreational learning or whether it is to enhance the profitability of your business.


OSS in AI


Although the genesis of AI dates back to the 1950s — when Alan Turing published his seminal paper, “Computing Machinery and Intelligence,” hypothesizing whether machines possess the capacity to think, learn and demonstrate reason, and Frank Rosenblatt invented the “perceptron algorithm,” formulating the basis for binary classification — it was not until recently that AI made a quantum leap.


Notwithstanding the significance of such AI pioneers’ contributions, the field of AI had remained largely stagnant for decades, suffering through what was called the AI Winter. The recent advancements in AI, particularly in Deep Learning and its wide range of applications, have become possible not only because of the unprecedented computing power of modern graphics processing units (GPUs) but also thanks to OSS, representing remarkable collective intelligence in the service of AI.


Community-driven open source libraries, such as scikit-learn and pandas, as well as company-backed open source AI frameworks, like TensorFlow and PyTorch, have accelerated AI advancement. They have enhanced the ability of AI practitioners, machine learning engineers, computer and data scientists to create and make use of AI applications in industry, academia, and scientific research. The impact of OSS in AI is also indicated when considering that Python has become the language of choice to represent AI. This is greatly due to Python’s extensive scientific computing ecosystem, which is mainly powered by the flexibility of OSS.