Ibis and OmniSci: A powerful combination for easy access to fast big data analytics with Python

# This article is intended for readers who are not experienced 
# Python developers. 
# If you are a developer and would like to know more about Ibis, 
# this article provides a more technical overview with code 
# examples.

To analyze big data quickly and easily, you need a convenient way to write the algorithms that access, manipulate, and perform calculations on very large datasets. You also need a database and computing hardware optimized for calculations across the many records, possibly billions, in a big dataset.

Given how popular the Python programming language is for data analysis, you might imagine it comes with everything you need to do big data analysis quickly. It doesn’t. In fact, the Python language “comes with” comparatively little for data analysis. Python was designed with a relatively small set of core features, just the generic structures a programming language requires, augmented by a standard library that provides many basic computing functions. One intention of this minimal design was that additional libraries would be created later to supply the functions needed for specific applications, such as database access. Many such libraries now exist, and Ibis is one of them.

Ibis - SQL Access for Python Programmers

Libraries for accessing databases from Python existed before Ibis. However, most of them expect you to write your queries in a language designed specifically for databases, and SQL is the most common example. These libraries provide a workable solution, but SQL is a very different language from Python. Switching to SQL to write a database query in the middle of a Python program is inconvenient for a few reasons.

Imagine writing a novel in English in which all the dialog is written in Chinese. Even if you are fluent in both languages, shifting back and forth is likely to slow the writing process. It would also be harder for the people proofreading and typesetting your novel to do their jobs. Writing a data analysis program in Python while switching to SQL to access the data is a similar exercise.
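To make the analogy concrete, here is a minimal sketch of the mixed style, using the sqlite3 module from Python's standard library. The `flights` table, its columns, and the sample rows are made up for illustration. Notice that the analysis itself is a block of SQL stored as text inside the Python program, so Python cannot check it for mistakes until the database actually runs it.

```python
import sqlite3

# A small in-memory database stands in for a real data source.
# The table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, delay_minutes INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 12), ("AA", 30), ("UA", 5), ("UA", 45), ("DL", 0)],
)

# The analysis logic is SQL, embedded in the Python program as a string.
query = """
    SELECT carrier, AVG(delay_minutes) AS avg_delay
    FROM flights
    WHERE delay_minutes > 0
    GROUP BY carrier
    ORDER BY carrier
"""
rows = conn.execute(query).fetchall()

# Back in Python: print the average delay for each carrier.
for carrier, avg_delay in rows:
    print(carrier, avg_delay)
```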

There are also Python libraries designed for analyzing datasets using only Python commands. The Pandas library is one that most Python developers are familiar with. Pandas allows you to load a dataset into computer memory and represent the records and fields of the database as rows and columns in a table. It includes a set of functions for selecting, editing, and performing calculations on the data in that table. But Pandas loads the entire dataset into memory, and many big datasets are far too large to fit into memory. Analysis of big datasets needs to be performed “in place”, where the database resides.
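The difference between loading everything into memory and working “in place” can be sketched in a few lines. This example does not use Pandas; it uses the standard-library sqlite3 module as a stand-in, and the `sales` table and its values are invented for illustration. The first calculation copies every record into Python before adding them up, while the second asks the database to do the sum and return a single number.

```python
import sqlite3

# Hypothetical example data in a small in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Oslo", 10.0), ("Oslo", 15.0), ("Lima", 7.5)],
)

# In-memory style: copy every record into Python, then calculate.
# Memory use grows with the number of records in the table.
all_rows = conn.execute("SELECT city, amount FROM sales").fetchall()
total_in_memory = sum(amount for _city, amount in all_rows)

# In-place style: the database does the calculation where the data lives
# and sends back only the result.
(total_in_place,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()

print(total_in_memory, total_in_place)
```

With three records the two approaches are equally fast. With billions of records, the first may not fit in memory at all.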

The creators of Ibis decided to use the familiar Pandas analysis model, rather than a “foreign language” like SQL, to enable big data analytics. They created a library that allows developers to write familiar Pandas-like instructions in Python. Ibis then translates those instructions into the language required by the target database. Ibis can interface with a variety of database systems, including those that use SQL. This allows developers to write Pandas-like Python code to analyze big datasets without loading them into memory.
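The translation idea can be illustrated with a deliberately tiny sketch. This is not the Ibis API and not how Ibis is implemented. The `Table` class, its `filter` and `mean_sql` methods, and the `flights` data are all hypothetical, written only to show the general pattern: Python method calls record what you want, and the SQL text is generated for you at the end.

```python
import sqlite3


class Table:
    """A toy, lazily evaluated table expression (hypothetical, not Ibis)."""

    def __init__(self, name, conditions=()):
        self.name = name
        self.conditions = tuple(conditions)

    def filter(self, column, op, value):
        # Record the step; nothing is sent to the database yet.
        if op not in {"=", "<", ">", "<=", ">="}:
            raise ValueError(f"unsupported operator: {op}")
        return Table(self.name, self.conditions + ((column, op, value),))

    def mean_sql(self, column):
        # Translate the recorded steps into SQL text plus its parameters.
        # Table and column names are trusted in this toy example.
        sql = f"SELECT AVG({column}) FROM {self.name}"
        if self.conditions:
            clauses = [f"{col} {op} ?" for col, op, _ in self.conditions]
            sql += " WHERE " + " AND ".join(clauses)
        params = [value for _, _, value in self.conditions]
        return sql, params


# Hypothetical data to run the generated query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, delay_minutes INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 12), ("AA", 30), ("UA", 5)],
)

# The developer writes only Python; the SQL is produced by the library.
expr = Table("flights").filter("carrier", "=", "AA")
sql, params = expr.mean_sql("delay_minutes")
print(sql)

# The database does the work and returns a single value.
(avg_delay,) = conn.execute(sql, params).fetchone()
print(avg_delay)
```

A real library such as Ibis supports far more operations and many different databases, but the division of labor is the same: Python describes the analysis, and the database carries it out.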

OmniSci - Accelerated SQL Database