Ibis and OmniSci: A powerful combination for easy access to fast big data analytics with Python


# This article is intended for readers who are not experienced 
# Python developers. 
# If you are a developer and would like to know more about Ibis, 
# this article provides a more technical overview with code 
# examples.

To analyze big data easily and quickly, you need a convenient way to write the algorithms that access, manipulate, and perform calculations on very large datasets. You also need a database and computational hardware optimized for calculations on the many (possibly billions of) records in a big dataset.

Given the current popularity of Python for data analysis, you might imagine that the language comes with everything you need to do big data analysis quickly. But it doesn’t. In fact, Python “comes with” comparatively little for data analysis. By design, Python has a relatively small set of core features (just the generic structures required for a programming language), augmented with a standard library that provides many basic computing functions. One intention of this minimal design was that additional libraries would be created later to add the functions required for specific applications, such as database access. Many such libraries have been created, and Ibis is one of them.


Ibis - SQL Access for Python Programmers


Libraries for accessing databases with Python existed before Ibis. However, most of them require you to write queries in a language designed specifically for querying data; SQL is the most common example. These libraries provide a workable solution, but SQL is a very different language from Python. Shifting to SQL to write a database query while coding in Python is inconvenient for a few reasons.


Imagine writing a novel in English, except that all the dialog is written in Chinese. Even if you are fluent in both languages, shifting back and forth is likely to slow the writing process. It would also be harder for the people proofreading and typesetting your novel to do their jobs. Writing a data analysis program in Python while switching to SQL to access the data would be a similar exercise.
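To make the “two languages” problem concrete, here is a minimal sketch using Python’s built-in sqlite3 module and a small in-memory database. The table name, column names, and records are made up for illustration. Notice that the query itself is SQL text sitting inside a Python string, so Python tools cannot check it the way they check the rest of the program.

```python
import sqlite3

# A tiny throwaway database that lives only in memory (illustrative data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ana", 34), ("Ben", 19), ("Chen", 52)],
)

# The query is written in SQL, not Python; to Python it is just a string
query = "SELECT name FROM people WHERE age > 30 ORDER BY name"

# Back in Python: run the query and collect the results
names = [row[0] for row in conn.execute(query)]
conn.close()

print(names)
```

A typo inside the query string would only show up when the program runs and the database rejects it, which is one of the ways the language switch slows developers down.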


There are also Python libraries designed for analyzing datasets using only Python commands. The Pandas library is one that most Python developers are familiar with. Pandas allows you to load a dataset into computer memory and represent the records and fields of the database as rows and columns in a table. Pandas includes a set of functions for selecting, editing, and performing calculations on the data in the table. But Pandas loads the entire dataset into memory, and many big datasets are far too large to fit. Analysis of big datasets needs to be performed “in place,” where the database resides.
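The difference between “load everything, then calculate” and “calculate in place” can be sketched with Python’s built-in sqlite3 module. This is only a stand-in: the table and its 1,000 made-up values are tiny, and a real big dataset would live on a database server rather than in an in-memory SQLite database. The first approach copies every record into Python before doing any math, while the second asks the database for the answer and receives a single number.

```python
import sqlite3

# Stand-in database with illustrative values 1.0 through 1000.0
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?)",
    [(float(v),) for v in range(1, 1001)],
)

# Approach 1: copy every record into Python memory, then calculate.
# With billions of records, this list would not fit in memory.
all_values = [row[0] for row in conn.execute("SELECT value FROM readings")]
mean_in_memory = sum(all_values) / len(all_values)

# Approach 2: let the database calculate where the data resides.
# Only one number travels back to Python.
mean_in_place = conn.execute("SELECT AVG(value) FROM readings").fetchone()[0]
conn.close()

print(mean_in_memory, mean_in_place)
```

Both approaches give the same answer here; what changes is how much data has to be moved to get it.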


The people who wrote Ibis decided to use the familiar Pandas analysis model, instead of a “foreign language” like SQL, to enable big data analytics. They created a library that allows developers to write familiar Pandas-like instructions in Python. Ibis translates those instructions into the language required by the target database. Ibis is capable of interfacing with various database systems, including those that use SQL. This allows developers to write Pandas-like Python code to analyze big datasets without loading them into memory.
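The translation idea can be illustrated with a toy example. To be clear, this is not the Ibis API: the class `TableExpr` and its methods `filter` and `mean_sql` are hypothetical names invented for this sketch, and a real library handles far more cases. The point is only that the developer writes Python method calls, and the library produces the SQL text that the database will run.

```python
class TableExpr:
    """Toy stand-in for a table expression (hypothetical, not part of Ibis)."""

    def __init__(self, name, conditions=()):
        self.name = name
        self.conditions = tuple(conditions)

    def filter(self, column, op, value):
        # Record the condition and return a new expression.
        # Nothing is sent to a database and no data is loaded.
        condition = f"{column} {op} {value!r}"
        return TableExpr(self.name, self.conditions + (condition,))

    def mean_sql(self, column):
        # Translate the recorded steps into SQL text for the database
        sql = f"SELECT AVG({column}) FROM {self.name}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql


# The developer writes only Python...
people = TableExpr("people")
query = people.filter("age", ">=", 18).mean_sql("age")

# ...and the library produces the SQL.
print(query)
```

In this sketch the generated text is `SELECT AVG(age) FROM people WHERE age >= 18`, which a SQL database could then execute on its own copy of the data.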


OmniSci - Accelerated SQL Database


So Ibis handles the “easy access” part of this article’s title. Where does OmniSci come in? OmniSci has developed a database that takes advantage of computational hardware to make big data analysis very fast. To understand why their technology is important, it’s helpful to consider how big data analytics is different from traditional computing workloads.


Back when computers were first being developed, the priority was to create machines that could do computations that were very difficult for humans to do with calculators: things like planetary orbital mechanics calculations for space flight. So computer central processing units, or CPUs, were designed to be able to execute one massive computation at a time.


But in big data analysis, the priority is to perform millions or billions of smaller calculations as fast as possible. An example is calculating the average age from a dataset that includes half a billion birthdays. Over the years, CPUs have been modified to handle a broader set of tasks, with additional and more complex circuitry added, but they still carry the legacy of their cores being a single, big compute engine. Running big data analysis on these traditional CPUs can take hours, days, or sometimes weeks.
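The average-age example can be written out in a few lines of plain Python. The three birthdays and the reference date below are made up for illustration. What matters is the shape of the work: one small calculation per record, repeated for every record, followed by a simple sum and division.

```python
from datetime import date


def age_on(birthday, today):
    """Return a person's age in whole years on the given day."""
    years = today.year - birthday.year
    # Subtract one if the birthday has not happened yet this year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


# Illustrative records; a big dataset would have hundreds of millions
birthdays = [date(1990, 5, 17), date(1985, 12, 1), date(2001, 2, 28)]
reference_day = date(2020, 6, 30)

# One small, independent calculation per record
ages = [age_on(b, reference_day) for b in birthdays]
average_age = sum(ages) / len(ages)

print(ages, average_age)
```

Each age is simple to compute and does not depend on any other record, which is why this kind of workload benefits from hardware that can do many small calculations at once.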


Fortunately, at about the same time that big data was on the rise, so was computer generated imagery (CGI) and, surprisingly, the two are related computationally. To create computer generated graphics, it is necessary to calculate the color and brightness of millions of pixels in each image or video frame. The pixel characteristics are calculated based on how the light reflects off thousands or millions of surfaces in the image. These calculations are a very high volume of relatively simple, repeated calculations, much like big data analysis.


Initially, CGI was created using traditional CPUs. But as more and higher-resolution images were created, it was taking much too long to complete the calculations. To solve this problem, the graphics industry created specialized graphics processing units, or GPUs, that have hundreds or thousands of small compute engines, or cores, well suited for simpler, repeated calculations. This same architecture is ideal for doing simple calculations on billions of records in a big dataset. OmniSci has developed an analytics database that harnesses the capabilities of both CPUs and GPUs, which can greatly accelerate big data analytics.
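The “many small engines” pattern can be sketched in ordinary Python. This is only an analogy: it uses a handful of CPU threads from Python’s standard library rather than a GPU, the data is made up, and it says nothing about how OmniSci works internally. The idea is that the records are split into chunks, each worker does the same simple calculation on its own chunk, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # Each worker does the same simple job on its own slice of the data
    return sum(chunk), len(chunk)


# Illustrative "dataset": the numbers 1 through 100
ages = list(range(1, 101))

# Split the records into equal chunks, one per worker
chunk_size = 25
chunks = [ages[i:i + chunk_size] for i in range(0, len(ages), chunk_size)]

# Hand each chunk to a separate worker
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(partial_sum, chunks))

# Combine the partial results into the final answer
total = sum(s for s, _ in results)
count = sum(n for _, n in results)
average = total / count

print(average)
```

A GPU applies the same divide-and-combine idea with hundreds or thousands of cores instead of four workers.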


OmniSci’s database uses SQL as its access language, which brings us back to Ibis and why these two technologies together are a great solution for Python big data analytics. By using the Ibis library, developers can easily write data manipulation algorithms in Python and run them on an OmniSci database for very fast computation. We will be creating a series of technical posts on these technologies, so check back for more information about using Ibis and OmniSci.



© QUANSIGHT 2020
