Leveraging sparsity for efficient data manipulation: the GCXS sparse array format

Across many areas of science, sparsity is ubiquitous. Whether in the form of sparse data arrays, sparse linear operators, or the results from a sparsity-promoting optimization, data in the form of sparse vectors, matrices, and tensors are commonplace in scientific computing. In this post, I will highlight the pydata/sparse package and the recently added "Generalized Compressed Storage" (GCXS) sparse array format for working with large sparse data. For a more detailed demonstration of the GCXS sparse array format and operations, see the more technical section at the end of this post.

Quick introduction to sparse data

Brain map from three orthogonal angles colored red and orange on right, on the left Brain silhouette from three orthogonal angles with surrounding space colored white

Neuroimaging data is an interesting test case. MRI data is collected in a 3-dimensional grid and is often transformed into a standard space with specific dimensions. One standard, the Montreal Neurological Institute space (MNI152), has the dimensions (91, 109, 91). A brain in this space, however, typically occupies only a third of the full space, with the remainder of the array filled with zeros.

When working with more dimensions like time or research subjects, these datasets can get quite large and be computationally expensive. Shown above on the left is an averaged fractional anisotropy (FA) image. The image on the right illustrates all of the remaining space (in white), which is generally filled with zeros.

Sparse data structures, by definition, consist mostly of the same repeated value, typically zero. In the three matricies shown above, only 1% of the entries have a nonzero value. This feature can be exploited so as to reduce storage and runtime; sparse data is highly compressible because we can get away with only storing the nonzero values.

There are many ways to store the nonzeros of a sparse data structure, which are manifested in sparse storage formats and are efficient for certain operations. For example, a hashmap is well-suited for storing pairs of indices and their corresponding nonzero values and allows for efficient writing and mutating, but is poorly-suited for mathematical operations. By contrast, the compressed sparse row (CSR) storage format is well-suited for fast mathematical operations and compresses nicely but offers extremely inefficient writing and mutating capabilities.