Zarr - scalable storage of tensor data for use in parallel and distributed computing

SciPy 2019 submission.

Short summary

Many scientific problems involve computing over large N-dimensional typed arrays of data, and reading or writing data is often the major bottleneck limiting speed or scalability. The Zarr project is developing a simple, scalable approach to storage of such data in a way that is compatible with a range of approaches to distributed and parallel computing. We describe the Zarr protocol and data storage format, and the current state of implementations for various programming languages including Python. We also describe current uses of Zarr in malaria genomics, the Human Cell Atlas, and the Pangeo project.

Abstract

Background

Across a broad range of scientific disciplines, data are naturally represented and stored as N-dimensional typed arrays, also known as tensors. The volume of data being generated is outstripping our ability to analyse it, and scientific communities are looking for ways to leverage modern multi-core CPUs and distributed computing platforms, including cloud computing. Retrieval and storage of data is often the major bottleneck, and new approaches to data storage are needed to accelerate distributed computations and enable them to scale on a variety of platforms.

Methods

We have designed a new storage format and protocol for tensor data [1], and have released an open source Python implementation [2, 3]. Our approach builds on data storage concepts from HDF5 [4], particularly chunking and compression, and hierarchical organisation of datasets. Key design goals include: a simple protocol and format that can be implemented in other programming languages; support for multiple concurrent readers or writers; support for a variety of parallel computing environments, from multi-threaded execution on a single CPU to multi-process execution across a multi-node cluster; pluggable storage subsystem with support for file systems, key-value databases and cloud object stores; pluggable encoding subsystem with support for a variety of modern compressors.

Results

We illustrate the use of Zarr with examples from several scientific domains. Zarr is being used within the Pangeo project [5], which is building a community platform for big data geoscience. The Pangeo community have converted a number of existing climate modelling and satellite observation datasets to Zarr [6], and have demonstrated their use in computations using HPC and cloud computing environments. Within the MalariaGEN project [7], Zarr is used to store genome variation data from next-generation sequencing of natural populations of malaria parasites and mosquitoes [8] and these data are used as input to analyses of the evolution of these organisms in response to selective pressure from anti-malarial drugs and insecticides. Zarr is being used within the Human Cell Atlas (HCA) project [9], which is building a reference atlas of healthy human cell types. This project hopes to leverage this information to better understand the dysregulation of cellular states that underly human disease. The Human Cell Atlas uses Zarr as the output data format because it enables the project to easily generate matrices containing user-selected subsets of cells.

Conclusions

Zarr is generating interest across a range of scientific domains, and work is ongoing to establish a community process to support further development of the specifications and implementations in other programming languages [10, 11, 12] and building interoperability with a similar project called N5 [13]. Other packages within the PyData ecosystem, notably Dask [14], Xarray [15] and Intake [16], have added capability to read and write Zarr, and together these packages provide a compelling solution for large scale data science using Python [17]. Zarr has recently been presented in several venues, including a webinar for the ESIP Federation tech dive series [18], and a talk at the AGU Fall Meeting 2018 [19].

References

Authors

Project contributors are listed in alphabetical order by surname.