Parallelize and Accelerate Your Pandas DataFrames

A Python library to achieve that with only one line of code

Nate Dong, Ph.D.
4 min read · Oct 7, 2020
Photo by David Becker on Unsplash

Pandas Library

Pandas is the most popular library for data wrangling and processing in Python. Its many functions make data manipulation and transformation simple and flexible. But Pandas is known to have issues with scalability and efficiency.

By default, Pandas executes its functions in a single process on one CPU core, so it does not take advantage of the other cores, or the full computing power, of your system. With large datasets or heavy computations, Pandas becomes very slow.

Modin Library

Modin is a lightweight library designed to parallelize and accelerate Pandas DataFrames by automatically distributing the computation across all of the system’s available CPU cores. Modin claims a nearly linear speed-up in the number of CPU cores on your system, for Pandas DataFrames of any size [1].

Modin divides a Pandas DataFrame into parts so that each part can be sent to a different CPU core. It partitions the DataFrame across both rows and columns, which lets its parallel processing scale to DataFrames of any size and shape.
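To make the idea of row-and-column partitioning concrete, here is a minimal sketch in plain Python. It is not Modin's actual partitioning scheme; the helper names `split_ranges` and `partition_grid` are hypothetical and exist only for illustration. It computes the index ranges of each block when a table is cut into a grid.

```python
from math import ceil


def split_ranges(length, parts):
    """Split range(length) into up to `parts` contiguous (start, stop) chunks."""
    if length == 0:
        return []
    size = ceil(length / parts)
    return [(start, min(start + size, length)) for start in range(0, length, size)]


def partition_grid(n_rows, n_cols, row_parts, col_parts):
    """Return a 2D grid of ((row_start, row_stop), (col_start, col_stop)) blocks."""
    return [
        [(rows, cols) for cols in split_ranges(n_cols, col_parts)]
        for rows in split_ranges(n_rows, row_parts)
    ]


# A 10-row, 6-column table cut into a 2 x 2 grid of blocks.
grid = partition_grid(n_rows=10, n_cols=6, row_parts=2, col_parts=2)
for row_of_blocks in grid:
    print(row_of_blocks)
```

Each block in the grid could then be handed to a separate worker, which is the property the paragraph above describes.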

Pandas DataFrame vs. Modin DataFrame

The above figure is an example. As illustrated, a Pandas DataFrame is stored as one block and can only be sent to one CPU core. A Modin DataFrame is partitioned across rows and columns, and each partition can be sent to a different CPU core, up to the number of cores on the system.

Modin is a drop-in replacement for Pandas and parallelizes most of the Pandas API. It integrates with your existing Pandas code: you do not need to know how many cores your system has or how the data is distributed. Once you have changed your import statement, you can use Modin just as you use Pandas.

Modin Installation

You can install Modin using pip or conda.

To install using pip:

pip install modin

To install using conda:

conda install -c conda-forge modin

Computation Engine

Modin is a layer of abstraction over Ray and Dask, two different parallel computation engines. If you do not have Ray or Dask installed, you need to install Modin together with one of them. To do so, add the engine of your choice to the pip install command:

pip install "modin[ray]"    # Install Modin with the Ray engine
pip install "modin[dask]"   # Install Modin with the Dask engine
pip install "modin[all]"    # Install Modin with both Ray and Dask

Once the Modin library is installed, you need to replace the Pandas import statement in your code. The statement import pandas as pd can be replaced by import modin.pandas as pd.
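If the same script has to run on machines where Modin may not be installed, one option is to fall back to plain Pandas. The sketch below is an assumption about how you might structure that, not something Modin provides; `import_first_available` is a hypothetical helper name.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in `module_names` that imports cleanly, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer Modin; fall back to plain Pandas if Modin is not installed.
pd = import_first_available(["modin.pandas", "pandas"])
print("Using:", pd.__name__ if pd is not None else "no DataFrame library found")
```

The rest of the code keeps referring to `pd`, so nothing else has to change whichever library is picked.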

Modin is smart enough to detect your installed engine, but if you want to choose a specific compute engine to run on, you can set the environment variable MODIN_ENGINE and Modin will perform computation with that engine:

export MODIN_ENGINE=ray     # Modin will run on Ray
export MODIN_ENGINE=dask    # Modin will run on Dask

This can also be done within the code before you import Modin:

import os
os.environ["MODIN_ENGINE"] = "ray"    # Modin will run on Ray
# or: os.environ["MODIN_ENGINE"] = "dask"    # Modin will run on Dask
import modin.pandas as pd
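If you would rather choose the engine based on what is installed, the following sketch checks for each package before setting the variable. This is an illustration only and not how Modin itself detects engines; `pick_engine` is a hypothetical helper, and it checks only that the top-level package is importable.

```python
import importlib.util
import os


def pick_engine(preferred=("ray", "dask"), is_installed=None):
    """Return the first engine name whose package is importable, else None."""
    if is_installed is None:
        # Default check: can Python locate a top-level package with this name?
        def is_installed(name):
            return importlib.util.find_spec(name) is not None
    for name in preferred:
        if is_installed(name):
            return name
    return None


engine = pick_engine()
if engine is not None:
    # This must run before `import modin.pandas as pd`.
    os.environ["MODIN_ENGINE"] = engine
print("MODIN_ENGINE =", os.environ.get("MODIN_ENGINE"))
```

The order of `preferred` decides which engine wins when both are present.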

Performance Comparison

We first load a CSV file (1.2GB) on a laptop with 4 CPU cores. The code itself is exactly the same for both Pandas and Modin.

# Read the dataset with Pandas
import time
import pandas as pd
s = time.time()
df = pd.read_csv("my_dataset.csv")
e = time.time()
print("Pandas Loading Time = {}".format(e-s))

# Read the dataset with Modin
import modin.pandas as pd
s = time.time()
df = pd.read_csv("my_dataset.csv")
e = time.time()
print("Modin Loading Time = {}".format(e-s))

Output:
Pandas Loading Time = 24.7s
Modin Loading Time = 6.6s

According to the output, Modin is able to achieve a speed-up of 3.7x for data loading.
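The speed-up figure is simply the ratio of the two timings. The short sketch below reproduces it from the numbers reported above and also computes parallel efficiency (speed-up divided by core count), a metric the original measurement does not report. The function names are hypothetical.

```python
def speedup(baseline_seconds, parallel_seconds):
    """How many times faster the parallel run is than the baseline."""
    return baseline_seconds / parallel_seconds


def parallel_efficiency(baseline_seconds, parallel_seconds, n_cores):
    """Fraction of the ideal (linear) speed-up achieved on `n_cores` cores."""
    return speedup(baseline_seconds, parallel_seconds) / n_cores


# Loading times reported above: Pandas 24.7 s, Modin 6.6 s, on 4 cores.
load_speedup = speedup(24.7, 6.6)
load_efficiency = parallel_efficiency(24.7, 6.6, n_cores=4)
print(f"speed-up: {load_speedup:.1f}x")                   # speed-up: 3.7x
print(f"efficiency on 4 cores: {load_efficiency:.0%}")    # efficiency on 4 cores: 94%
```

An efficiency close to 100% is what "nearly linear speed-up" means in practice.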

In the following code, we concatenate five copies of the same DataFrame.

import time
import pandas as pd
df = pd.read_csv("my_dataset.csv")

s = time.time()
df = pd.concat([df for _ in range(5)])
e = time.time()
print("Pandas Concatenation Time = {}".format(e-s))

import modin.pandas as pd
df = pd.read_csv("my_dataset.csv")

s = time.time()
df = pd.concat([df for _ in range(5)])
e = time.time()
print("Modin Concatenation Time = {}".format(e-s))

Output:
Pandas Concatenation Time = 3.69s
Modin Concatenation Time = 0.058s

It is a 63.6x speed-up! We can see large gains by efficiently distributing the computation across the entire machine.
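The benchmarks above repeat the same start/stop timing pattern. A small context manager can remove that repetition. This is a sketch, not the code used for the measurements above: `timed` is a hypothetical helper, the workload is a stand-in, and it uses `time.perf_counter`, which is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum of squares", timings):
    # Stand-in workload; replace with pd.read_csv(...) or pd.concat(...).
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print("{} Time = {:.4f}s".format(label, seconds))
```

Running the same block once with `import pandas as pd` and once with `import modin.pandas as pd` then gives two entries that can be compared directly.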

Conclusion

Modin is actively developed and has a bright future that includes plans to provide a SQL API on top of Pandas. Enjoy faster data analysis by using this simple and lightweight library.

For more information, examples, and future work, please check out the Modin documentation.

Thank you for reading!

References

[1] George Seif, How to Speed up Pandas by 4x with one line of code, https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html
