Faster Than Pandas


Last Updated on August 15, 2022 by Jay

Do you find the pandas library slow when handling large amounts of data, and want to make your programs go faster? There are some best practices (e.g. vectorization) for working with pandas, but what we’ll discuss here can make our code run much faster even beyond adopting those best practices.

tl;dr – we need to use other data processing libraries in order to make our program go faster. Do not be discouraged as these libraries all have a similar syntax to pandas, so it’s pretty easy to learn how to use a new framework!

Why Pandas Is Slow

The pandas library is already pretty fast, thanks to the underlying numpy array data structure and C code. However, Python code runs on a single CPU thread by default, and that is what makes pandas feel slow. As the below screenshot shows, when I run pandas code with the default settings, most of the CPU cores are just doing nothing; only a few (highlighted red) are at work. As a result, overall CPU utilization is only 9%.

CPU cores are idling

One way to make our code run faster is to utilize multiple CPU cores at once, i.e. multi-processing. So we want our CPU utilization to look like the following – every single core (and the RAM) maxing out close to 100%!

Multi-processing CPU
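
To make the idea concrete, below is a minimal sketch (my own illustration, not code from the benchmarks that follow) of the multi-processing approach: split a DataFrame into chunks and let each CPU core work on one chunk using Python’s built-in multiprocessing module.

import multiprocessing as mp

import numpy as np
import pandas as pd

def double_values(chunk):
    # Placeholder per-chunk work; swap in the real transformation
    return chunk * 2

if __name__ == '__main__':
    df = pd.DataFrame(np.random.rand(1_000_000, 5))
    # Split the frame into one chunk per core, then process the chunks in parallel
    chunks = np.array_split(df, mp.cpu_count())
    with mp.Pool() as pool:
        results = pool.map(double_values, chunks)
    df = pd.concat(results)

Hand-rolled multi-processing like this adds its own overhead for splitting and pickling the data, which is exactly the kind of work the libraries below take care of for us.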

Three Data Analysis Libraries That Are Faster Than Pandas

We are going to take a brief look at the following three Python libraries which are capable of running at lightning speed:

  • polars – a fast DataFrame library implemented in the Rust programming language, using the Apache Arrow columnar memory format (I’m not going to pretend I know what this means…)
  • datatable – closely related to R’s data.table library
  • modin – uses all available CPU cores to run pandas; basically a drop-in replacement for pandas (see the import sketch after the install commands below)

Install Libraries

## polars
pip install polars

## datatable
pip install datatable

## modin
pip install modin[all]
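
Once installed, each library can be imported and used in a very pandas-like way. Below is a minimal sketch (the file name is just a placeholder; modin also needs one of its execution engines, Ray or Dask, both of which modin[all] installs):

import polars as pl
import datatable as dt
import modin.pandas as mpd  # drop-in replacement for the pandas API

df_pl = pl.read_csv('data.csv')   # polars DataFrame
df_dt = dt.fread('data.csv')      # datatable Frame
df_md = mpd.read_csv('data.csv')  # modin DataFrame, used just like pandas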

Benchmark Testing

We will run several benchmark tests using the four libraries on the same machine and see how they compare.

We’ll use the following four randomly generated datasets:

Number of Rows    Number of Columns    File Size
1 million         9                    120 MB
10 million        9                    1.6 GB
50 million        9                    7.9 GB
100 million       9                    16.4 GB

Sample data files

We created each dataset twice (with different random values) in order to test the performance of the merge() function. I expect merge to be the most challenging operation, since it matches up two datasets of equal size. In the most extreme test, it will be matching two 50 million row datasets.

In case you are interested and want to perform the same tests, below is the code to generate these datasets:

import pandas as pd
import numpy as np

def create_dataset(n_rows, n_cols, f_name):
    # Random float columns plus three id columns used by the tests below
    df = pd.DataFrame(np.random.rand(n_rows, n_cols - 3))
    df['integer_id'] = np.random.randint(1, 10000, size=n_rows)
    df['text_id'] = df['integer_id'].astype(str) + '_ID'
    df['unique_id_text'] = df.index.astype(str) + '_ID'
    df = df.sample(frac=1)  # shuffle the rows
    df.to_csv(f_name)

rows = 10_000_000
cols = 10

create_dataset(rows, cols, '10_million_1.csv')
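
Since each dataset is created twice for the merge test, the second file can be generated with the same helper; the file name below simply mirrors the pattern above:

create_dataset(rows, cols, '10_million_2.csv')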

Here are the tests we performed on the four datasets; a rough sketch of what these operations look like in code follows the list. We ran each test twice and took the average time spent as the final result.

  1. Read CSV file
  2. Apply a function to a column
  3. A simple column calculation (vectorized)
  4. Vertically concatenate datasets (previously done with the append method)
  5. Merge datasets (using a common key column)
  6. Filter rows by value
  7. Filter rows and assign new values
  8. Groupby data then calculate sum of groups
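
To show what these operations look like in code, here is a rough sketch in pandas and polars (not the exact benchmark script), using the columns from the generated files: 'integer_id', 'text_id' and 'unique_id_text', plus the random float columns, which come back from the CSV with names like '0'. The timed helper is my own illustration of the run-twice-and-average setup; datatable and modin are omitted since modin reuses the pandas syntax.

import time

import pandas as pd
import polars as pl

def timed(label, func):
    # Run the operation twice and report the average, mirroring the benchmark setup
    runs = []
    for _ in range(2):
        start = time.perf_counter()
        result = func()
        runs.append(time.perf_counter() - start)
    print(f'{label}: {sum(runs) / len(runs):.2f}s')
    return result

# 1. Read CSV file
pdf = timed('pandas read_csv', lambda: pd.read_csv('10_million_1.csv'))
pldf = timed('polars read_csv', lambda: pl.read_csv('10_million_1.csv'))
pdf2, pldf2 = pd.read_csv('10_million_2.csv'), pl.read_csv('10_million_2.csv')

# 2. Apply a function to a column (row-by-row, slower than vectorized code)
applied_pd = pdf['integer_id'].apply(lambda x: x * 2)
applied_pl = pldf.select(
    pl.col('integer_id').map_elements(lambda x: x * 2, return_dtype=pl.Int64)
)  # older polars versions call this .apply

# 3. Simple vectorized column calculation
calc_pd = pdf['0'] * 2
calc_pl = pldf.select(pl.col('0') * 2)

# 4. Vertically concatenate the two equally sized datasets
stacked_pd = pd.concat([pdf, pdf2])
stacked_pl = pl.concat([pldf, pldf2])

# 5. Merge them on a common key column
merged_pd = pdf.merge(pdf2, on='unique_id_text')
merged_pl = pldf.join(pldf2, on='unique_id_text')

# 6. Filter rows by value
hits_pd = pdf[pdf['integer_id'] > 5000]
hits_pl = pldf.filter(pl.col('integer_id') > 5000)

# 7. Filter rows and assign new values
pdf.loc[pdf['integer_id'] > 5000, '0'] = 0.0
capped_pl = pldf.with_columns(
    pl.when(pl.col('integer_id') > 5000).then(0.0).otherwise(pl.col('0')).alias('0')
)

# 8. Groupby, then calculate the sum of each group
sums_pd = pdf.groupby('text_id')['0'].sum()
sums_pl = pldf.group_by('text_id').agg(pl.col('0').sum())  # .groupby in older polars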

Testing Results – Which Library Is Faster Than Pandas?

We will use pandas as the performance baseline and compare the other three libraries against it. It was difficult to replicate the apply and merge functions in the datatable library, so we skipped those in our tests. The results below show the time in seconds required for each operation in each of the four libraries.

1 Million Rows

1 Million Row Test

10 Million Rows

10 Million Row Test

We see similar results for the tests performed on the 1 million row and 10 million row datasets.

  • The polars library wins in all tests except the apply function, where modin is faster
  • Modin is pretty fast in the apply and concat functions, but quite slow in others. It’s worth noting that modin is slower than pandas in many tests (merge, filter, groupby, etc.)
  • Datatable holds up well and is super fast at simple column calculations (it registered 0 seconds in the above tests, so I had to plot a very small number greater than 0 to show that it did take some time to calculate)

Now let’s see how these four libraries perform on larger datasets.

My computer is decent (a 12-core CPU and 64GB of RAM); however, modin crashed when running the merge tests for the 50 million and 100 million row datasets.

50 Million Rows

50 Million Row Test

100 Million Rows

100 Million Row Test

From the tests on the larger datasets, we can also see that polars performs better than all the other libraries in most of our tests. Some of the highlights include:

  • ~17x faster than pandas when reading CSV files
  • ~10x faster than pandas when merging two dataframes
  • ~2-3x faster than pandas for our other tests

Conclusion

Although we did not test every aspect of these four libraries, the operations in our tests are pretty common in data analysis work. Our results show that replacing pandas with polars will likely increase the speed of our Python program by at least 2-3 times.

Additional Resources

Create calculated columns in a dataframe

How To Use Pandas Groupby To Summarize Data
