Last Updated on August 15, 2022 by Jay
Do you find the pandas library slow when handling large amounts of data, and want to make your programs go faster? There are best practices (e.g. vectorization) for using pandas, but what we'll discuss here can make our code run much faster even beyond adopting those best practices.
tl;dr – we need to use other data processing libraries in order to make our program go faster. Do not be discouraged as these libraries all have a similar syntax to pandas, so it’s pretty easy to learn how to use a new framework!
Why Pandas Is Slow
The pandas library is already pretty fast, thanks to the underlying numpy array data structure and C code. However, all Python code runs on a single CPU thread by default, and that is what makes pandas slow. As the screenshot below shows, when I run pandas code with the default settings, most of the CPU cores are doing nothing; only a few (highlighted red) are at work. As a result, only 9% of the CPU is being used.
One way to make our code run faster is to utilize multiple CPU cores at once, i.e. multi-processing. So we want our CPU utilization to look like the following – every single core and the RAM maxing out close to 100%!
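To make the multi-processing idea concrete, here is a minimal sketch of splitting a DataFrame into chunks and processing them on separate CPU cores with the standard library's `ProcessPoolExecutor`. The `double_values` function and the `"value"` column are placeholders I made up for illustration; this is not how the libraries below work internally, just the general principle of spreading work across cores.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def double_values(chunk):
    # Stand-in for any per-chunk work (a real workload would go here)
    return chunk.assign(doubled=chunk["value"] * 2)

def parallel_apply(df, n_workers=4):
    # Split the frame into one chunk per worker, process chunks in parallel,
    # then stitch the results back together in the original row order
    indices = np.array_split(np.arange(len(df)), n_workers)
    chunks = [df.iloc[idx] for idx in indices]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(double_values, chunks))
    return pd.concat(results).sort_index()

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.arange(1_000)})
    out = parallel_apply(df, n_workers=4)
    print(out.head())
```

For tiny frames like this one, the overhead of starting worker processes outweighs the gain; the payoff comes with large datasets and expensive per-row work.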
Three Data Analysis Libraries That Are Faster Than Pandas
We are going to take a brief look at the following three Python libraries which are capable of running at lightning speed:
- polars – a fast dataframe library implemented in the Rust programming language using Apache Arrow Columnar Format memory model (I’m not going to pretend I know what this means…)
- datatable – closely related to R’s data.table library
- modin – uses all available CPU cores to run pandas, basically a drop-in replacement for pandas
```shell
## polars
pip install polars

## datatable
pip install datatable

## modin
pip install modin[all]
```
We will run several benchmark tests using the four libraries on the same machine and see how they compare.
We’ll use the following four randomly generated datasets:
| Number of Rows | Number of Columns | File Size |
|---|---|---|
| 1 million | 9 | 120 MB |
| 10 million | 9 | 1.6 GB |
| 50 million | 9 | 7.9 GB |
| 100 million | 9 | 16.4 GB |
We created each dataset twice (with different random values) in order to test the performance of the merge() function. I expect merge to be the most challenging test, since it matches up two datasets of equal size. In the most extreme test, it will be matching two 50 million row datasets.
In case you are interested and want to perform the same tests, below is the code to generate these datasets:
```python
import pandas as pd
import numpy as np

def create_dataset(n_rows, n_cols, f_name):
    # Random float columns (the last 3 of n_cols columns are added below)
    df = pd.DataFrame(np.random.rand(n_rows, n_cols - 3))
    df['integer_id'] = np.random.randint(1, 10000, size=n_rows)
    df['text_id'] = df['integer_id'].astype(str) + '_ID'
    df['unique_id_text'] = df.index.astype(str) + '_ID'
    df = df.sample(frac=1)  # shuffle the rows
    df.to_csv(f_name)

rows = 10_000_000
cols = 10
create_dataset(rows, cols, '10_million_1.csv')
```
Here are the tests we performed on the four datasets. We run each test twice and take the average time spent as the final result.
- Read CSV file
- Apply a function to a column
- A simple column calculation (vectorized)
- Vertically concatenate datasets (formerly done with append())
- Merge datasets (using a common key column)
- Filter rows by value
- Filter rows and assign new values
- Groupby data then calculate sum of groups
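In pandas, the operations above look roughly like the following sketch. This runs on a small stand-in frame (the column names and sizes here are placeholders, not the actual benchmark datasets), but each line corresponds to one of the listed tests:

```python
import numpy as np
import pandas as pd

# Small stand-in frames; the real tests used the generated CSVs
df1 = pd.DataFrame({
    "integer_id": np.random.randint(1, 100, size=1_000),
    "value": np.random.rand(1_000),
})
df2 = df1.sample(frac=1).reset_index(drop=True)

df1["applied"] = df1["value"].apply(lambda x: x * 2)            # apply a function
df1["vectorized"] = df1["value"] * 2                             # vectorized calc
stacked = pd.concat([df1, df1], ignore_index=True)               # vertical concat
merged = df1.merge(df2, on="integer_id", suffixes=("_l", "_r"))  # merge on a key
filtered = df1[df1["value"] > 0.5]                               # filter rows
df1.loc[df1["value"] > 0.5, "flag"] = 1                          # filter + assign
sums = df1.groupby("integer_id")["value"].sum()                  # groupby then sum
```

The three other libraries expose near-identical spellings of each of these operations, which is what makes swapping them in relatively painless.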
Testing Results – Which Library Is Faster Than Pandas?
We will use pandas as the performance baseline and compare the other three libraries against it. It was difficult to replicate the apply and merge functions in the datatable library, so we skipped those tests for datatable. The tables below show the time in seconds each library needed for each operation.
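The "run each test twice and average" methodology can be captured with a small timing helper like this sketch (the helper name and the toy workload are my own, not from the original benchmark code):

```python
import time

def time_it(fn, n_runs=2):
    """Run fn n_runs times and return the average elapsed time in seconds."""
    total = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()  # the operation being benchmarked
        total += time.perf_counter() - start
    return total / n_runs

# Example: time a trivial computation
avg = time_it(lambda: sum(range(100_000)))
```

`time.perf_counter()` is preferable to `time.time()` here because it is a monotonic, high-resolution clock intended for measuring short durations.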
1 Million Rows
10 Million Rows
We see similar results for the tests performed on the 1 million rows dataset and 10 million rows dataset.
- The polars library wins in all tests except the apply function, where modin is faster
- Modin is pretty fast in the apply and concat functions, but pretty slow in the others. It's worth noting that modin is slower than pandas in many tests (merge, filter, groupby, etc.)
- Datatable performs reasonably overall and is extremely fast at simple column calculations (it registered 0 seconds in the above tests, so I substituted a very small positive number to indicate that the computation did take some time)
Now let’s see how these four libraries perform on larger datasets.
My computer is decent (a 12-core CPU and 64 GB of RAM); however, modin crashed when running the merge tests on the 50 million and 100 million row datasets.
50 Million Rows
100 Million Rows
From the tests on the larger datasets, we can also see that polars continues to outperform the other libraries in most of our tests. Some of the highlights include:
- ~17x faster than pandas when reading csv files
- ~10x faster than pandas when merging two dataframes
- ~2-3x faster than pandas for our other tests
Although we did not test every aspect of these four libraries, the operations in our tests are pretty common in data analysis work. Our results show that replacing pandas with polars will likely increase the speed of our Python program by at least 2-3 times.