How to open large csv or text files using Python

Last Updated on April 9, 2023 by Jay

In this tutorial, we’ll learn how to open very large csv or text files using Python. My colleague once received a large csv file of 8GB. He wanted to take a peek at the content, but he couldn’t open it with any of the programs he tried (Notepad, Excel, etc.). The file was simply too large for those programs to load.

So he reached out to me for help. His request was relatively simple: open up the 8GB csv file and perhaps look at the data in the first couple of thousand lines. This seemingly impossible task is easy when you pick the right tool – Python.

Our Approach

We’ll first explore ways to examine the content of the large csv. Then we’ll break the large file down into smaller files, so the data becomes workable in Excel.

For demonstration purposes, we won’t use an 8GB large csv file; instead, we’ll use a smaller file with only 2,600 rows of data. If you want to follow along, feel free to download the example from here.

The Solution

As always, we start by importing the necessary libraries; for this exercise, we just need pandas.

import pandas as pd

# Load the entire file into a dataframe
df = pd.read_csv('large_data.csv')

# Load only the first 1,000 rows
df_small = pd.read_csv('large_data.csv', nrows=1000)

pd.read_csv() allows us to read any .csv file into Python, regardless of the file size – more on this point later. A csv file is a comma-separated values file, which is basically a text file. This pandas method has an optional argument nrows, which specifies the number of rows you want to load.

The first variable, df, loaded everything inside the csv file, whereas the second variable, df_small, loaded only the first 1,000 rows of data. We can check that by calling the .shape attribute.
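A quick sanity check looks like this (the output shown in the comments assumes our example file, which has 2,618 rows and 11 columns):

print(df.shape)        # (2618, 11) - every row and column in the file
print(df_small.shape)  # (1000, 11) - only the first 1,000 rows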

As shown above, the “large_data.csv” file contains 2,618 rows and 11 columns of data in total. We can also confirm that the df_small variable holds only the first 1,000 rows, again with 11 columns.

Typing df_small.head() shows the first 5 rows of data within the df_small dataframe. We can take a peek into the large file this way!
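If five rows isn’t enough of a peek, head() also accepts an explicit row count:

df_small.head()    # first 5 rows by default
df_small.head(20)  # pass a number to see more, e.g. the first 20 rows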

Next, what if we want to be able to open the data file using only Excel? You know, people like Excel, so we have to stick to it!

Although we can’t use magic to make Excel open an 8GB file, we can “divide and conquer” by breaking it down into smaller files – for example, 8 files of 1GB each, or 16 files of 500MB each. A modern version of Excel can handle those file sizes easily, keeping in mind that a single Excel sheet holds at most 1,048,576 rows.

This time, we’ll load the file slightly differently – with the optional argument chunksize. Again, for demonstration purposes, we are using a much smaller file.

# Read the file in chunks of 900 rows each
df = pd.read_csv('large_data.csv', chunksize=900)

Without getting into too much technical detail, the chunksize argument tells pandas to read the data in chunks of 900 rows each. Instead of a dataframe, read_csv now returns an iterator that yields one dataframe per chunk, and the number of chunks is determined automatically. Given that our csv file contains 2,618 rows, we expect 2618 / 900 ≈ 2.9, which rounds up to 3 chunks in total: the first two chunks contain 900 rows each, and the last chunk contains the remaining 818 rows.

Let’s see if that’s true.
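A minimal way to verify this is to loop over the chunks and print each one’s shape (the expected output below assumes our 2,618-row example file). Note that the iterator is exhausted once consumed, so after this check we need to call read_csv again before saving the chunks in the next step:

for chunk in df:
    print(chunk.shape)

# Expected output:
# (900, 11)
# (900, 11)
# (818, 11)

# Re-create the iterator, since the loop above consumed it
df = pd.read_csv('large_data.csv', chunksize=900)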

We’ve successfully loaded the file and broken it down into smaller pieces; next, let’s save each piece into its own smaller file.

# Save each chunk to its own numbered csv file
i = 1
for file in df:
    file.to_csv(f'file_{i}.csv')
    i += 1

Putting it All Together

import pandas as pd

# Read the large file in chunks of 900 rows each
df = pd.read_csv('large_data.csv', chunksize=900)

# Load only the first 1,000 rows for a quick preview
df_small = pd.read_csv('large_data.csv', nrows=1000)

# Save each chunk to its own numbered csv file
i = 1
for file in df:
    print(file.shape)
    file.to_csv(f'file_{i}.csv')
    i += 1
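Running the script prints the shape of each chunk as it is saved; with our example file, the output should look like this:

(900, 11)
(900, 11)
(818, 11)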

We used only 8 lines of code to solve what seems impossible to achieve in Excel. I hope you are starting to love Python ❤️!
