Convert PDF to Excel with Python

In this tutorial, we’ll take a look at how to convert PDF to Excel with Python. If you work with data, the chances are that you have had, or will have to deal with data stored in a .pdf file. It’s difficult to copy a table from PDF and paste it directly into Excel. In most cases, what we copy from the PDF file is text, instead of formatted Excel tables. Therefore, when pasting the data into Excel, we see a chunk of text squeezed into one cell.

Of course, we don’t want to copy and paste individual values one by one into Excel. There are several commercial software that allows PDF to Excel conversion, but they charge a hefty fee. If you are willing to learn a little bit of Python, it takes less than 10 lines of code to achieve a reasonably good result.

We’ll extract the COVID-19 cases by country from the WHO’s website. I’m attaching it here in case the source file gets removed later.

Step 1. Install Python library and Java

tabula-py is a Python wrapper of tabula-java, which can read tables in PDF file. It means that we need to install Java first. The installation takes about 1 minute, and you can follow this link to find the Java installation file for your operating system: https://java.com/en/download/help/download_options.xml.

Once you have Java, install tabula-py with pip:

pip install tabula-py

We are going to extract the table on page 3 of the PDF file. tabula.read_pdf() returns a list of dataframes. For some reason, tabula detected 8 tables on this page, looking through them, we see that the second table is what we want to extract. Thus we specify that we want to get the second element of that list using [1].

import tabula
df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]

If this is your first time installing Java and tabula-py, you might get the following error message when running the above 2 lines of code:

tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`

Which is due to Java folder is not in the PATH system variable. Simply add your Java installation folder to the PATH variable. I used the default installation, so the Java folder is C:\Program Files (x86)\Java\jre1.8.0_251\bin on my laptop.

Add Java to PATH

Now the script should run.

By default, tabula-py will extract tables from PDF file into a pandas dataframe. Let’s take a look at the data by inspecting the first 10 rows with .head(10):

Table extracted from PDF

We immediately see two problems with this unprocessed table: the header row contains weird letters “\r”, and there are many NaN values. We’ll have to do a little bit further clean up to make the data useful.

Step 2. Clean up the header row

Let’s first clean up the header row. df.columns returns the dataframe header names.

Dataframe header

We can replace the “\r” in the header by doing the following:

df.columns = df.columns.str.replace('\r', ' ')

.str returns all of the string values of the header, then we can perform the .replace() function to replace “\r” with a space. Then, we assign the clean string values back to the dataframe’s header (columns)

Step 3. Remove NaN values

Next, we’ll clean those NaN values, which were created by the function tabula.read_pdf(), for whenever a particular cell is blank. These values cause troubles for us when doing data analysis, so most of the time we’ll remove them. Glancing through the table, it appears we can remove the rows that contain NaN values without losing any data points. Lucky for us, pandas provide a convenient way to remove rows with NaN values.

data = df.dropna()
data.to_excel('data.xlsx')
Clean dataframe

Putting it all together

import tabula
df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]

df.columns = df.columns.str.replace('\r', ' ')
data = df.dropna()
data.to_excel('data.xlsx')

Now you see, it takes only 5 lines of code to convert PDF to Excel with Python. It’s simple and powerful. The best part? You control what you want to extract, keep, and change!