Last Updated on July 14, 2022 by Jay
In this short tutorial, I will walk you through how to split and merge PDF files using Python.
I once received a 20-page PDF bank statement, and I needed to forward just 3 of the pages to another party. I didn’t want to send the whole file because some pages contained personal information that I wasn’t comfortable sharing. So I needed a way to split a PDF file. Adobe Acrobat Pro DC allows you to split and merge PDF files, but at a cost of around $200 USD/year. No thanks!
As usual, I turned to Python for this situation. Who doesn’t love a free solution?
Install Python library and load a PDF file into Python
To work with PDF files, we’ll use the PyPDF4 library. Use pip install to get it:
pip install PyPDF4
We’ll instantiate (read: create) a PdfFileReader object to represent the PDF file. Later, we’ll need to instantiate a PdfFileWriter object to save PDF files. To read files sitting on my computer, I like to use a raw string (r-string) because of its simple syntax.
from PyPDF4 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\data.pdf')
Now we have an object called pdf.
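A quick aside on the raw string used for the path: here is a small sketch of why the r prefix is convenient on Windows. The shortened path is made up for illustration only.

```python
# A raw string keeps every backslash as-is, so a Windows path can be pasted directly.
raw = r'C:\Users\JZ\Desktop\data.pdf'

# Without the r prefix, each backslash has to be doubled. Left single, '\U' would be
# read as the start of a Unicode escape and Python would refuse to parse the string.
escaped = 'C:\\Users\\JZ\\Desktop\\data.pdf'

print(raw == escaped)  # True
```

Both spellings produce the same string; the r-string is simply less typing and easier to read.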
Extract basic info about the PDF file
Let’s check some basic info about this PDF file using pdf.numPages and pdf.getDocumentInfo(). It looks like the author used MS Word to create this 12-page document and then converted it to PDF. That sounds about right!
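If you want that info in one place, a minimal sketch could look like the following. The helper name describe_pdf is my own invention, not part of PyPDF4, and I am assuming the metadata object exposes author, creator, and producer attributes (and that it may be missing entirely for some files).

```python
def describe_pdf(reader):
    """Collect a few basic facts from a PdfFileReader-like object.

    Hypothetical helper: it only relies on `numPages` and `getDocumentInfo()`.
    """
    info = reader.getDocumentInfo()  # assumed to be None when the file has no metadata
    return {
        "pages": reader.numPages,
        "author": getattr(info, "author", None),
        "creator": getattr(info, "creator", None),    # e.g. the program that wrote the document
        "producer": getattr(info, "producer", None),  # e.g. the tool that converted it to PDF
    }
```

With the pdf object from earlier, calling describe_pdf(pdf) should return a small dictionary you can print or inspect.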
For demonstration, I’m going to pick some random pages to extract from the file. Let’s say I want to get only pages 1-5 and 11-12. We can construct a list to store the page numbers: [1,2,3,4,5,11,12]. A heads-up: we’ll have to slightly modify this list later on.
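Typing the list by hand is fine for a handful of pages. For longer selections, here is a rough sketch of a helper that expands a string such as "1-5, 11-12" into the same list. The function name parse_page_ranges and the input format are my own assumptions, not something PyPDF4 provides.

```python
def parse_page_ranges(spec):
    """Expand a string like "1-5, 11-12" into a list of page numbers.

    Hypothetical helper: accepts single pages and inclusive ranges separated by commas.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))  # ranges include both ends
        else:
            pages.append(int(part))
    return pages


print(parse_page_ranges("1-5, 11-12"))  # [1, 2, 3, 4, 5, 11, 12]
```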
Get pages from the PDF file
We can use pdf.getPage() to get a specific page from the file. pdf.getPage(0) is the first page of the PDF file, and pdf.getPage(11) is the last page. Calling pdf.getPage(12) will throw an “index out of range” error because that means you are trying to access the 13th page in a 12-page file. Don’t mind all the gibberish displayed from pdf.getPage(0); just know that this object is the first page. The .getPage() method allows us to split a PDF file into individual pages so that we can pick and choose, then merge them into one file later on using Python.
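If the bare “index out of range” message feels too cryptic, one option is to check the index yourself first. This is only a sketch; get_page_checked is a name I made up, and it relies on nothing more than numPages and getPage().

```python
def get_page_checked(reader, index):
    """Return the page at a 0-based index, with a friendlier error message.

    Hypothetical helper wrapping reader.getPage().
    """
    total = reader.numPages
    if not 0 <= index < total:
        raise IndexError(
            f"page index {index} is outside 0..{total - 1} for a {total}-page file"
        )
    return reader.getPage(index)
```

For the 12-page file above, get_page_checked(pdf, 11) would return the last page, while get_page_checked(pdf, 12) would raise an error that spells out the valid range.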
Create and save a PDF file
Now that we have successfully extracted a page from the PDF, let’s save it as a separate file. We’ll need to create a PdfFileWriter() object, add the page(s) to it, and then save it to our computer. The following code executes these steps. Note that ‘wb’ in the open() function means “write binary”.
pdf_writer = PdfFileWriter()
pdf_writer.addPage(pdf.getPage(0))

with open(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\page_1.pdf', 'wb') as f:
    pdf_writer.write(f)
Merge multiple pages into the same PDF file
We can now go ahead and get all the desired pages from the PDF and merge them into one file. Remember the list of page numbers that we created earlier, pages = [1,2,3,4,5,11,12]? We need to shift every number by 1 because Python uses 0-based indexing. Just loop through all the numbers and subtract one from each. Easy, right? The Pythonic way of doing this is called a list comprehension, sometimes described as a “one-liner for loop”. It goes like this:
pages = [i-1 for i in pages]
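In case the list comprehension looks unfamiliar, here is a small side-by-side sketch showing that it does the same job as an ordinary for loop. The variable names shifted_loop and shifted_comprehension are just for illustration.

```python
pages = [1, 2, 3, 4, 5, 11, 12]

# The long way: an explicit loop that builds a new list.
shifted_loop = []
for i in pages:
    shifted_loop.append(i - 1)

# The one-liner: a list comprehension producing the same result.
shifted_comprehension = [i - 1 for i in pages]

print(shifted_comprehension)  # [0, 1, 2, 3, 4, 10, 11]
```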
Now we have the correct page index, and we can complete the PDF merging process.
pdf_writer = PdfFileWriter()

for p in pages:
    pdf_writer.addPage(pdf.getPage(p))

with open(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\select_pages.pdf', 'wb') as f:
    pdf_writer.write(f)
Excited to see the results, we instead find ourselves welcomed by a weird error message: AttributeError: 'PdfFileWriter' object has no attribute 'stream'. It turns out that there is a bug in the PyPDF4 library: every time you finish saving a PDF file and want to save another one, you have to re-create the PdfFileReader() object. We saw this error because I saved a one-page PDF file earlier. So don’t forget to always create a new PdfFileReader() object before saving another file.
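One way to make that habit automatic is to wrap the whole read-select-save cycle in a function, so a fresh reader is created on every call. This is a sketch built on the workaround described above, not an official fix. The name save_selection and the optional reader_cls / writer_cls parameters are my own additions; they default to the PyPDF4 classes and only exist so the function can be tried out with stand-in objects.

```python
def save_selection(pdf_path, page_numbers, out_path,
                   reader_cls=None, writer_cls=None):
    """Save the given 1-based page numbers from pdf_path into out_path.

    Hypothetical helper: a new reader and writer are created on every call.
    """
    if reader_cls is None or writer_cls is None:
        # Imported lazily; only needed when the defaults are used.
        from PyPDF4 import PdfFileReader, PdfFileWriter
        reader_cls = reader_cls or PdfFileReader
        writer_cls = writer_cls or PdfFileWriter

    reader = reader_cls(pdf_path)  # fresh reader each time, per the workaround
    writer = writer_cls()
    for number in page_numbers:
        writer.addPage(reader.getPage(number - 1))  # 1-based page number -> 0-based index

    with open(out_path, "wb") as f:
        writer.write(f)
```

With this in place, saving the single page and then the seven-page selection becomes two independent calls, each passing the source path, a list of page numbers, and an output path.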
What if I want to combine multiple PDF files???
Well, the process is the same as outlined above, so I won’t repeat it here. This is homework for you to solve! Leave a comment below if you need help with this one, but I’ll give you a hint:
- Loop through the PDF files you want to merge
- Within each PDF file, loop through the pages, and add each page to the PdfFileWriter object
- Save the new PDF by calling the write() method on the PdfFileWriter object
Putting it together
Below is the full code that allows you to split and merge PDF files using Python:
from PyPDF4 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\data.pdf')
pdf_writer = PdfFileWriter()

# Basic info about the file
print(pdf.numPages)
print(pdf.getDocumentInfo())

# Page numbers as a person counts them, then shifted to 0-based indexes
pages = [1,2,3,4,5,11,12]
pages = [i-1 for i in pages]

for p in pages:
    pdf_writer.addPage(pdf.getPage(p))

with open(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\select_pages.pdf', 'wb') as f:
    pdf_writer.write(f)
We have walked through how to work with PDF files using Python:
- Getting basic info about a PDF file
- Splitting a PDF file by extracting individual pages
- Merging pages into a new PDF file
Now enjoy this little tool, and oh, don’t forget to buy me a cup of coffee, because I just helped you save $200 per year!