Last Updated on July 14, 2022 by Jay
In this short tutorial, I will walk you through how to split and merge PDF files using Python.
I once received a 20-page PDF bank statement and needed to forward just 3 of the pages to another party. I didn’t want to send the whole file because some pages contained personal information that I wasn’t comfortable sharing. So I needed a way to split a PDF file. Adobe Acrobat Pro DC can split and merge PDF files, but at roughly $200 USD per year, no thanks!
As usual, I turned to Python for this situation. Who doesn’t love a free solution?
Install Python library and load a PDF file into Python
To work with PDF files, we’ll use the PyPDF4 library. Use pip install to get it:
pip install PyPDF4
We’ll instantiate (read: create) a PdfFileReader object to represent the PDF file. Later, we’ll also need to instantiate a PdfFileWriter object to save PDF files. To read files sitting on my computer, I like to use a raw string (r-string) because of its simple syntax: backslashes in a Windows path are taken literally, so they don’t need to be escaped.
from PyPDF4 import PdfFileReader, PdfFileWriter
pdf = PdfFileReader(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\data.pdf')
Now we have an object called pdf to represent the actual PDF file, and we can access the information contained in it. In this example, I’m using the same WHO Covid report that I used in another tutorial (convert PDF to Excel using Python). Feel free to download the PDF to follow along.
Extract basic info about the PDF file
Let’s check some basic info about this PDF file. It looks like the author used MS Word to create this 12-page document and then converted it to PDF. That sounds about right!
pdf.numPages
pdf.getDocumentInfo()
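The raw output of getDocumentInfo() is a dictionary-like object whose keys start with a slash (for example /Author or /Producer). As a small sketch, the hypothetical helper below turns such a mapping into a readable summary. The helper name and the sample values are my own assumptions for illustration; they are not taken from the report.

```python
def summarize_info(info, num_pages):
    """Format a document-info mapping and a page count as readable text."""
    lines = [f"Pages: {num_pages}"]
    # getDocumentInfo() can return None when a file carries no metadata.
    for key in (info or {}):
        label = str(key).lstrip("/")  # "/Author" -> "Author"
        lines.append(f"{label}: {info[key]}")
    return "\n".join(lines)


# Hypothetical sample values, standing in for pdf.getDocumentInfo().
sample_info = {"/Author": "Example Author", "/Producer": "Example Producer"}
print(summarize_info(sample_info, 12))
```

With the real file, the call would be summarize_info(pdf.getDocumentInfo(), pdf.numPages).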
For demonstration, I’m going to pick some arbitrary pages to extract from the file. Let’s say I want only pages 1-5, 11, and 12. We can construct a list to store the page numbers: [1,2,3,4,5,11,12]. A heads-up: we’ll have to slightly modify this list later on.
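Typing out a long page list by hand gets tedious for bigger documents. Here is a minimal sketch of a helper that expands a compact description such as "1-5,11-12" into the same list. The function name parse_page_spec and the spec format are my own assumptions, not part of PyPDF4.

```python
def parse_page_spec(spec):
    """Expand a spec such as "1-5,11-12" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


pages = parse_page_spec("1-5,11-12")
print(pages)  # [1, 2, 3, 4, 5, 11, 12]
```

The result still uses human-friendly page numbers that start at 1.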
Get pages from the PDF file
We can use pdf.getPage() to get a specific page from the pdf object. Just keep in mind that Python indexing starts from 0 instead of 1, and many Python libraries follow this convention. pdf.getPage(0) is the first page of the PDF file, and pdf.getPage(11) is the last page. Calling pdf.getPage(12) will throw an “index out of range” error because that means you are trying to access the 13th page in a 12-page file. Don’t mind all the gibberish displayed from pdf.getPage(0); just know that this object is the first page. The .getPage() method lets us split a PDF file into individual pages, so that we can pick and choose pages and then merge them into one file later on using Python.
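To catch an out-of-range page before the library complains, you could validate the indices up front. The sketch below is one way to do it. check_indices is a hypothetical helper of mine, and it deliberately treats negative numbers as mistakes too.

```python
def check_indices(indices, num_pages):
    """Raise ValueError if any 0-based index falls outside a num_pages-page file."""
    bad = [i for i in indices if not 0 <= i < num_pages]
    if bad:
        raise ValueError(
            f"page index out of range for a {num_pages}-page file: {bad}"
        )
    return list(indices)


# In a 12-page file, indices 0 through 11 are valid and 12 is not.
print(check_indices([0, 11], 12))
try:
    check_indices([0, 12], 12)
except ValueError as err:
    print(err)
```

With the pdf object from above, the second argument would simply be pdf.numPages.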
Create and save a PDF file
Now we have successfully extracted a page from the PDF. To save it as a separate file, we need to create a PdfFileWriter() object, add the page(s) to it, and then save it to our computer. The following code performs these steps. Note that 'wb' in the open() function means “write binary”.
pdf_writer = PdfFileWriter()
pdf_writer.addPage(pdf.getPage(0))
with open(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\page_1.pdf', 'wb') as f:
    pdf_writer.write(f)
Merge multiple pages into the same PDF file
We can now go ahead and get all the desired pages from the PDF and merge them into one file. Remember the list of page numbers that we created earlier? pages = [1,2,3,4,5,11,12]. We need to shift every number down by 1 because of Python’s 0-based indexing. Just loop through all the numbers and subtract one from each. Easy, right? The Pythonic way of doing this is a list comprehension, sometimes called a “one-liner for loop”. It goes like this:
pages = [i-1 for i in pages]
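For comparison, here is the same shift written both as an explicit loop and as a list comprehension. The variable names shifted and shifted_short are only for illustration.

```python
pages = [1, 2, 3, 4, 5, 11, 12]

# Explicit loop: build the shifted list one element at a time.
shifted = []
for i in pages:
    shifted.append(i - 1)

# List comprehension: the same result in one line.
shifted_short = [i - 1 for i in pages]

print(shifted)                   # [0, 1, 2, 3, 4, 10, 11]
print(shifted == shifted_short)  # True
```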
Now we have the correct page index, and we can complete the PDF merging process.
pdf_writer = PdfFileWriter()
for p in pages:
    pdf_writer.addPage(pdf.getPage(p))
with open(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\select_pages.pdf', 'wb') as f:
    pdf_writer.write(f)
Excited to see the results, we are instead greeted by a weird error message: AttributeError: 'PdfFileWriter' object has no attribute 'stream'. It turns out that there is a bug in the PyPDF4 library: every time you finish saving a PDF file and want to save another one, you have to re-create the PdfFileReader() object. We saw this error because I saved a one-page PDF file earlier. So don’t forget to always create a new PdfFileReader() object before running the PdfFileWriter().write() method again.
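One way to make that habit automatic is to wrap the steps in a small function that builds a new reader and a new writer on every call. This is only a sketch: save_pages is a hypothetical helper name, and it assumes the page indices are already 0-based.

```python
def save_pages(src_path, dst_path, indices):
    """Write the pages at the given 0-based indices of src_path to dst_path."""
    # Imported inside the function so a missing install only fails on use.
    from PyPDF4 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(src_path)  # fresh reader for every output file
    writer = PdfFileWriter()          # fresh writer as well
    for i in indices:
        writer.addPage(reader.getPage(i))
    with open(dst_path, "wb") as f:
        writer.write(f)
```

Calling save_pages('data.pdf', 'page_1.pdf', [0]) and then save_pages('data.pdf', 'select_pages.pdf', [0, 1, 2, 3, 4, 10, 11]) saves both files without sharing a reader between them. The short file names here are placeholders for your own paths.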
What if I want to combine multiple PDF files?
Well, the process is the same as outlined above, so I won’t repeat it here. This one is homework for you! Leave a comment below if you need help, but I’ll give you a hint:
- Loop through the PDF files you want to merge
- Within each PDF file, loop through the pages, and add each page to the PdfFileWriter object
- Save the new PDF by calling the PdfFileWriter.write() method
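If you get stuck, the three steps in the hint could be sketched roughly like this. Treat it as one possible solution rather than the only one. merge_pdfs is a hypothetical function name, and the file names in the usage note are placeholders.

```python
def merge_pdfs(paths, out_path):
    """Append every page of each PDF in paths, in order, and save to out_path."""
    # Imported inside the function so a missing install only fails on use.
    from PyPDF4 import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    for path in paths:                      # step 1: loop through the files
        reader = PdfFileReader(path)
        for i in range(reader.numPages):    # step 2: loop through the pages
            writer.addPage(reader.getPage(i))
    with open(out_path, "wb") as f:         # step 3: save the new PDF
        writer.write(f)
```

For example, merge_pdfs(['first.pdf', 'second.pdf'], 'merged.pdf') would write all pages of first.pdf followed by all pages of second.pdf.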
Putting it together
Below is the full code that allows you to split and merge PDF files using Python:
from PyPDF4 import PdfFileReader, PdfFileWriter
pdf = PdfFileReader(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\data.pdf')
pdf_writer = PdfFileWriter()
print(pdf.numPages)
print(pdf.getDocumentInfo())
pages = [1,2,3,4,5,11,12]
pages = [i-1 for i in pages]
for p in pages:
    pdf_writer.addPage(pdf.getPage(p))
with open(r'C:\Users\JZ\Desktop\PythonInOffice\split_and_merge_pdf\select_pages.pdf', 'wb') as f:
    pdf_writer.write(f)
Conclusion
We have walked through how to work with PDF files using Python:
- Getting basic info about a PDF file
- Splitting a PDF file by extracting individual pages
- Merging pages into a new PDF file
Now enjoy this little tool, and oh, don’t forget to buy me a cup of coffee, because I just helped you save $200 per year!