Manipulating pages¶
pikepdf presents the pages in a PDF through the pikepdf.Pdf.pages property, which follows the list protocol. As such, page numbers begin at 0.
Since one of the most common things people want to do is split and merge PDF pages, we’ll begin by exploring that.
Let’s look at a simple PDF that contains four pages.
In [1]: from pikepdf import Pdf
In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
How many pages?
In [3]: len(pdf.pages)
Out[3]: 4
Thanks to IPython’s rich Python object representations, you can view the PDF while you work on it if you execute this example in a Jupyter notebook. Click the View PDF link below to view the file. You can view the PDF after each change you make. If you’re reading this documentation online or as part of a distribution, you won’t see the rich representation.
In [4]: pdf
Out[4]: View PDF
You can also examine individual pages, which we’ll explore in the next section. Suffice it to say that you can access pages by indexing and slicing them.
In [5]: pdf.pages[-1].MediaBox
Out[5]: pikepdf.Array([ 0, 0, 612, 792 ])
Reversing the order of pages¶
Suppose the file was scanned backwards, a common problem with automatic document scanners. We can easily reverse it in place.
In [6]: pdf.pages.reverse()
In [7]: pdf
Out[7]: <pikepdf.Pdf description='../tests/resources/fourpages.pdf'>
Pretty nice, isn’t it? Of course, the pages in this file are in correct order, so let’s put them back.
In [8]: pdf.pages.reverse()
Deleting pages¶
Removing and adding pages is easy too.
In [9]: del pdf.pages[1:3] # Remove pages 2-3 labeled "second page" and "third page"
In [10]: pdf
Out[10]: <pikepdf.Pdf description='../tests/resources/fourpages.pdf'>
We’ve trimmed down the file to its essential first and last page.
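When the pages to remove are not contiguous, a slice is not enough. The following is a minimal sketch, assuming only the list protocol described above; `delete_pages` is a hypothetical helper, not part of pikepdf, and it is demonstrated on a plain list so it runs without any PDF file.

```python
def delete_pages(pages, indices):
    """Delete the given zero-based, non-negative indices from a list-like object.

    `pages` can be any mutable sequence that supports `del pages[i]`;
    this helper is hypothetical and is not provided by pikepdf.
    """
    # Work from the back so that earlier indices stay valid as the list shrinks.
    for i in sorted(set(indices), reverse=True):
        del pages[i]


# Stand-in for the four-page document used in this section.
demo = ["first page", "second page", "third page", "fourth page"]
delete_pages(demo, [1, 2])
print(demo)  # ['first page', 'fourth page']
```

Called as `delete_pages(pdf.pages, [1, 2])`, this should have the same effect as `del pdf.pages[1:3]`.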
Copying pages from other PDFs¶
Now, let’s add some content from another file. Because pdf.pages behaves like a list, we can use pages.extend() on another file’s pages.
In [11]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
In [12]: appendix = Pdf.open('../tests/resources/sandwich.pdf')
In [13]: pdf.pages.extend(appendix.pages)
We can use pages.insert() to insert a page at a specific position, bumping everything else ahead.
In [14]: graph = Pdf.open('../tests/resources/graph.pdf')
In [15]: pdf.pages.insert(1, graph.pages[0])
In [16]: len(pdf.pages)
Out[16]: 6
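To place several pages at a given position while keeping their order, one option is to call `insert()` repeatedly with an increasing offset. The sketch below assumes only the list protocol; `insert_pages` is a hypothetical helper name, not a pikepdf function, and it is shown on a plain list.

```python
def insert_pages(pages, position, new_pages):
    """Insert every item of `new_pages` into `pages`, starting at `position`.

    Hypothetical helper: it relies only on `insert()`, so it should work
    on any list-like object.
    """
    # Copy first, in case `new_pages` is the same object as `pages`.
    for offset, page in enumerate(list(new_pages)):
        pages.insert(position + offset, page)


demo = ["p1", "p2", "p3"]
insert_pages(demo, 1, ["g1", "g2"])
print(demo)  # ['p1', 'g1', 'g2', 'p2', 'p3']
```

With the objects from this section, the call would look like `insert_pages(pdf.pages, 1, graph.pages)`.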
We can also replace specific pages with assignment (or slicing).
In [17]: congress = Pdf.open('../tests/resources/congress.pdf')
In [18]: pdf.pages[2] = congress.pages[0]
Saving changes¶
Naturally, you can save your changes with pikepdf.Pdf.save(). filename can be a pathlib.Path, which we accept everywhere. (Saving is commented out to avoid upsetting the documentation generator.)
In [19]: pdf.save('output.pdf')
You may save a file multiple times, and you may continue modifying it after saving.
Split a PDF into one-page PDFs¶
All we need is a new PDF to hold the destination page.
In [20]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
In [21]: for n, page in enumerate(pdf.pages):
....:     dst = Pdf.new()
....:     dst.pages.append(page)
....:     dst.save('{:02d}.pdf'.format(n))
....:
Note
This example will transfer data associated with each page, so that every page stands on its own. It will not transfer some metadata associated with the PDF as a whole, such as the list of bookmarks.
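The same idea extends to splitting a file into groups of several pages. This is a rough sketch rather than a definitive recipe: `chunk_ranges` and `split_into_chunks` are hypothetical helpers, the `chunk-NN.pdf` output names are an assumption, and pikepdf is imported inside the function so that the range calculation can be tried on its own.

```python
def chunk_ranges(page_count, chunk_size):
    """Return (start, stop) index pairs covering `page_count` pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def split_into_chunks(src_path, chunk_size):
    """Write consecutive groups of pages from `src_path` to separate files."""
    from pikepdf import Pdf  # imported here so chunk_ranges() works without pikepdf

    with Pdf.open(src_path) as pdf:
        ranges = chunk_ranges(len(pdf.pages), chunk_size)
        for n, (start, stop) in enumerate(ranges):
            dst = Pdf.new()
            # Slicing pdf.pages yields the pages in that range.
            dst.pages.extend(pdf.pages[start:stop])
            dst.save(f"chunk-{n:02d}.pdf")  # output name is an assumption


print(chunk_ranges(4, 3))  # [(0, 3), (3, 4)]
```

Each destination file is saved while the source is still open, since the pages are copied from it.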
Merging a PDF from several files¶
You might be able to guess.
In [22]: from glob import glob
In [23]: pdf = Pdf.new()
In [24]: for file in glob('*.pdf'):
....:     src = Pdf.open(file)
....:     pdf.pages.extend(src.pages)
....:
In [25]: pdf.save('merged.pdf')
Note
This code sample does not deduplicate objects. The resulting file may be large if the source files have content in common.
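Two practical details are worth handling when merging: `glob` returns files in arbitrary order, and a second run would pick up the previously written `merged.pdf` as an input. The following sketch, under those assumptions, sorts the inputs by name and skips the output file; `merge_inputs` and `merge_directory` are hypothetical names, not pikepdf functions.

```python
from pathlib import Path


def merge_inputs(directory, output_name):
    """Return the PDF paths in `directory`, sorted by name, minus the output file."""
    return sorted(
        path for path in Path(directory).glob("*.pdf") if path.name != output_name
    )


def merge_directory(directory, output_name="merged.pdf"):
    """Concatenate every PDF in `directory` into one file (hypothetical helper)."""
    from pikepdf import Pdf  # imported here so merge_inputs() works without pikepdf

    merged = Pdf.new()
    for path in merge_inputs(directory, output_name):
        src = Pdf.open(path)  # sources must remain open until the merged file is saved
        merged.pages.extend(src.pages)
    merged.save(Path(directory) / output_name)
```

Sorting by file name is only one possible ordering; a caller could just as well pass an explicit list of paths.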
Using counting numbers¶
Because PDF pages are usually numbered in counting numbers (1, 2, 3…), pikepdf provides a convenience accessor .p() that uses counting numbers:
In [26]: pdf.pages.p(1) # The first page in the document
In [27]: pdf.pages[0] # Also the first page in the document
To avoid confusion, the .p() accessor does not accept Python slices, and .p(0) raises an exception. It is also not possible to delete using it.
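When a counting number has to be used for an operation that .p() does not cover, such as deletion, it can be converted to a zero-based index first. The helper below is a hypothetical sketch, not part of pikepdf; it rejects 0 and out-of-range numbers in the same spirit as .p().

```python
def index_from_page_number(number, page_count):
    """Convert a counting number (1, 2, 3...) to a zero-based index."""
    if not 1 <= number <= page_count:
        raise IndexError(f"page number {number} is outside 1..{page_count}")
    return number - 1


print(index_from_page_number(1, 4))  # 0
```

A deletion by counting number could then be written as `del pdf.pages[index_from_page_number(2, len(pdf.pages))]`.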
PDFs may define their own numbering scheme or different numberings for different sections, such as using Roman numerals for an introductory section. .pages does not look up this information.
Note
Because of technical limitations in underlying libraries, pikepdf keeps the source PDF open when content is copied from it to another PDF, even when all Python variables pointing to the source are removed. If a PDF is being assembled from many sources, then all of those sources are held open in memory. This memory can be released by saving and re-opening the PDF.
Warning
It’s possible to obtain page information through the PDF /Root object as well, but this is not recommended. The internal consistency of the various /Page and /Pages objects is not guaranteed when accessed in this manner, and in some PDFs the data structure for these is fairly complex. Use the .pages interface.