v1.1.1 (9th May 2012)
bugfix release to improve parsing of some PDFs
v1.1.0 (25th March 2012)
new PageState class for handling common state tracking in page receivers
see PageTextReceiver for example usage
various bugfixes to support reading more PDF dialects
v1.0.0 (16th January 2012)
support a new encryption variation
bugfix in PageTextReceiver (thanks Paul Gallagher)
v1.0.0.rc1 (19th December 2011)
performance optimisations (all by Bernerd Schaefer)
some improvements to text extraction from form xobjects
assume invalid font encodings are StandardEncoding
use binary mode when opening PDFs to stop ruby being helpful and transcoding bytes for us
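The binary-mode change above can be illustrated with plain Ruby. This is a minimal sketch, not the library's own code: `read_pdf_bytes` is a hypothetical helper name, and the temporary file only stands in for a real PDF.

```ruby
require "tempfile"

# Hypothetical helper for illustration only; not part of the pdf-reader API.
# Opening with "rb" disables newline conversion and encoding transcoding,
# so the returned string holds the file's bytes untouched.
def read_pdf_bytes(path)
  File.open(path, "rb") { |io| io.read }
end

Tempfile.create("sample") do |f|
  f.binmode
  # A PDF header followed by a few high bytes, as commonly seen in real files.
  f.write("%PDF-1.4\n\xE2\xE3\xCF\xD3\n".b)
  f.flush

  data = read_pdf_bytes(f.path)
  puts data.encoding == Encoding::BINARY
  puts data.bytesize
end
```

Reading in text mode instead leaves Ruby free to tag the data with the default external encoding, which is the kind of "help" this entry is avoiding.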
v1.0.0.beta1 (6th October 2011)
ensure inline images that contain “EI” are correctly parsed (thanks Bernerd Schaefer)
fix parsing of inline image data
v0.12.0.alpha (28th August 2011)
small breaking changes to the page-based API - it's alpha for a reason
resource related methods on Page object return raw PDF objects
if the caller wants the resources wrapped in a more convenient Ruby object (like PDF::Reader::Font or PDF::Reader::FormXObject) they will need to do so themselves
add support for RunLengthDecode filters (thanks Bernerd Schaefer)
add support for standard PDF encryption (thanks Evan Brunner)
add support for decoding streams with TIFF prediction
new PDF::Reader::FormXObject class to simplify working with form XObjects
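For readers unfamiliar with the RunLengthDecode filter mentioned above, here is a minimal sketch of the decoding rule as described in the PDF specification. `run_length_decode` is a hypothetical name chosen for this example; it is not the library's implementation.

```ruby
# Sketch of RunLengthDecode, assuming the rules from the PDF specification:
#   length byte 0..127   -> copy the next (length + 1) bytes literally
#   length byte 129..255 -> repeat the next byte (257 - length) times
#   length byte 128      -> end of data
# run_length_decode is a hypothetical name, not a pdf-reader method.
def run_length_decode(data)
  bytes = data.bytes
  out = []
  pos = 0

  while pos < bytes.length
    length = bytes[pos]
    pos += 1

    case length
    when 0..127
      count = length + 1
      out.concat(bytes[pos, count] || [])
      pos += count
    when 128
      break
    else
      out.concat([bytes[pos]] * (257 - length)) if pos < bytes.length
      pos += 1
    end
  end

  out.pack("C*")
end

# Literal run "ABC", then "D" repeated three times, then end of data.
encoded = [2, 65, 66, 67, 254, 68, 128].pack("C*")
puts run_length_decode(encoded)
```

The sketch stops quietly on truncated input rather than raising; a real filter may prefer to report malformed data.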
v0.11.0.alpha (19th July 2011)
introduce experimental new page-based API
old API is deprecated but will continue to work with no warnings
add transparent caching of common objects to ObjectHash
v0.10.0 (6th July 2011)
support multiple receivers within a single pass over a source file
massive time saving when dealing with multiple receivers
v0.9.3 (2nd July 2011)
add PDF::Reader::Reference#hash method
improves behaviour of Reference objects when they're used as Hash keys
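Why a `#hash` method matters for Hash keys can be shown with a small stand-in class. This is a sketch of the general Ruby technique, not the library's code: `Ref` and its `id`/`gen` attributes are hypothetical names for this example.

```ruby
# Hypothetical stand-in for an indirect object reference (object id plus
# generation number). Not the real PDF::Reader::Reference class.
class Ref
  attr_reader :id, :gen

  def initialize(id, gen)
    @id = id
    @gen = gen
  end

  def ==(other)
    other.is_a?(Ref) && id == other.id && gen == other.gen
  end

  # Hash lookups use eql? and hash, so both must agree with ==.
  alias eql? ==

  def hash
    [id, gen].hash
  end
end

cache = { Ref.new(7, 0) => "page object" }

# A second, equal reference finds the same entry.
puts cache[Ref.new(7, 0)]
```

Without `hash` and `eql?`, two references to the same object would land in different Hash buckets and the lookup would return nil.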
v0.9.2 (24th April 2011)
add basic support for fonts with Identity-V encoding.
bug: improve robustness of text extraction
thanks to Evan Arnold for reporting
bug: fix loading of nested resources on XObjects
thanks to Samuel Williams for reporting
bug: improve parsing of files with XRef object streams
v0.9.1 (21st December 2010)
force gem to only install on ruby 1.8.7 or higher
maintaining support for earlier versions takes more time than I have available at the moment
bug: fix parsing of obscure pdf name format
bug: fix behaviour when loaded in conjunction with htmldoc gem
v0.9.0 (19th November 2010)
support for pdf 1.5+ files that use object and xref streams
support streams that use a flate filter with the predictor option
ensure all content instructions are parsed when split over multiple streams
thanks to Jack Rusher for reporting
Various string parsing bugs
some character conversions to utf-8 were failing (thanks Andrea Barisani)
hashes with nested hex strings were tokenising wrongly (thanks Evan Arnold)
escaping bug in tokenising of literal strings (thanks David Westerink)
Fix a bug that prevented PDFs with white space after the EOF marker from loading
thanks to Solomon White for reporting the issue
Add support for de-filtering some LZW compressed streams
thanks to Jose Ignacio Rubio Iradi for the patch
some small speed improvements
API CHANGE: PDF::Hash renamed to PDF::Reader::ObjectHash
having a class named Hash was confusing for users
v0.8.6 (27th August 2010)
new method: hash#page_references
returns references to all page objects, gives rapid access to objects for a given page
v0.8.5 (11th April 2010)
fix a regression introduced in 0.8.4.
Parameters passed to resource_font callback were inadvertently changed
v0.8.4 (30th March 2010)
fix parsing of files that use Form XObjects
thanks to Andrea Barisani for reporting the issue
fix two issues that caused a small number of characters to convert to Unicode incorrectly
thanks to Andrea Barisani for reporting the issue
require 'pdf-reader' now works as well as 'pdf/reader'
good practice to have the require file match the gem name
thanks to Chris O'Meara for highlighting this
v0.8.3 (14th February 2010)
Fix a bug in tokenising of hex strings inside dictionaries
Thanks to Brad Ediger for detecting the issue and proposing a solution
v0.8.2 (1st January 2010)
Fix parsing of files that use Form XObjects behind an indirect reference (thanks Cornelius Illi and Patrick Crosby)
Rewrote Buffer class to fix various speed issues reported over the years
On my sample file, extracting the full text dropped from 220 seconds to 9 seconds.
v0.8.1 (27th November 2009)
Added PDF::Hash#version. Provides access to the source file PDF version
v0.8.0 (20th November 2009)
Added PDF::Hash. It provides direct access to objects from a PDF file with an API that emulates the standard Ruby hash
v0.7.7 (11th September 2009)
Trigger callbacks contained in Form XObjects when we encounter them in a content stream
Fix inheritance of page resources to comply with section 3.6.2
v0.7.6 (28th August 2009)
Various bug fixes that increase the files we can successfully parse
Treat float and integer tokens differently (thanks Neil)
Correctly handle PDFs where the Kids element of a Pages dict is an indirect reference (thanks Rob Holland)
Fix conversion of PDF strings to Ruby strings on 1.8.6 (thanks Andrès Koetsier)
Fix decoding with ASCII85 and ASCIIHex filters (thanks Andrès Koetsier)
Fix extracting inline images from content streams (thanks Andrès Koetsier)
Fix extracting [ ] from content streams (thanks Christian Rishøj)
Fix conversion of text to UTF8 when the cmap uses bfrange (thanks Federico Gonzalez Lutteroth)
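The bfrange fix above concerns CMap entries of the form `<srcLo> <srcHi> <dstStart>`, which map a run of character codes to consecutive Unicode code points. A minimal sketch of that expansion follows; `expand_bfrange` is a hypothetical name, and the array form of bfrange (an explicit list of destinations) is deliberately left out.

```ruby
# Sketch of expanding one simple bfrange entry into a code -> code point map.
# expand_bfrange is a hypothetical helper, not part of the pdf-reader API.
def expand_bfrange(src_lo, src_hi, dst_start)
  (src_lo..src_hi).each_with_index.to_h do |code, offset|
    [code, dst_start + offset]
  end
end

# Codes 0x01..0x03 map to U+0041, U+0042, U+0043.
mapping = expand_bfrange(0x01, 0x03, 0x0041)

# Translate a sequence of character codes into a UTF-8 string.
puts [0x01, 0x02, 0x03].map { |code| mapping.fetch(code) }.pack("U*")
```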
v0.7.5 (27th August 2008)
Fix a 1.8.7ism
v0.7.4 (7th August 2008)
Raise a MalformedPDFError if a content stream contains an unterminated string
Fix a bug that was causing an endless loop on some OSX systems
valid strings were incorrectly thought to be unterminated
thanks to Jeff Webb for playing email ping pong with me as I tracked this issue down
v0.7.3 (11th June 2008)
Add a high level way to get direct access to a PDF object, including a new executable: pdf_object
Fix a hard loop bug caused by a content stream that is missing a final operator
Significantly simplified the internal code for encoding conversions
Fixes YACC parsing bug that occurs on Fedora 8's ruby VM
New callbacks
page_count
pdf_version
Fix a bug that prevented a font's BaseFont from being recorded correctly
v0.7.2 (20th May 2008)
Throw an UnsupportedFeatureError if we try to open an encrypted/secure PDF
Correctly handle page content instruction sets with trailing whitespace
Represent PDF Streams with a new object, PDF::Reader::Stream
there really wasn't any point in separating the stream content from its associated dict. You need both parts to correctly interpret the content
v0.7.1 (6th May 2008)
Non-page strings (i.e. metadata, etc.) are now converted to UTF-8 more accurately
Fixed a regression between 0.6.2 and 0.7 that prevented difference tables from being applied correctly when translating text into UTF-8
v0.7 (6th May 2008)
API INCOMPATIBLE CHANGE: any hashes that are passed to callbacks use symbols as keys instead of PDF::Reader::Name instances.
Improved support for converting text in some PDF files to unicode
Behave as expected if the Contents key in a Page Dict is a reference
Include some basic metadata callbacks
Don't interpret a comment token (%) inside a string as a comment
Small fixes to improve 1.9 compatibility
Improved our Zlib deflating to make it slightly more robust - still some more issues to work out though
Throw an UnsupportedFeatureError if a pdf that uses XRef streams is opened
Added an option to PDF::Reader#file and PDF::Reader#string to enable parsing of only parts of a PDF file (i.e. only metadata, etc.)
v0.6.2 (22nd March 2008)
Catch low level errors when applying filters to a content stream and raise a MalformedPDFError instead.
Added support for processing inline images
Support for parsing XRef tables that have multiple subsections
Added a few callbacks to improve the way we supply information on page resources
Ignore whitespace in hex strings, as required by the spec (section 3.2.3)
Use our “unknown character box” when a single character in an Identity-H string fails to decode
Support ToUnicode CMaps that use the bfrange operator
Tweaked tokenising code to ensure whitespace doesn't get in the way
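The hex string rule mentioned above (whitespace inside `<...>` is ignored) can be sketched in a few lines. This is an illustration only: `decode_hex_string` is a hypothetical name, and the padding of an odd number of digits with a trailing zero follows the PDF specification rather than anything stated in this changelog.

```ruby
# Sketch of decoding a PDF hex string token such as "<48 65 6C 6C 6F>".
# decode_hex_string is a hypothetical helper, not a pdf-reader method.
def decode_hex_string(token)
  # Drop the angle brackets and any whitespace between digits.
  digits = token.delete("<>").gsub(/\s+/, "")

  # Per the PDF spec, a missing final digit is assumed to be 0.
  digits += "0" if digits.length.odd?

  [digits].pack("H*")
end

puts decode_hex_string("<48 65 6C\n6C 6F>")
```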
v0.6.1 (12th March 2008)
Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We just replace each character with a little box.
Use the same little box when invalid characters are found in other encodings instead of throwing an ugly NoMethodError.
Added a method to RegisterReceiver that returns all occurrences of a callback
v0.6.0 (27th February 2008)
all text is now transparently converted to UTF-8 before being passed to the callbacks. Before this version, text was passed as a byte-level copy of what was in the PDF file, which was mildly annoying with some encodings and produced garbled output for Unicode-encoded text.
Fonts that use a difference table are now handled correctly
fixed some 1.9 incompatible syntax
expanded RegisterReceiver class to record extra info
expanded rspec coverage
tweaked a README example
v0.5.1 (1st January 2008)
Several documentation tweaks
Improve support for parsing PDFs under windows (thanks to Jari Williamsson)
v0.5 (14th December 2007)
Initial Release