Class PDFMarkedContentExtractor


  • public class PDFMarkedContentExtractor
    extends LegacyPDFStreamEngine
    This is a stream engine that extracts the marked content of a PDF.
    • Field Detail

      • suppressDuplicateOverlappingText

        private boolean suppressDuplicateOverlappingText
      • markedContents

        private final java.util.List<PDMarkedContent> markedContents
      • currentMarkedContents

        private final java.util.Deque<PDMarkedContent> currentMarkedContents
      • characterListMapping

        private final java.util.Map<java.lang.String, java.util.List<TextPosition>> characterListMapping
    • Constructor Detail

      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor()
                                  throws java.io.IOException
        Instantiates a new PDFMarkedContentExtractor object.
        Throws:
        java.io.IOException
      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor(java.lang.String encoding)
                                  throws java.io.IOException
        Constructor. Will apply encoding-specific conversions to the output text.
        Parameters:
        encoding - The encoding that the output will be written in.
        Throws:
        java.io.IOException
    • Method Detail

      • isSuppressDuplicateOverlappingText

        public boolean isSuppressDuplicateOverlappingText()
        Returns:
        the suppressDuplicateOverlappingText setting.
      • setSuppressDuplicateOverlappingText

        public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
        By default, the class attempts to remove duplicate text that overlaps other text. Word, for example, paints the same character several times to make it look bold. Setting this to false extracts all text, so some sections will be duplicated, but performance will be better.
        Parameters:
        suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
      • within

        private boolean within(float first,
                               float second,
                               float variance)
        This will determine whether two floating point numbers are within a specified variance of each other.
        Parameters:
        first - The first number to compare.
        second - The second number to compare.
        variance - The allowed variance.
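
        The check described above can be sketched as follows. This is a minimal illustration and not necessarily the library's implementation: the class name WithinSketch is hypothetical, and the use of strict comparisons (so that a difference exactly equal to the variance does not count as "within") is an assumption.

```java
public class WithinSketch {
    // Hypothetical stand-in for the private helper documented above.
    // Assumption: the bounds are exclusive, so |first - second| == variance is rejected.
    static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    public static void main(String[] args) {
        // 10.4 lies inside the open interval (9.5, 10.5)
        System.out.println(within(10.0f, 10.4f, 0.5f));
        // 11.0 lies outside the open interval (9.5, 10.5)
        System.out.println(within(10.0f, 11.0f, 0.5f));
    }
}
```

        A tolerance check of this kind lets two glyph positions be treated as "the same place" even when they differ by a small rendering offset.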
      • xobject

        public void xobject(PDXObject xobject)
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        Processes a TextPosition object and adds its text to the list of characters on a page, taking care of overlapping text.
        Overrides:
        processTextPosition in class LegacyPDFStreamEngine
        Parameters:
        text - The text to process.
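
        One way to picture the overlap handling is a map from each character string to the positions already seen for it, mirroring the shape of the characterListMapping field. The sketch below is an illustration under stated assumptions, not the library's code: OverlapSketch and Glyph are hypothetical names (Glyph stands in for TextPosition), and the fixed tolerance passed to the constructor is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverlapSketch {
    /** Hypothetical stand-in for TextPosition: a string plus its x/y position. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Positions already accepted, grouped by the text they draw.
    private final Map<String, List<Glyph>> seen = new HashMap<>();
    private final List<Glyph> accepted = new ArrayList<>();
    private final float tolerance; // assumption: one fixed tolerance for x and y

    OverlapSketch(float tolerance) {
        this.tolerance = tolerance;
    }

    static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    /** Returns true if the glyph was kept, false if it was suppressed as a duplicate. */
    boolean process(Glyph g) {
        List<Glyph> sameText = seen.computeIfAbsent(g.text, k -> new ArrayList<>());
        for (Glyph other : sameText) {
            // Same text drawn at (nearly) the same place: treat as a repaint and skip it.
            if (within(other.x, g.x, tolerance) && within(other.y, g.y, tolerance)) {
                return false;
            }
        }
        sameText.add(g);
        accepted.add(g);
        return true;
    }

    List<Glyph> accepted() {
        return accepted;
    }

    public static void main(String[] args) {
        OverlapSketch sketch = new OverlapSketch(0.5f);
        sketch.process(new Glyph("A", 10f, 20f));
        sketch.process(new Glyph("A", 10.2f, 20.1f)); // repaint of the first "A"
        sketch.process(new Glyph("B", 10f, 20f));     // different text, same place
        System.out.println(sketch.accepted().size());
    }
}
```

        Grouping by text keeps each lookup limited to earlier occurrences of the same string, so different characters drawn at the same position are never compared with each other.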
      • getMarkedContents

        public java.util.List<PDMarkedContent> getMarkedContents()