Class WordExtractor

All Implemented Interfaces:
Closeable, AutoCloseable

public final class WordExtractor extends POIOLE2TextExtractor
Extracts the text from a Word document. Use either getParagraphText() or getText() unless you have a strong reason to do otherwise.
Author:
Nick Burch
  • Constructor Details

    • WordExtractor

      public WordExtractor(InputStream is) throws IOException
      Create a new Word Extractor
      Parameters:
      is - InputStream containing the Word file
      Throws:
      IOException
    • WordExtractor

      public WordExtractor(POIFSFileSystem fs) throws IOException
      Create a new Word Extractor
      Parameters:
      fs - POIFSFileSystem containing the Word file
      Throws:
      IOException
    • WordExtractor

      public WordExtractor(DirectoryNode dir) throws IOException
      Create a new Word Extractor
      Parameters:
      dir - DirectoryNode containing the Word file
      Throws:
      IOException
    • WordExtractor

      public WordExtractor(HWPFDocument doc)
      Create a new Word Extractor
      Parameters:
      doc - The HWPFDocument to extract from
  • Method Details

    • main

      public static void main(String[] args) throws IOException
      Command-line extractor, so that the class can be run directly.
      Throws:
      IOException
    • getParagraphText

      public String[] getParagraphText()
      Get the text from the Word file as an array, with one String per paragraph.
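
Callers often want the paragraphs without their trailing line terminators. The following is a minimal sketch, assuming that each entry of the returned array may end in carriage-return or line-feed characters. `ParagraphCleaner` and `clean` are hypothetical names, not part of POI, and the sample array stands in for real getParagraphText() output.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphCleaner {

    // Strips trailing CR/LF characters from each entry and drops entries that
    // are empty afterwards. The trailing terminators are an assumption about
    // the input, which stands in for the array from getParagraphText().
    static List<String> clean(String[] paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String paragraph : paragraphs) {
            int end = paragraph.length();
            while (end > 0 && isLineTerminator(paragraph.charAt(end - 1))) {
                end--;
            }
            if (end > 0) {
                cleaned.add(paragraph.substring(0, end));
            }
        }
        return cleaned;
    }

    private static boolean isLineTerminator(char c) {
        return c == '\r' || c == '\n';
    }

    public static void main(String[] args) {
        // Sample data only; a real caller would pass extractor.getParagraphText().
        String[] sample = {"Title\r\n", "\r\n", "Body text\r\n"};
        for (String line : clean(sample)) {
            System.out.println(line);
        }
    }
}
```

With an open extractor, the call would be `ParagraphCleaner.clean(extractor.getParagraphText())`.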
    • getFootnoteText

      public String[] getFootnoteText()
    • getMainTextboxText

      public String[] getMainTextboxText()
    • getEndnoteText

      public String[] getEndnoteText()
    • getCommentsText

      public String[] getCommentsText()
    • getParagraphText

      protected static String[] getParagraphText(Range r)
    • getHeaderText

      @Deprecated public String getHeaderText()
      Deprecated.
      As of 3.8 beta 4.
      Grab the text from the headers.
    • getFooterText

      @Deprecated public String getFooterText()
      Deprecated.
      As of 3.8 beta 4.
      Grab the text from the footers.
    • getTextFromPieces

      public String getTextFromPieces()
      Grab the text directly from the text pieces. The result may include stray non-text content, but this works even when the text piece -> paragraph mapping is broken. It is also fast.
    • getText

      public String getText()
      Grab the text using the WordToTextConverter. The result should not include stray non-text content, but this is slower than getTextFromPieces().
      Specified by:
      getText in class POITextExtractor
      Returns:
      All the text from the document
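
One way to combine getText() with getTextFromPieces() is to prefer the cleaner result and fall back to the more tolerant one. This is a sketch only: it assumes that getText() signals a damaged document by throwing a RuntimeException, which this page does not state. `ExtractionFallback` is a hypothetical helper, not part of POI, and it takes plain suppliers so that it can run without a document.

```java
import java.util.function.Supplier;

final class ExtractionFallback {

    // Returns the primary result, or the fallback result if the primary
    // supplier throws a RuntimeException.
    static String extract(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for extractor::getText and extractor::getTextFromPieces.
        Supplier<String> failing = () -> {
            throw new IllegalStateException("paragraph mapping broken");
        };
        Supplier<String> pieces = () -> "raw text from pieces";
        System.out.println(extract(failing, pieces));
    }
}
```

With an open extractor, the call would be `ExtractionFallback.extract(extractor::getText, extractor::getTextFromPieces)`. The fallback text may still contain field markers, so it may be worth passing it through stripFields(String).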
    • stripFields

      public static String stripFields(String text)
      Removes any fields (e.g. macros, page markers, etc.) from the string.
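
To illustrate what removing a field involves, here is a simplified, standalone sketch. It is not POI's implementation and its output may differ from stripFields(String). It assumes the usual Word binary convention in which a field starts with character 0x13, separates its code from its result with 0x14, and ends with 0x15. The sketch drops the markers and the field code and keeps the result text. `FieldStripSketch` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldStripSketch {

    private static final char FIELD_BEGIN = (char) 0x13;
    private static final char FIELD_SEPARATOR = (char) 0x14;
    private static final char FIELD_END = (char) 0x15;

    // Removes field markers and field codes, keeping field results.
    // Unmatched separator or end markers are dropped silently.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while still inside its code part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int openCodeParts = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                openCodeParts++;
            } else if (c == FIELD_SEPARATOR) {
                if (!openFields.isEmpty() && openFields.peek()) {
                    // Switch the innermost field from code part to result part.
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    openCodeParts--;
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty() && openFields.pop()) {
                    openCodeParts--; // field ended without a result part
                }
            } else if (openCodeParts == 0) {
                // Plain text, or result text not nested inside any field code.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "Page " + FIELD_BEGIN + " PAGE " + FIELD_SEPARATOR + "1" + FIELD_END + " of 3";
        System.out.println(strip(sample));
    }
}
```

Tracking open fields on a stack lets the sketch handle a field nested inside another field's code, where the inner result must be discarded along with the outer code.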