I am trying to get text information for each word in a PDF. I am using the trial version of java.pdf.kit. The word count reported by the extractor is larger than the number of segments I get back. Shouldn't I get one TextSegment per word?
Here is the code:
try {
    // how many pages are there?
    PdfFileInfo fileInfo = new PdfFileInfo(path + "text.pdf");
    System.out.println("page count " + fileInfo.getNumberofPages());

    PdfExtractor extractor = new PdfExtractor();
    extractor.bindPdf(path + "text.pdf");
    // extractor.setEndPage(page);
    // extractor.setStartPage(page);
    extractor.extractText();

    TextSegment[] segments = extractor.getFormattedText();
    System.out.println("word count " + extractor.getWordCount());
    System.out.println("number of segments " + segments.length);

    // print the text and formatting of every segment
    for (TextSegment text : segments) {
        System.out.println(text.getText());
        System.out.println(text.getFontName());
        System.out.println(text.getFontSize());
        System.out.println(text.getTextColor().toString());
        System.out.println(text.getX());
        System.out.println(text.getY());
    }
    // extractor.getText(path + "text.txt");
} catch (java.io.IOException ioe) {
    // print the message, then the full stack trace
    // (concatenating getStackTrace() would only print the array reference)
    System.out.println(ioe.getMessage());
    ioe.printStackTrace();
}
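One way to check whether each segment holds several words is to split every segment's text on whitespace and compare the total with the reported word count. Below is a minimal sketch of that check. SegmentWords, splitWords and countWords are hypothetical helper names, not part of the library, and the sample strings are made-up stand-ins for the values returned by text.getText().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the PDF library.
final class SegmentWords {

    // Splits one segment's text into whitespace-separated words.
    static List<String> splitWords(String segmentText) {
        List<String> words = new ArrayList<>();
        if (segmentText == null) {
            return words;
        }
        String trimmed = segmentText.trim();
        if (trimmed.isEmpty()) {
            return words;
        }
        for (String word : trimmed.split("\\s+")) {
            words.add(word);
        }
        return words;
    }

    // Sums the per-segment word counts over all segment texts.
    static int countWords(String[] segmentTexts) {
        int total = 0;
        for (String text : segmentTexts) {
            total += splitWords(text).size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for text.getText() results.
        String[] sample = { "Hello world", "  PDF text  ", "" };
        System.out.println("segments " + sample.length);
        System.out.println("words " + countWords(sample));
    }
}
```

In the loop above, the text.getText() values could be collected into an array and passed to countWords. If that total comes close to getWordCount(), it would suggest that a segment is a run of text rather than a single word. That is an assumption to verify against the library documentation.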