PdfExtractor TextSegments - one word per segment?

I am trying to get text information for each word in a PDF. I am using the trial version of java.pdf.kit. The word count reported by the extractor is larger than the number of segments it returns. Shouldn’t I get one TextSegment per word?


Here is the code:

try {
    // how many pages are there?
    PdfFileInfo fileInfo = new PdfFileInfo(path + "text.pdf");

    System.out.println("page count " + fileInfo.getNumberofPages());

    PdfExtractor extractor = new PdfExtractor();
    extractor.bindPdf(path + "text.pdf");

    // extractor.setEndPage(page);
    // extractor.setStartPage(page);

    extractor.extractText();
    TextSegment[] segments = extractor.getFormattedText();
    System.out.println("word count " + extractor.getWordCount());
    System.out.println("number of segments " + segments.length);

    // print the text and formatting details of every segment
    for (TextSegment text : segments) {
        System.out.println(text.getText());
        System.out.println(text.getFontName());
        System.out.println(text.getFontSize());
        System.out.println(text.getTextColor().toString());
        System.out.println(text.getX());
        System.out.println(text.getY());
    }

    // extractor.getText(path + "text.txt");

} catch (java.io.IOException ioe) {
    // print the message and the full stack trace
    ioe.printStackTrace();
}
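
To compare the two numbers, a segment that contains several words can be split on whitespace. The following is only a minimal sketch in plain Java: it does not call the PDF library at all, the class and method names (SegmentWords, splitWords, countWords) are hypothetical, and the sample strings stand in for whatever text.getText() returns for each segment.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentWords {

    // Splits the text of one segment into words on runs of whitespace.
    // Returns an empty list for null, empty, or whitespace-only text.
    static List<String> splitWords(String segmentText) {
        List<String> words = new ArrayList<>();
        if (segmentText == null) {
            return words;
        }
        String trimmed = segmentText.trim();
        if (trimmed.isEmpty()) {
            return words;
        }
        for (String word : trimmed.split("\\s+")) {
            words.add(word);
        }
        return words;
    }

    // Sums the word counts of all segment texts.
    static int countWords(String[] segmentTexts) {
        int total = 0;
        for (String text : segmentTexts) {
            total += splitWords(text).size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample data (assumption): stands in for the text of each segment.
        String[] segmentTexts = {"Hello world", "  one   two three ", ""};
        System.out.println("number of segments " + segmentTexts.length);
        System.out.println("word count " + countWords(segmentTexts));
    }
}
```

Splitting this way only recovers the words themselves. The font, color, and position would still be those of the whole segment, so it does not give per-word coordinates.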

Hi Dave,


Thanks for using our products.

I have tested the scenario and am able to reproduce the problem. To get it corrected, I have logged it in our issue tracking system as PDFKITJAVA-33243. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for the inconvenience.

We will be interested in using your product when this is fixed. Please keep us posted.

Hi Dave,


The development team will start investigating this issue shortly. As soon as we have made significant progress towards its resolution, we will update you in this forum thread. Please be patient and allow us a little time.

Your patience and understanding are greatly appreciated in this regard.