Hi,
I am trying to extract the individual words from a text PDF, together with the coordinates of each word, for further configuration. With the snippet below, an entire line is extracted as a single fragment. Is there a way to extract each word and its coordinates separately? If so, please let us know how.
// The code snippet
Document pdfDocument = new Document("D:\\Files\\testpdf.pdf");
// Absorber with no search phrase: collects all text on the pages
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
pdfDocument.getPages().accept(textFragmentAbsorber);
TextFragmentCollection fragmentsCollection = textFragmentAbsorber.getTextFragments();
for (TextFragment textFragment : fragmentsCollection) {
    // Print each fragment's text followed by its bounding rectangle
    System.out.println("Fragment: " + textFragment.getText() + " " + textFragment.getRectangle());
    for (TextSegment textSegment : textFragment.getSegments()) {
        System.out.println("TextSegment :- " + textSegment.getText());
    }
}
Output of the above code for the attached PDF:
Fragment: IMPORTANT NOTICE TO OUR POLICYHOLDERS THANK YOU FOR RENEWING YOUR 56.76,721.492910122742,526.376316429464,734.667912506927
TextSegment :- IMPORTANT NOTICE TO OUR POLICYHOLDERS THANK YOU FOR RENEWING YOUR
testpdf.pdf (85.7 KB)
As you can see, both the text fragments and the text segments yield the same result: the whole line rather than individual words.
Thank you.
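For reference, the only workaround I can think of without library support is to split the line text myself and estimate each word's box from the line rectangle. Below is a rough sketch of that idea in plain Java. The class and method names (WordBoxEstimator, WordBox, estimate) are made up for illustration and are not part of any library. It assumes the four numbers printed after the text are the lower-left x, lower-left y, upper-right x and upper-right y of the line, and that every character has the same width. That second assumption does not hold for proportional fonts, so the horizontal positions are only approximate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a library API): estimates per-word boxes from a line's box.
class WordBoxEstimator {

    // A word with its estimated rectangle; the y-bounds are copied from the line.
    static final class WordBox {
        final String text;
        final double llx, lly, urx, ury;

        WordBox(String text, double llx, double lly, double urx, double ury) {
            this.text = text;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        @Override
        public String toString() {
            return text + " " + llx + "," + lly + "," + urx + "," + ury;
        }
    }

    // Splits the line on whitespace and gives each word a share of the line width
    // proportional to its character offsets (uniform character width assumed).
    static List<WordBox> estimate(String line, double llx, double lly, double urx, double ury) {
        List<WordBox> boxes = new ArrayList<>();
        if (line.isEmpty()) {
            return boxes;
        }
        double charWidth = (urx - llx) / line.length();
        int i = 0;
        while (i < line.length()) {
            // Skip whitespace between words
            while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
                i++;
            }
            int start = i;
            // Advance to the end of the current word
            while (i < line.length() && !Character.isWhitespace(line.charAt(i))) {
                i++;
            }
            if (i > start) {
                boxes.add(new WordBox(line.substring(start, i),
                        llx + start * charWidth, lly, llx + i * charWidth, ury));
            }
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Line text and rectangle taken from the output above
        String line = "IMPORTANT NOTICE TO OUR POLICYHOLDERS THANK YOU FOR RENEWING YOUR";
        for (WordBox box : estimate(line, 56.76, 721.492910122742,
                526.376316429464, 734.667912506927)) {
            System.out.println(box);
        }
    }
}
```

This is only an estimate, so I would still prefer a way to get the real per-word rectangles from the library itself.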