Extract individual words and their coordinates from a text PDF

karthi988 · January 27, 2021, 9:45am

Hi,
I am trying to extract the individual words from a text PDF, along with the coordinates of each word, for further configuration. When I run the snippet below, each entire line is extracted as a single fragment. Is there any way to extract each word and its coordinates? If so, please let us know how.

// the code snippet
import com.aspose.pdf.*;

// Load the PDF (backslashes in a Java string literal must be escaped)
Document pdfDocument = new Document("D:\\Files\\testpdf.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
pdfDocument.getPages().accept(textFragmentAbsorber);
TextFragmentCollection fragmentsCollection = textFragmentAbsorber.getTextFragments();
for (TextFragment textFragment : fragmentsCollection) {
    // Print each fragment's text followed by its bounding rectangle
    System.out.println("Fragment: " + textFragment.getText() + " " + textFragment.getRectangle());
    for (TextSegment textSegment : textFragment.getSegments()) {
        System.out.println("TextSegment :- " + textSegment.getText());
    }
}

Output of the above code when run on the attached PDF:
Fragment: IMPORTANT NOTICE TO OUR POLICYHOLDERS THANK YOU FOR RENEWING YOUR 56.76,721.492910122742,526.376316429464,734.667912506927
TextSegment :- IMPORTANT NOTICE TO OUR POLICYHOLDERS THANK YOU FOR RENEWING YOUR

testpdf.pdf (85.7 KB)

As you can see, both the text fragments and the text segments yield the same result: the whole line rather than individual words.
Thank you.

karthi988 · January 27, 2021, 11:40am

Hi,
I have also tried the following snippet in addition to the previous one:

for (CharInfo charInfo : textSegment.getCharacters()) {
    System.out.println(charInfo.getPosition());
}

This code gives only the coordinates of each character in the PDF, not the character itself.
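If per-character geometry is all that can be obtained, one possible workaround is to rebuild word boxes by hand. The following is only a minimal sketch in plain Java with no Aspose calls: `WordBoxes`, `Word`, and `group` are hypothetical names, and it assumes that one box `{llx, lly, urx, ury}` is available for every character of the segment text, in the same order as the text, including spaces. Whether the library's character collection lines up with the text this way is an assumption that would need to be checked against the actual PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: groups per-character boxes into per-word boxes.
class WordBoxes {

    /** A word plus the union of its character boxes (lower-left x/y, upper-right x/y). */
    static final class Word {
        final String text;
        final double llx, lly, urx, ury;

        Word(String text, double llx, double lly, double urx, double ury) {
            this.text = text;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        @Override
        public String toString() {
            return text + " [" + llx + "," + lly + "," + urx + "," + ury + "]";
        }
    }

    /**
     * Splits text on whitespace and unions the boxes of each word's characters.
     * ASSUMPTION: boxes[i] = {llx, lly, urx, ury} describes text.charAt(i).
     */
    static List<Word> group(String text, double[][] boxes) {
        if (text.length() != boxes.length) {
            throw new IllegalArgumentException("expected one box per character");
        }
        List<Word> words = new ArrayList<>();
        int start = -1; // index of the current word's first character, or -1 between words
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                words.add(union(text.substring(start, i), boxes, start, i));
                start = -1;
            }
        }
        return words;
    }

    // Smallest rectangle that encloses boxes[from] .. boxes[to - 1].
    private static Word union(String word, double[][] boxes, int from, int to) {
        double llx = Double.POSITIVE_INFINITY, lly = Double.POSITIVE_INFINITY;
        double urx = Double.NEGATIVE_INFINITY, ury = Double.NEGATIVE_INFINITY;
        for (int i = from; i < to; i++) {
            llx = Math.min(llx, boxes[i][0]);
            lly = Math.min(lly, boxes[i][1]);
            urx = Math.max(urx, boxes[i][2]);
            ury = Math.max(ury, boxes[i][3]);
        }
        return new Word(word, llx, lly, urx, ury);
    }

    public static void main(String[] args) {
        // Made-up geometry for illustration: every character is 10 units wide and 12 high.
        String text = "THANK YOU";
        double[][] boxes = new double[text.length()][];
        for (int i = 0; i < text.length(); i++) {
            boxes[i] = new double[] {50 + 10.0 * i, 700, 60 + 10.0 * i, 712};
        }
        for (Word w : group(text, boxes)) {
            System.out.println(w);
        }
    }
}
```

In real use the `boxes` array would be filled from whatever per-character coordinates the library returns, and the text would come from `textSegment.getText()`. The length check is there so that a mismatch between the two fails loudly instead of producing shifted word boxes.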

asad.ali · January 27, 2021, 9:05pm

@karthi988

We were able to reproduce similar behavior of the API in our environment while testing the scenario with Aspose.PDF for Java 21.1. Therefore, we have logged an investigation ticket, PDFJAVA-40102, in our issue tracking system. We will look further into the details and keep you posted on the status of its resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.