Search Text with Rectangle

Hello,

We are using the Aspose PDF Java library 21.3 to get word coordinates in a PDF document.

We first use ParagraphAbsorber to find each MarkupParagraph and take its Point[] polygon to build a Rectangle for the paragraph. We then use TextFragmentAbsorber with TextSearchOptions to search for a word within that rectangle, which returns a TextFragmentCollection of the matches inside it.

However, the returned TextFragmentCollection is sometimes empty even though the word is present in the rectangular area (the paragraph), and I have to move the lower-left and upper-right points further out before the word is found. Yet once the word is found, its coordinates seem to lie within the original rectangle, with no enlargement needed.

Is this a bug?

Enlarging the rectangle also affects words that can be found without any enlargement: their coordinates seem to shift along with the rectangle size, which does not make sense to me.

Thanks for your help on this matter.

@rye3000

Can you please share more details, including a sample file, code, and screenshots of the issue, so that we can try to reproduce it on our end?

Using the following code, we can obtain the coordinate for a paragraph:
for (PageMarkup markup : paragraphAbsorber.getPageMarkups()) {
    for (MarkupSection section : markup.getSections()) {
        for (MarkupParagraph paragraph : section.getParagraphs()) {
            var coordinate = calculateCoordinate(paragraph);
        }
    }
}

And using the following code, we can obtain the coordinate for a word:
var tokenFragmentAbsorber = new TextFragmentAbsorber(tokenText);
var tokenSearchOptions = new TextSearchOptions(false);
tokenFragmentAbsorber.setTextSearchOptions(tokenSearchOptions);
pageObject.accept(tokenFragmentAbsorber);

TextFragmentCollection tokenCollection = tokenFragmentAbsorber.getTextFragments();

For the sample document, on the second page, take this paragraph:

1. ANONYMIZATION METHODOLOGY…3

The coordinate for this paragraph is
{
    "lowerLeftX" : 72.024,
    "lowerLeftY" : 679.07536000824,
    "upperRightX" : 539.697279980659,
    "upperRightY" : 691.219359966278
}

However, if we search for '.' and obtain the coordinates of a match (there are many of them), we get
LLX: 276.05,
LLY: 678.700000009537,
URX: 278.832079990387,
URY: 690.843999967575

As can be seen, the LLY is outside the rectangular area of the paragraph. Most of the '.' text fragments have the same issue.

Why does this happen? It does not make sense that a text fragment inside a paragraph falls outside the paragraph's rectangle.
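
A note on the figures above: both rectangles are about 12.144 points tall, and the fragment's box sits roughly 0.375 points lower than the paragraph's, so a strict containment test fails only on the lower edge. The sketch below is plain Java and does not use any Aspose API; the Box class, the encloses method, and the 0.5-point tolerance are assumptions made for illustration only.

```java
// Hypothetical helper: an axis-aligned box given by its lower-left and upper-right corners.
final class Box {
    final double llx, lly, urx, ury;

    Box(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    // True if 'other' lies inside this box once every edge is relaxed by 'tolerance' points.
    boolean encloses(Box other, double tolerance) {
        return llx - tolerance <= other.llx
                && lly - tolerance <= other.lly
                && urx + tolerance >= other.urx
                && ury + tolerance >= other.ury;
    }
}

final class ToleranceDemo {
    // Coordinates copied from the post above.
    static final Box PARAGRAPH = new Box(72.024, 679.07536000824, 539.697279980659, 691.219359966278);
    static final Box PERIOD = new Box(276.05, 678.700000009537, 278.832079990387, 690.843999967575);

    public static void main(String[] args) {
        System.out.println("strict:    " + PARAGRAPH.encloses(PERIOD, 0.0)); // false
        System.out.println("tolerance: " + PARAGRAPH.encloses(PERIOD, 0.5)); // true
    }
}
```

A tolerance only hides the mismatch on the caller's side; it does not explain why the two absorbers report different baselines.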

testExtraction.pdf (402.5 KB)

@rye3000

This code does not contain the definition of the calculateCoordinate(paragraph) method. Please share runnable code so that we can reproduce the issue on our end.

calculateCoordinate is as follows:
private Coordinate calculateCoordinate(MarkupParagraph paragraph) {
    Point[] points = paragraph.getPoints();
    if (points != null && points.length > 0) {
        // Have to increase rectangle to search the token
        // A question/bug is raised to Aspose
        var llx = Arrays.stream(points).mapToDouble(Point::getX).min().getAsDouble();
        var lly = Arrays.stream(points).mapToDouble(Point::getY).min().getAsDouble();
        var urx = Arrays.stream(points).mapToDouble(Point::getX).max().getAsDouble();
        var ury = Arrays.stream(points).mapToDouble(Point::getY).max().getAsDouble();
        return new Coordinate(llx, lly, urx, ury);
    }

    return new Coordinate(0, 0, 0, 0);
}
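
The comment in calculateCoordinate mentions enlarging the rectangle before searching. A minimal sketch of making that enlargement an explicit parameter, assuming the polygon is available as plain x/y pairs: the names PaddedBounds, boundingBox, and padding are hypothetical, the result is a plain double[] rather than the Coordinate class, and no Aspose types are involved.

```java
import java.util.Arrays;

final class PaddedBounds {
    // Returns {llx, lly, urx, ury} of the axis-aligned box around 'points',
    // grown by 'padding' on every side. Each point is an {x, y} pair.
    // Mirrors the original fallback: an empty polygon yields an all-zero box.
    static double[] boundingBox(double[][] points, double padding) {
        if (points == null || points.length == 0) {
            return new double[] {0, 0, 0, 0};
        }
        double llx = Arrays.stream(points).mapToDouble(p -> p[0]).min().getAsDouble();
        double lly = Arrays.stream(points).mapToDouble(p -> p[1]).min().getAsDouble();
        double urx = Arrays.stream(points).mapToDouble(p -> p[0]).max().getAsDouble();
        double ury = Arrays.stream(points).mapToDouble(p -> p[1]).max().getAsDouble();
        return new double[] {llx - padding, lly - padding, urx + padding, ury + padding};
    }

    public static void main(String[] args) {
        // Illustrative rectangle, not taken from the sample PDF.
        double[][] polygon = {{72, 679}, {540, 679}, {540, 691}, {72, 691}};
        System.out.println(Arrays.toString(boundingBox(polygon, 0.5)));
        // prints [71.5, 678.5, 540.5, 691.5]
    }
}
```

Keeping the padding as an argument makes it easy to see how much enlargement is needed before a fragment is matched.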

@rye3000

Please share complete, runnable code; the definition of the Coordinate class is still missing.

import java.util.Objects;

public class Coordinate implements Comparable<Coordinate> {
    private final double lowerLeftX;
    private final double lowerLeftY;
    private final double upperRightX;
    private final double upperRightY;

    public Coordinate(double lowerLeftX, double lowerLeftY, double upperRightX, double upperRightY) {
        this.lowerLeftX = lowerLeftX;
        this.lowerLeftY = lowerLeftY;
        this.upperRightX = upperRightX;
        this.upperRightY = upperRightY;
    }

    public double getLowerLeftX() {
        return this.lowerLeftX;
    }

    public double getLowerLeftY() {
        return this.lowerLeftY;
    }

    public double getUpperRightX() {
        return this.upperRightX;
    }

    public double getUpperRightY() {
        return this.upperRightY;
    }

    // True if the given coordinate lies entirely within this one (edges inclusive).
    public boolean enclose(Coordinate coordinate) {
        return this.lowerLeftX <= coordinate.lowerLeftX
                && this.lowerLeftY <= coordinate.lowerLeftY
                && this.upperRightX >= coordinate.upperRightX
                && this.upperRightY >= coordinate.upperRightY;
    }

    @Override
    public String toString() {
        return "(" + this.lowerLeftX + ", " + this.lowerLeftY + "), (" + this.upperRightX + ", " + this.upperRightY + ")";
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Coordinate that = (Coordinate) o;
        return Double.compare(that.lowerLeftX, lowerLeftX) == 0
                && Double.compare(that.lowerLeftY, lowerLeftY) == 0
                && Double.compare(that.upperRightX, upperRightX) == 0
                && Double.compare(that.upperRightY, upperRightY) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(lowerLeftX, lowerLeftY, upperRightX, upperRightY);
    }

    // Orders by upperRightY ascending, then by upperRightX descending.
    @Override
    public int compareTo(Coordinate coordinate) {
        int ret = Double.compare(upperRightY, coordinate.upperRightY);
        if (ret == 0) {
            return Double.compare(coordinate.upperRightX, upperRightX);
        }
        return ret;
    }
}
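
The enclose method is a strict test, so a fragment that pokes out by a fraction of a point is rejected. As an alternative sketch, one could measure how much of the fragment's area lies inside the paragraph and accept it above some threshold. This is plain Java under stated assumptions: the OverlapRatio class, the coveredFraction method, and the {llx, lly, urx, ury} array layout are hypothetical and are not part of the Aspose API or of the code posted in this thread.

```java
final class OverlapRatio {
    // Each box is {llx, lly, urx, ury}.
    // Returns the share of 'inner' area that lies inside 'outer', between 0.0 and 1.0.
    static double coveredFraction(double[] outer, double[] inner) {
        double width = Math.min(outer[2], inner[2]) - Math.max(outer[0], inner[0]);
        double height = Math.min(outer[3], inner[3]) - Math.max(outer[1], inner[1]);
        double innerArea = (inner[2] - inner[0]) * (inner[3] - inner[1]);
        if (width <= 0 || height <= 0 || innerArea <= 0) {
            return 0.0; // no overlap, or a degenerate inner box
        }
        return (width * height) / innerArea;
    }

    public static void main(String[] args) {
        // Paragraph and '.' fragment coordinates quoted earlier in the thread.
        double[] paragraph = {72.024, 679.07536000824, 539.697279980659, 691.219359966278};
        double[] period = {276.05, 678.700000009537, 278.832079990387, 690.843999967575};
        System.out.println(coveredFraction(paragraph, period));
    }
}
```

For the coordinates quoted in this thread, roughly 97% of the fragment's box falls inside the paragraph's box, so a threshold such as 0.9 (an arbitrary choice) would accept it.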

@mudassir.fayyaz
Hi, I am trying to give a simple code snippet to illustrate the problem. You don't really need my code to identify it: I gave you the coordinates of the paragraph and of the text fragment, which you can obtain from the PDF document using the Aspose API directly. The problem is that a text fragment that is within the paragraph has coordinates outside the paragraph's area.

@rye3000

I tried to find the X and Y point of the first paragraph on the 3rd page with the code below:

Document doc = new Document(dataDir + "testExtraction.pdf");
Page page = doc.getPages().get_Item(3);
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.visit(page);
for (PageMarkup markup : absorber.getPageMarkups()) {
    for (MarkupSection section : markup.getSections()) {
        for (MarkupParagraph paragraph : section.getParagraphs()) {
            DrawPolygonOnPage(paragraph.getPoints(), page);
            System.out.println(paragraph.getText());
            com.aspose.pdf.Point[] polygon = paragraph.getPoints();
            System.out.println("X:" + polygon[0].getX());
            System.out.println("Y:" + polygon[0].getY());
        }
    }
}
doc.save(dataDir + "output_out.pdf");

It returns:

1. ANONYMIZATION METHODOLOGY
X:72.024
Y:703.8673600082398

You can find the definition of the DrawPolygonOnPage method in the documentation article "Extract Text from PDF document in Paragraphs form".

Then I searched for the period on the 3rd page with this code:

Document doc = new Document(dataDir + "testExtraction.pdf");
Page pageObject = doc.getPages().get_Item(3); 
TextFragmentAbsorber tokenFragmentAbsorber = new TextFragmentAbsorber(".");
TextSearchOptions tokenSearchOptions = new TextSearchOptions(false);
tokenFragmentAbsorber.setTextSearchOptions(tokenSearchOptions);
pageObject.accept(tokenFragmentAbsorber);

TextFragmentCollection tokenCollection = tokenFragmentAbsorber.getTextFragments();
for (TextFragment fragment : tokenCollection)
{
    System.out.println(fragment.getText());
    System.out.println(fragment.getRectangle());
}

It prints:

 .
79.104,703.86736000824,82.6139999904633,719.311359966278

Now please explain your issue a little more, with screenshots, so that we can help you accordingly.

Please use the second page's paragraph, as I mentioned, so that we can compare. It should return 4 points as the polygon, matching the coordinates I gave at the beginning of this thread.

The problem is also stated at the beginning of this thread: the text fragment's coordinates are outside the paragraph's.

@rye3000

I have been able to reproduce the issue on our end. A ticket with ID PDFJAVA-40613 has been created in our issue tracking system to investigate it further. This thread has been linked to the ticket so that you will be notified once the issue is fixed.