Aspose.Pdf: get text inside a drawn rectangle

karine_87 · February 12, 2015, 2:14am

Hello

We are using Aspose.Pdf 9.7.0 to get the highlighted text from annotations. The problem is that in some PDF files, when we highlight a specific piece of text, other text is highlighted as well (see the attached file input.pdf, where we try to highlight the text ET 71140175 but different paragraphs are also highlighted), so we get incorrect text when reading it from the annotations.
We are trying to work around this by using rectangles instead of highlight annotations, so that we can select exactly the text we need (see the attached file inputRect.pdf).
Can you please help us get the text inside the rectangle using Aspose.Pdf 9.7.0, and also the comment made for this rectangle?
Or do you have other ideas for overcoming the main problem?
Thank you,
Karine

tilal.ahmad · February 12, 2015, 9:57pm

Hi Karine,

Thanks for your inquiry. I have tested the scenario using your sample document with Aspose.Pdf for Java 9.7.1. The text “ET 71140175” occurs twice. The first occurrence (the rectangle box at the top of the page) is not recognized by either Aspose or Adobe. The second occurrence (at the bottom of the page) is found, but TextFragmentAbsorber returns wrong coordinates for it, so we have logged ticket PDFNEWJAVA-34711 in our issue tracking system for further investigation. We will notify you as soon as it is resolved.

Moreover, we would appreciate it if you could share your sample code that highlights the wrong text. It will help us to investigate and resolve the issue. I have used the following code sample for testing.

Document document = new Document(myDir + "inputRect.pdf");

com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber1 = new com.aspose.pdf.TextFragmentAbsorber("ET 71140175", new TextSearchOptions(true));

for (int cnt = 1; cnt <= document.getPages().size(); cnt++) {
    Page page = document.getPages().get_Item(cnt);
    page.accept(textFragmentAbsorber1);
}

TextFragmentCollection textFragmentCollection1 = textFragmentAbsorber1.getTextFragments();

for (int cnt1 = 1; cnt1 <= textFragmentCollection1.size(); cnt1++) {
    TextFragment textFragment = textFragmentCollection1.get_Item(cnt1);
    textFragment.getTextState().setForegroundColor(com.aspose.pdf.Color.getBlack());
    textFragment.getTextState().setBackgroundColor(com.aspose.pdf.Color.getLightBlue());

    com.aspose.pdf.Rectangle rect = new com.aspose.pdf.Rectangle(
        (float) textFragment.getPosition().getXIndent(),
        (float) textFragment.getPosition().getYIndent(),
        (float) textFragment.getPosition().getXIndent() + (float) textFragment.getRectangle().getWidth(),
        (float) textFragment.getPosition().getYIndent() + (float) textFragment.getRectangle().getHeight()
    );

    HighlightAnnotation highlight = new HighlightAnnotation(textFragment.getPage(),rect);
    highlight.setOpacity(.80);
    highlight.setBorder(new Border(highlight));
    highlight.setColor(com.aspose.pdf.Color.getLightBlue());
    textFragment.getPage().getAnnotations().add(highlight);
}

// save updated document - you can set your output file here
document.save(myDir + "output_highlight.pdf");

We are sorry for the inconvenience caused.

Best Regards,

karine_87 · February 13, 2015, 1:30am

Hello,
Thank you for your quick response.
Sorry for not being clear in my previous post.
I meant that we highlight the text “ET 71140175” using Adobe, not Aspose, and then use Aspose to get the highlighted text. The code is:

Zone zone = null;
Vector vZones = new Vector();
AnnotationCollection annots = page.getAnnotations();

for (int j = 1; j <= annots.size(); j++) {
    com.aspose.pdf.Annotation annot = annots.get_Item(j);
    if (annot instanceof HighlightAnnotation) {
        HighlightAnnotation linkAnno = (HighlightAnnotation) annot;
        com.aspose.pdf.Rectangle rect = linkAnno.getRect();
        rect.setLLY(rect.getLLY() - 1);
        if (annot.getName() != null) {
            TextAbsorber absorber = new TextAbsorber();
            absorber.getTextSearchOptions().setLimitToPageBounds(true);
            absorber.getTextSearchOptions().setRectangle(rect);
            page.accept(absorber);
            String text = absorber.getText();
            if (!text.trim().isEmpty() && !annot.getContents().isEmpty()) {
                zone = new Zone(
                    rect.getLLX(),
                    rect.getLLY(),
                    rect.getURX(),
                    rect.getURY(),
                    annot.getContents(),
                    page.getNumber());

                vZones.add(zone);
            }
        }
    }
}

But Adobe does not allow highlighting only the text “ET 71140175” (see document input.pdf: open the file in Adobe and try to select only “ET 71140175”, and you will see that other text is selected too). So when we use Aspose to return the highlighted text, we get other text along with ET 71140175.
I know that up to this point the problem comes from Adobe and not Aspose.
But I tried to work around it by drawing a rectangle in Adobe around the text “ET 71140175” instead of highlighting it (see file inputRect.pdf), and I need to use Aspose to get the text inside this rectangle.

Thank you in advance,
Karine

tilal.ahmad · February 13, 2015, 11:27am

Hi Karine,

Thanks for providing additional information. We have already noticed that the first occurrence of the text in question is not being recognized, and we have logged a ticket in our issue tracking system for further investigation. As soon as we complete the investigation, we will let you know our findings.

Meanwhile, we would appreciate it if you could share some details about how this document was created. It will help us in our investigation.
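
For reference, the rectangle approach itself can be sketched without any PDF library: keep only the text fragments whose bounding boxes fall inside the drawn rectangle, and pair the result with the rectangle's comment. This is a rough sketch, not a tested solution for your files. The class and method names (Box, Fragment, Zone, extract) are hypothetical and are not Aspose API. Presumably the box and comment would come from the annotation's getRect() and getContents(), and the fragment boxes from TextFragment.getRectangle(), but that mapping is an assumption. The tolerance plays the same role as the one-point adjustment of LLY in your code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free sketch: every name below is hypothetical, not Aspose API.
class RectangleTextSketch {

    // Axis-aligned box in PDF user space: lower-left (llx, lly), upper-right (urx, ury).
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        // True when 'other' lies inside this box, allowing 'tol' points of slack per side.
        boolean contains(Box other, double tol) {
            return other.llx >= llx - tol && other.lly >= lly - tol
                && other.urx <= urx + tol && other.ury <= ury + tol;
        }
    }

    // A piece of page text together with its bounding box.
    static final class Fragment {
        final String text;
        final Box box;

        Fragment(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    // Result: the drawn rectangle, its comment, and the text found inside it.
    static final class Zone {
        final Box box;
        final String comment;
        final String text;

        Zone(Box box, String comment, String text) {
            this.box = box;
            this.comment = comment;
            this.text = text;
        }
    }

    // Collects, in input order, the fragments fully inside the drawn rectangle.
    static Zone extract(Box drawn, String comment, List<Fragment> fragments, double tol) {
        List<String> inside = new ArrayList<>();
        for (Fragment f : fragments) {
            if (drawn.contains(f.box, tol)) {
                inside.add(f.text);
            }
        }
        return new Zone(drawn, comment, String.join(" ", inside));
    }

    public static void main(String[] args) {
        // Made-up coordinates for illustration only; not measured from inputRect.pdf.
        List<Fragment> fragments = Arrays.asList(
            new Fragment("ET 71140175", new Box(100, 700, 180, 712)),
            new Fragment("unrelated paragraph", new Box(100, 650, 400, 662)));
        Zone zone = extract(new Box(98, 698, 182, 714), "reviewer comment", fragments, 1.0);
        System.out.println(zone.comment + " -> " + zone.text);
    }
}
```

Full containment is used here rather than intersection, so a neighbouring paragraph that merely touches the rectangle is left out. Whether that is strict enough depends on how tightly the rectangle was drawn around the text.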

We are sorry for the inconvenience caused.

Best Regards,

karine_87 · February 16, 2015, 3:26am

Hello,
Sorry, I don’t have any information about how this document was created.
If I get any details, I will let you know.

Thank you,
Karine

tilal.ahmad · February 16, 2015, 10:36pm

Hi Karine,

Thanks for your feedback. Sure, please let us know if you find any details about how the problematic document was created. In the meantime, we will update you as soon as we make any progress towards resolving the issue.

Best Regards,

aspose.notifier · May 12, 2015, 1:53pm

The issues you have found earlier (filed as PDFNEWJAVA-34711) have been fixed in Aspose.Pdf for Java 10.3.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.