Hi team,
We are facing an issue while matching a source string against a target string. We obtain the target string from the Aspose utility by passing it our sample PDF file.
The source string does not match the target string.
Please find the code snippet and the sample source PDF file attached.
Code Snippet
Document pdfDocument = new Document("TestDocument.pdf");
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();
pdfText.contains("Citi Research is a division of Citigroup Global Markets Inc. (the \"Firm\"), which does and seeks to do business with companies covered in its research "
    + "reports. As a result, investors should be aware that the Firm may have a conflict of interest that could affect the objectivity of this report. Investors should "
    + "consider this report as only a single factor in making their investment decision. Certain products (not inconsistent with the author's published research) are "
    + "available only on Citi's portals.");
TestDocument.pdf (137.1 KB)
@forasposeissues
Please use the TextExtractionOptions to extract the content in Raw format like below:
// dataDir is the folder that contains the sample PDF
Document pdfDocument = new Document(dataDir + "TestDocument.pdf");
// Extract the text in Raw formatting mode
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();
pdfText.contains("Citi Research is a division of Citigroup Global Markets Inc. (the \"Firm\"), which does and seeks to do business with companies covered in its research "
    + "reports. As a result, investors should be aware that the Firm may have a conflict of interest that could affect the objectivity of this report. Investors should "
    + "consider this report as only a single factor in making their investment decision. Certain products (not inconsistent with the author's published research) are "
    + "available only on Citi's portals.");
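Independent of the extraction mode, a match like this can still fail when the extracted text breaks lines or spaces words differently from the expected string. The following is only a minimal sketch in plain Java, not part of the Aspose API: the class and method names (WhitespaceInsensitiveMatch, normalize, containsIgnoringWhitespace) are hypothetical, and it assumes that whitespace differences are the only mismatch between the two strings.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF: compares strings after
// collapsing whitespace, so line breaks in extracted text do not matter.
final class WhitespaceInsensitiveMatch {
    // Any run of whitespace; the no-break space (U+00A0) is listed explicitly
    // because the default \s class does not include it.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("[\\s\\u00A0]+");

    // Replace every whitespace run with a single space and trim both ends.
    static String normalize(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }

    // True if needle occurs in haystack once both have been normalized.
    static boolean containsIgnoringWhitespace(String haystack, String needle) {
        return normalize(haystack).contains(normalize(needle));
    }

    public static void main(String[] args) {
        // Stand-in for textAbsorber.getText(); real output may differ.
        String extracted = "Citi Research is a division of\r\nCitigroup   Global Markets Inc. (the \"Firm\"), which does";
        String expected = "Citi Research is a division of Citigroup Global Markets Inc. (the \"Firm\")";

        System.out.println(extracted.contains(expected));                   // false
        System.out.println(containsIgnoringWhitespace(extracted, expected)); // true
    }
}
```

In the snippet above, pdfText would take the place of the extracted string. If the strings also differ in characters other than whitespace (for example curly versus straight quotes), this sketch will not help and those characters would need their own normalization.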
Hi Team,
We are using the dependency below to get TextAbsorber. The TextExtractionOptions class is not available in this version.
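<dependency>
    <groupId>asposej</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>9.7.0</version>
    <classifier>jdk16</classifier>
</dependency>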
Kindly let us know which version of aspose-pdf we should use.
Appreciate your help!
Thanks
@forasposeissues
It appears that you are using a very old version of the API. Many classes and methods have become obsolete since then. We always recommend using the latest version so that you benefit from the improvements and updates in the API. Please try version 23.5 and let us know if you notice any issues.
Hi Team,
We are using the dependency below and the code snippet you suggested. At runtime we are getting an exception. Please find the logs below.
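<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>23.5</version>
    <classifier>jdk17</classifier>
</dependency>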
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions( TextExtractionOptions.TextFormattingMode.Raw ));
textAbsorber.visit(pdfDocument);
- Error while matching text for pdfDocument
com.aspose.pdf.exceptions.IndexOutOfRangeException: At most 4 elements (for any collection) can be viewed in evaluation mode.
at com.aspose.pdf.ADocument.lf(Unknown Source)
at com.aspose.pdf.PageCollection.lf(Unknown Source)
at com.aspose.pdf.PageCollection.get_Item(Unknown Source)
at com.aspose.pdf.TextAbsorber.visit(Unknown Source)
Thanks
@forasposeissues
It looks like you are using the API without a valid license, so it is running in evaluation mode, which is why the exception reports that at most 4 elements of any collection can be viewed. Please apply a free 30-day temporary license to evaluate the latest version and let us know if you still notice any issues. Once your evaluation is complete, you can upgrade your license subscription.