Source string not getting matched with the target string

forasposeissues · June 13, 2023, 10:53am

Hi team,

We are facing an issue while matching text from a source string against a target string. We obtain the target string from the Aspose utility by passing it our sample PDF file, but the source string is not found in the target string.
Please find the code snippet and the sample source PDF file attached.

Code Snippet

Document pdfDocument = new Document("TestDocument.pdf");

TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();

pdfText.contains("Citi Research is a division of Citigroup Global Markets Inc. (the \"Firm\"), which does and seeks to do business with companies covered in its research "
        + "reports. As a result, investors should be aware that the Firm may have a conflict of interest that could affect the objectivity of this report. Investors should "
        + "consider this report as only a single factor in making their investment decision. Certain products (not inconsistent with the author's published research) are "
        + "available only on Citi's portals.");

TestDocument.pdf (137.1 KB)

asad.ali · June 13, 2023, 7:30pm

@forasposeissues

Please use TextExtractionOptions to extract the content in Raw formatting mode, as in the following snippet:

Document pdfDocument = new Document(dataDir + "TestDocument.pdf");

TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();

pdfText.contains("Citi Research is a division of Citigroup Global Markets Inc. (the \"Firm\"), which does and seeks to do business with companies covered in its research "
        + "reports. As a result, investors should be aware that the Firm may have a conflict of interest that could affect the objectivity of this report. Investors should "
        + "consider this report as only a single factor in making their investment decision. Certain products (not inconsistent with the author's published research) are "
        + "available only on Citi's portals.");
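
If the match still fails after switching to Raw mode, one possible cause (an assumption, not something confirmed for this particular file) is that the extracted text contains line breaks, repeated spaces, or typographic quotes where the search string has single spaces and straight quotes. A minimal sketch of a whitespace-insensitive comparison using only the Java standard library follows. The class name TextMatch and its methods are hypothetical helpers and are not part of the Aspose API; pdfText here stands in for the value returned by textAbsorber.getText().

```java
// Hypothetical helper, not part of Aspose.PDF: compares text while ignoring
// differences in whitespace and in straight vs. typographic quotes.
final class TextMatch {

    // Map typographic quotes and non-breaking spaces to plain ASCII,
    // then collapse every run of whitespace (including line breaks) to one space.
    static String normalize(String s) {
        return s.replace('\u201C', '"')
                .replace('\u201D', '"')
                .replace('\u2018', '\'')
                .replace('\u2019', '\'')
                .replace('\u00A0', ' ')
                .replaceAll("\\s+", " ")
                .trim();
    }

    // True if needle occurs in haystack once both have been normalized.
    static boolean containsNormalized(String haystack, String needle) {
        return normalize(haystack).contains(normalize(needle));
    }

    public static void main(String[] args) {
        // Stand-in for textAbsorber.getText(): the sentence wraps across lines.
        String pdfText = "companies covered in its research\r\nreports. As a result,  investors";
        String expected = "companies covered in its research reports. As a result, investors";

        System.out.println(pdfText.contains(expected));            // prints false
        System.out.println(containsNormalized(pdfText, expected)); // prints true
    }
}
```

With such a helper, the check becomes TextMatch.containsNormalized(pdfText, expectedSentence) instead of pdfText.contains(expectedSentence). Note that normalization discards layout information, so it is only suitable when line breaks in the PDF are not significant for the comparison.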

forasposeissues · June 19, 2023, 6:16pm

Hi Team,

We are using the dependency below to use TextAbsorber. The TextExtractionOptions class is not available in this version.

groupId: asposej, artifactId: aspose-pdf, version: 9.7.0, classifier: jdk16

Kindly let us know which version of aspose-pdf we should use.

Appreciate your help!

Thanks

asad.ali · June 20, 2023, 12:58am

@forasposeissues

We are afraid you are using a very old version of the API. Many classes and methods have become obsolete since then. We always recommend using the latest version so that you keep benefiting from improvements and updates to the API. Please try version 23.5 and let us know if you notice any issues.

forasposeissues · June 20, 2023, 8:38am

Hi Team,

We are using the dependency below and the code snippet you suggested. At runtime we are getting an exception. Please find the logs below.

groupId: com.aspose, artifactId: aspose-pdf, version: 23.5, classifier: jdk17

TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);

Error while matching text for pdfDocument
com.aspose.pdf.exceptions.IndexOutOfRangeException: At most 4 elements (for any collection) can be viewed in evaluation mode.
at com.aspose.pdf.ADocument.lf(Unknown Source)
at com.aspose.pdf.PageCollection.lf(Unknown Source)
at com.aspose.pdf.PageCollection.get_Item(Unknown Source)
at com.aspose.pdf.TextAbsorber.visit(Unknown Source)

Thanks

asad.ali · June 20, 2023, 6:39pm

@forasposeissues

It looks like you are using the API without a valid license. Please apply a 30-day free temporary license to evaluate the latest version as needed, and let us know if you still notice any issues. Once your evaluation is complete, you can upgrade your license subscription.