Aspose.PDF for .NET - Gives error while extracting text

sagardeshpande · April 8, 2025, 5:17am

Hi Team

We have a licensed copy of Aspose for .NET (version 24.12.0.0) and use it to extract text from files. It works fine for MS Office files (docx, pptx, xlsx), but it gives the error below when extracting text from a PDF.
This happens in a web app deployed to Azure App Service.
We have tried different approaches (TextAbsorber, PdfExtractor, TextFragmentAbsorber), all reading the file from a Stream, and each gives the same error.
Following a suggestion from another thread, we also uploaded all the fonts to a folder on the Azure web app SCM site and loaded them before processing, but that didn’t work either.

ERROR -
ERROR, ExtractTextForGenAI,ExtractTextFromPdf,File Content as Text,Object reference not set to an instance of an object. at #=zKGtBu_r4CWQYEpV3WrUXJQ$7hEV$U$yMnzPj5y$DvLEHDjUY1gcXfQI=.#=z7K5fcVA=(SelectFont #=zSGExpSA=)
at #=zKGtBu_r4CWQYEpV3WrUXJQ$7hEV$U$yMnzPj5y$DvLEHDjUY1gcXfQI=.#=zXFkMsOY=(Page #=zQ0w9hZk=)
at #=zQO_jvZn7T8UjkCMord1dcMZgZWujy9S8nn7UJQhn6J6FMVQoeedTn4Fq1ZDm.#=zfrImHu2fUeed(BaseOperatorCollection #=zEztrpv0=, Resources #=z4B4h3qw=, Page #=zQ0w9hZk=, Rectangle #=zlvaXEqeghKrI)
at #=zQO_jvZn7T8UjkCMord1dcMZgZWujy9S8nn7UJQhn6J6FMVQoeedTn4Fq1ZDm.#=zfrImHu2fUeed(BaseOperatorCollection #=zEztrpv0=, Resources #=z4B4h3qw=, Rectangle #=zlvaXEqeghKrI)
at #=zQO_jvZn7T8UjkCMord1dcMZgZWujy9S8nn7UJQhn6J6FMVQoeedTn4Fq1ZDm.#=zf3QpCxw=(Boolean #=z513IdM2B1xR7)
at #=zQO_jvZn7T8UjkCMord1dcMZgZWujy9S8nn7UJQhn6J6FMVQoeedTn4Fq1ZDm…ctor(Page #=zQ0w9hZk=, TextSearchOptions #=zBLsaPkrQg0Cz, Boolean #=zrD7ZclezTy7v)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()
at ExtractTextForGenAI.ExtractTextFromPdf(Stream stream, StringBuilder builder, String filePath) in Services\GenAI\ExtractTextForGenAI.cs:line 167

Professionalize.Discourse · April 8, 2025, 5:17am

@sagardeshpande

It seems you are encountering a NullReferenceException while trying to extract text from PDF files using Aspose.PDF for .NET. This error typically indicates that the code is attempting to access an object that has not been instantiated.

Here are some steps you can take to troubleshoot and potentially resolve the issue:

Check PDF File Integrity: Ensure that the PDF file you are trying to process is not corrupted. You can try opening it in a PDF viewer to confirm that it displays correctly.
Use TextSearchOptions: When using the TextAbsorber or PdfExtractor, consider setting the TextSearchOptions to log text extraction errors. This can help you identify specific issues during the extraction process. You can enable logging by setting the LogTextExtractionErrors property to true:
```
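// false: treat any search phrase as plain text, not a regular expression
TextSearchOptions options = new TextSearchOptions(false);
// Record extraction problems instead of silently skipping them
options.LogTextExtractionErrors = true;
TextAbsorber absorber = new TextAbsorber(options);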
```
Font Management: Since you mentioned uploading fonts to the Azure app service, ensure that the fonts are correctly loaded before processing the PDF. You can use the FontRepository to load fonts from a specific directory:
```
FontRepository.Sources.Add(new FolderFontSource("path_to_fonts"));
```
Error Handling: Implement error handling in your code to catch exceptions and log detailed information about the error. This can help you pinpoint the exact location and cause of the issue.
Review Documentation: For more detailed information on text extraction and error handling, refer to the Aspose.PDF documentation, particularly the sections on TextAbsorber and TextSearchOptions.
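
As a rough illustration of the error-handling step, the sketch below extracts text one page at a time, so that a single failing page is logged with its page number instead of aborting the whole document. The class name `PdfTextProbe`, the method `ExtractTextPerPage`, and the `log` callback are hypothetical names chosen for this example; it assumes the Aspose.PDF for .NET package is referenced and that the incoming stream is readable and positioned at the start of the PDF.

```csharp
using System;
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class PdfTextProbe
{
    // Hypothetical helper: extracts text page by page and reports which
    // page (if any) throws, rather than failing the whole document.
    public static string ExtractTextPerPage(Stream stream, Action<string> log)
    {
        var builder = new StringBuilder();

        using (var document = new Document(stream))
        {
            // Aspose.PDF page collections are 1-based.
            for (int pageNumber = 1; pageNumber <= document.Pages.Count; pageNumber++)
            {
                try
                {
                    // A fresh absorber per page keeps each page's text separate.
                    var absorber = new TextAbsorber();
                    document.Pages[pageNumber].Accept(absorber);
                    builder.AppendLine(absorber.Text);
                }
                catch (Exception ex)
                {
                    // Log the page number and the full exception, then continue.
                    log($"Page {pageNumber} failed: {ex.GetType().Name}: {ex.Message}");
                    log(ex.ToString());
                }
            }
        }

        return builder.ToString();
    }
}
```

Calling it as `PdfTextProbe.ExtractTextPerPage(stream, Console.WriteLine)` would show whether the exception comes from every page or only from specific ones, which is useful detail to include along with the PDF file when contacting support.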

If the problem persists after trying these steps, consider reaching out to Aspose support with the specific PDF file and the code snippet you are using. They may be able to provide more targeted assistance.

If you have any more questions or need additional help, feel free to ask!