Aspose PDF Exception: Operand value is not a name

Hello,

My team is using the Aspose.PDF library for .NET to extract text from PDF files. We ran into the following exception while using TextAbsorber to extract text page by page:

Aspose.Pdf.PdfException: Operand value is not a name
at #=zyt4T9KO7peVjkhq3xluWgWvLUxlueatyfBhU0$bz4ekX.#=zpVFbElM=()
at #=zhwl8667iwsEz6rze47bjzpYYwNEMl$3tLQG6InPVjqRbrrW5fXY$J94=.#=z6QD6iaDT30UG(Int32 #=zu_nAOcU=, Operator #=zXwUxPQE=)
at #=zhwl8667iwsEz6rze47bjzpYYwNEMl$3tLQG6InPVjqRbrrW5fXY$J94=.#=zbQhQKFg=(Page #=ztL8V05k=)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns.#=zwQHe3GexITOq(BaseOperatorCollection #=zWR6Slpk=, Resources #=za3NwiOk=, Page #=ztL8V05k=, Rectangle #=zsJwR5inyT$sP)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns.#=zwQHe3GexITOq(BaseOperatorCollection #=zWR6Slpk=, Resources #=za3NwiOk=, Rectangle #=zsJwR5inyT$sP)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns.#=zGK7Mmdc=(Boolean #=zf9_O69sVgPb0)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns..ctor(Page #=ztL8V05k=, TextSearchOptions #=zqQYmXUFMg2zg, Boolean #=zGVp0$i07r2iN)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns..ctor(Page #=ztL8V05k=, TextSearchOptions #=zqQYmXUFMg2zg)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Page.Accept(TextAbsorber visitor)

This is a snippet of our code for extracting text from PDFs:

public (string textContent, int pageCount) PdfToText(byte[] sourceBytes)
{
	using (var inputStream = new MemoryStream(sourceBytes))
	{
		using (var document = new Aspose.Pdf.Document(inputStream))
		{
			int pageLimit = int.Min(ExtractionConfig.PdfNumberOfPages, document.Pages.Count);
			TextAbsorber textAbsorber = new TextAbsorber();

			if (!ExtractionConfig.ReadByPageNumber || ExtractionConfig.ReadPages == null || ExtractionConfig.ReadPages.Length == 0)
			{
				// Read pages up to the page limit starting from the beginning of the document
				for (int i = 1; i <= pageLimit; i++)
				{
					document.Pages[i].Accept(textAbsorber);
				}
			}
			else
			{
				// Read by page number if the user has selected specific pages to read
				var length = document.Pages.Count;
				var lengthInvert = -length; // computed once up front rather than per page
				var pages = ExtractionConfig.ReadPages!.Where(i => i < length && i > lengthInvert).ToArray();
				for (int i = 0; i < pages.Length; i++)
				{
					// Walk the selected zero-based indexes; negative ones count back from the end of the document
					var trueIndex = (pages[i] >= 0 ? pages[i] : length + pages[i]) + 1;
					document.Pages[trueIndex].Accept(textAbsorber);
				}
			}
			return (textAbsorber.Text, document.Pages.Count);
		}
	}
}

Unfortunately, due to an NDA, I cannot share the file that caused the exception. However, the file does open in a normal PDF viewer (Microsoft Edge), which suggests that it is not corrupted.

The code extracts text from most PDFs, but we get the PDF exception on 3 out of 1000 files.

Could I get some guidance on why I’m getting this PDF exception and how I can fix it, please?

Thank you

@anguyen1

The exception “Operand value is not a name” typically indicates an issue with the PDF structure or content that the Aspose.PDF library cannot process correctly. This can occur with certain PDF files that may contain non-standard or corrupted elements, even if they open fine in standard PDF viewers.

Here are some suggestions to help you troubleshoot and potentially resolve this issue:

  1. Check PDF Structure: Use a PDF validation tool to analyze the structure of the problematic files. Issues may be identified that are not visible in standard viewers.

  2. Implement Try-Catch Block: Surround your text extraction code with a try-catch block to handle exceptions gracefully. This allows you to log specific pages or content that cause the issue without crashing your application.

  3. Update Aspose.PDF: Ensure you are using the latest version of the Aspose.PDF library. Updates often include bug fixes and improvements that may resolve your issue.

  4. Use Alternative Extraction Methods: If the TextAbsorber is failing, consider using other methods for text extraction, such as TextFragmentAbsorber, which may handle certain PDF structures more effectively.

  5. Contact Support: If the issue persists, consider reaching out to Aspose support with details about the problematic files. They may provide insights or fixes specific to the files you are working with.

Here’s a modified version of your code with a try-catch block:

public (string textContent, int pageCount) PdfToText(byte[] sourceBytes)
{
    using (var inputStream = new MemoryStream(sourceBytes))
    {
        using (var document = new Aspose.Pdf.Document(inputStream))
        {
            int pageLimit = Math.Min(ExtractionConfig.PdfNumberOfPages, document.Pages.Count);
            TextAbsorber textAbsorber = new TextAbsorber();

            for (int i = 1; i <= pageLimit; i++)
            {
                try
                {
                    document.Pages[i].Accept(textAbsorber);
                }
                catch (Aspose.Pdf.PdfException ex)
                {
                    // Log the failing page and continue with the remaining pages
                    Console.WriteLine($"Error extracting text from page {i}: {ex.Message}");
                }
            }

            return (textAbsorber.Text, document.Pages.Count);
        }
    }
}

This approach will help you identify which pages are causing issues without halting the entire extraction process. If you continue to face issues, consider sharing the details with Aspose support for further assistance.
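
Suggestions 2 and 4 can also be combined: try one extractor per page, fall back to a second one if it throws, and record the pages where both fail. The following is only a minimal sketch of that pattern, not an Aspose-provided method. The names `PageTextExtraction` and `ExtractWithFallback` are hypothetical, and the helper catches `Exception` broadly for simplicity; in real code you would probably narrow that to `Aspose.Pdf.PdfException`. The commented wiring at the bottom assumes a fresh `TextAbsorber` per page as the primary extractor and `TextFragmentAbsorber` as the fallback.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextExtraction
{
    // Calls `primary` for each 1-based page number. If it throws, tries `fallback`.
    // If both throw, the page number is recorded and extraction moves on.
    public static (string Text, List<int> FailedPages) ExtractWithFallback(
        int pageCount,
        Func<int, string> primary,
        Func<int, string> fallback)
    {
        var text = new StringBuilder();
        var failedPages = new List<int>();

        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                text.Append(primary(page));
            }
            catch (Exception)
            {
                try
                {
                    text.Append(fallback(page));
                }
                catch (Exception)
                {
                    failedPages.Add(page);
                }
            }
        }

        return (text.ToString(), failedPages);
    }
}

// Possible wiring with Aspose.PDF (assumes `document` and `pageLimit` from the
// code above, plus `using System.Linq;` and `using Aspose.Pdf.Text;`):
//
// var (text, failedPages) = PageTextExtraction.ExtractWithFallback(
//     pageLimit,
//     page => { var a = new TextAbsorber(); document.Pages[page].Accept(a); return a.Text; },
//     page => { var f = new TextFragmentAbsorber(); document.Pages[page].Accept(f);
//               return string.Concat(f.TextFragments.Select(t => t.Text)); });
```

Using a new absorber for each page means a failure on one page cannot affect text already collected from the others, and the returned list of failed page numbers gives you something concrete to report if you open a support ticket.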

@anguyen1
Ideally, you should check how the file opens in Adobe Acrobat. Browsers vary in how they implement PDF viewing and may ignore some problems, while Acrobat serves as the reference standard and may give more detailed information about why the document contains errors.
TextAbsorber usually works without such issues, and I haven’t found similar issues in the backlog, so unfortunately this seems to be a document-specific issue that would be difficult to investigate further without the original document. Is it possible to remove the sensitive information from a document that causes the exception?