Hello,
My team is using the Aspose.PDF library for .NET to extract text from PDF files. We ran into the following exception while using the TextAbsorber to extract text page by page:
Aspose.Pdf.PdfException: Operand value is not a name
at #=zyt4T9KO7peVjkhq3xluWgWvLUxlueatyfBhU0$bz4ekX.#=zpVFbElM=()
at #=zhwl8667iwsEz6rze47bjzpYYwNEMl$3tLQG6InPVjqRbrrW5fXY$J94=.#=z6QD6iaDT30UG(Int32 #=zu_nAOcU=, Operator #=zXwUxPQE=)
at #=zhwl8667iwsEz6rze47bjzpYYwNEMl$3tLQG6InPVjqRbrrW5fXY$J94=.#=zbQhQKFg=(Page #=ztL8V05k=)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns.#=zwQHe3GexITOq(BaseOperatorCollection #=zWR6Slpk=, Resources #=za3NwiOk=, Page #=ztL8V05k=, Rectangle #=zsJwR5inyT$sP)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns.#=zwQHe3GexITOq(BaseOperatorCollection #=zWR6Slpk=, Resources #=za3NwiOk=, Rectangle #=zsJwR5inyT$sP)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns.#=zGK7Mmdc=(Boolean #=zf9_O69sVgPb0)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns..ctor(Page #=ztL8V05k=, TextSearchOptions #=zqQYmXUFMg2zg, Boolean #=zGVp0$i07r2iN)
at #=z9W8OEM$p8$g7694whr8T0tKyLlpInzoY3I4MjFJfDkJbn$j9eoDDDq8FlOns..ctor(Page #=ztL8V05k=, TextSearchOptions #=zqQYmXUFMg2zg)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Page.Accept(TextAbsorber visitor)
This is a snippet of our code for extracting text from PDFs:
public (string textContent, int pageCount) PdfToText(byte[] sourceBytes)
{
    using (var inputStream = new MemoryStream(sourceBytes))
    {
        using (var document = new Aspose.Pdf.Document(inputStream))
        {
            int pageLimit = int.Min(ExtractionConfig.PdfNumberOfPages, document.Pages.Count);
            TextAbsorber textAbsorber = new TextAbsorber();
            if (!ExtractionConfig.ReadByPageNumber || ExtractionConfig.ReadPages == null || ExtractionConfig.ReadPages.Length == 0)
            {
                // Read pages up to the page limit starting from the beginning of the document
                for (int i = 1; i <= pageLimit; i++)
                {
                    document.Pages[i].Accept(textAbsorber);
                }
            }
            else
            {
                // Read by page number if the user has selected specific pages to read
                var length = document.Pages.Count;
                var lengthInvert = length * -1; // computed once up front instead of per page
                var pages = ExtractionConfig.ReadPages!.Where(i => i < length && i > lengthInvert).ToArray();
                // The pages array is zero-based, so iterate from 0 to pages.Length - 1
                for (int i = 0; i < pages.Length; i++)
                {
                    // Walk the requested indices; negative values count back from the end
                    var trueIndex = (pages[i] >= 0 ? pages[i] : length + pages[i]) + 1;
                    document.Pages[trueIndex].Accept(textAbsorber);
                }
            }
            return (textAbsorber.Text, document.Pages.Count);
        }
    }
}
Unfortunately, due to an NDA, I cannot share the file that caused the exception. However, the file does open in a normal PDF viewer (Microsoft Edge), suggesting that it is not corrupted.
The code extracts text from most PDFs, but we get the PDF exception on 3 files out of 1000 files.
Could I get some guidance on why I'm getting this PDF exception and how I can fix it, please?
Thank you
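The page-selection branch in the snippet above maps zero-based and negative indices onto Aspose's 1-based page numbers. A minimal sketch of that mapping as a standalone helper may make it easier to test separately from the PDF library. The names `PageSelection` and `ResolvePageNumbers` are hypothetical, and the sketch assumes that `-pageCount` should resolve to the first page (the original filter excludes that value).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageSelection
{
    // Hypothetical helper, not part of Aspose.PDF.
    // Non-negative values are zero-based from the start (0 -> page 1).
    // Negative values count back from the end (-1 -> last page).
    // Out-of-range values are dropped instead of throwing.
    public static int[] ResolvePageNumbers(IEnumerable<int> requested, int pageCount)
    {
        return requested
            .Where(i => i < pageCount && i >= -pageCount)
            .Select(i => (i >= 0 ? i : pageCount + i) + 1)
            .ToArray();
    }
}
```

The extraction loop could then iterate over the returned array with `foreach` and pass each value straight to `document.Pages[...]`, which avoids indexing the intermediate array by hand.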
@anguyen1
The exception "Operand value is not a name" typically indicates an issue with the PDF structure or content that the Aspose.PDF library cannot process correctly. This can occur with certain PDF files that may contain non-standard or corrupted elements, even if they open fine in standard PDF viewers.
Here are some suggestions to help you troubleshoot and potentially resolve this issue:
- Check PDF Structure: Use a PDF validation tool to analyze the structure of the problematic files. Issues may be identified that are not visible in standard viewers.
- Implement Try-Catch Block: Surround your text extraction code with a try-catch block to handle exceptions gracefully. This allows you to log specific pages or content that cause the issue without crashing your application.
- Update Aspose.PDF: Ensure you are using the latest version of the Aspose.PDF library. Updates often include bug fixes and improvements that may resolve your issue.
- Use Alternative Extraction Methods: If the TextAbsorber is failing, consider using other methods for text extraction, such as TextFragmentAbsorber, which may handle certain PDF structures more effectively.
- Contact Support: If the issue persists, consider reaching out to Aspose support with details about the problematic files. They may provide insights or fixes specific to the files you are working with.
Here's a modified version of your code with a try-catch block:
public (string textContent, int pageCount) PdfToText(byte[] sourceBytes)
{
    using (var inputStream = new MemoryStream(sourceBytes))
    {
        using (var document = new Aspose.Pdf.Document(inputStream))
        {
            int pageLimit = Math.Min(ExtractionConfig.PdfNumberOfPages, document.Pages.Count);
            TextAbsorber textAbsorber = new TextAbsorber();
            for (int i = 1; i <= pageLimit; i++)
            {
                // The try-catch sits inside the loop so that one bad page does not stop the rest
                try
                {
                    document.Pages[i].Accept(textAbsorber);
                }
                catch (Aspose.Pdf.PdfException ex)
                {
                    // Log the failing page and continue with the remaining pages
                    Console.WriteLine($"Error extracting text from page {i}: {ex.Message}");
                }
            }
            return (textAbsorber.Text, document.Pages.Count);
        }
    }
}
This approach will help you identify which pages are causing issues without halting the entire extraction process. If you continue to face issues, consider sharing the details with Aspose support for further assistance.
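The TextFragmentAbsorber suggestion above can be sketched as a per-page fallback. This is only a sketch under assumptions: the class and method names (`TolerantExtractor`, `Extract`) are hypothetical, and because the stack trace shows the failure inside shared internal parsing code, the fallback may well throw the same exception on the same page. In that case the page number is recorded so the failing pages can be reported.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TolerantExtractor
{
    // Hypothetical helper: extracts text page by page and reports which pages failed.
    public static (string Text, List<int> FailedPages) Extract(byte[] sourceBytes)
    {
        var text = new StringBuilder();
        var failedPages = new List<int>();

        using var inputStream = new MemoryStream(sourceBytes);
        using var document = new Document(inputStream);

        // Aspose.PDF page numbers are 1-based.
        for (int pageNumber = 1; pageNumber <= document.Pages.Count; pageNumber++)
        {
            Page page = document.Pages[pageNumber];
            try
            {
                // A fresh absorber per page, so a failure cannot leave partial state behind.
                var absorber = new TextAbsorber();
                page.Accept(absorber);
                text.Append(absorber.Text);
            }
            catch (PdfException)
            {
                try
                {
                    // Fallback: collect individual text fragments instead.
                    var fragmentAbsorber = new TextFragmentAbsorber();
                    page.Accept(fragmentAbsorber);
                    foreach (TextFragment fragment in fragmentAbsorber.TextFragments)
                    {
                        text.AppendLine(fragment.Text);
                    }
                }
                catch (PdfException)
                {
                    // Both approaches failed; remember the page for later reporting.
                    failedPages.Add(pageNumber);
                }
            }
        }

        return (text.ToString(), failedPages);
    }
}
```

Using one absorber per page costs a few extra allocations, but it keeps the text of healthy pages independent of whatever happened on a failing one, and the returned list of page numbers is a convenient thing to include in a support request.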
@anguyen1
Ideally, you should check how the file opens in Adobe Acrobat. Browsers vary in how they implement PDF viewing and may ignore some problems, whereas Acrobat serves as the de facto reference and may give more detailed information about why the document contains errors.
TextAbsorber usually works without such issues, and I haven't found similar reports in our backlog, so unfortunately this looks like a document-specific issue that would be difficult to investigate further without the original document. Is it possible to remove the sensitive information from a document that causes the exception and share it?
Same issue here, only specific pages are problematic.
Many of my files (mostly books) were working before with Aspose.PDF.Drawing version 25.5.0, so I believe it is a regression.
Also, I noticed the PDF processing in general has become noticeably slower in the latest versions.
Is the PDF structure validation stricter now than before?
@loremipsummer
Could you provide the file, and the code if it differs, that reproduces the issue?
It does indeed look like a regression, but without data I can't assign a task to the development team to fix it.
@ilyazhuykov
Sure, consider the following example:
byte[] pdfBytes = File.ReadAllBytes("hobbsparker.pdf");
using var inputStream = new MemoryStream(pdfBytes);
using var document = new Document(inputStream);
int pageLimit = document.Pages.Count;
TextAbsorber textAbsorber = new();
try
{
    for (int i = 1; i <= pageLimit; i++)
    {
        document.Pages[i].Accept(textAbsorber);
    }
}
catch (Aspose.Pdf.PdfException ex)
{
    Console.WriteLine($"Error extracting text from page: {ex.Message} \n {ex.StackTrace}");
}
With the code above, Aspose.Pdf.Drawing version 25.6.0 can read the content of the attached PDF document. Starting from version 25.7.0, however, I get this:
Exception thrown: 'Aspose.Pdf.PdfException' in Aspose.PDF.dll
Error extracting text from page: Operand value is not a name
at #=zVBesxZiXHZA5mO62SKcL97vZK1hpkl7JFKJAjo1JntIf.#=zLw$Vy5k=()
at #=zBjWHJDYzyNGAFLfXhBRIa2dqiTGsBebVk8c9fvjQQ6A_FMERTrBDLJY=.#=z$qayYR0w_fkB(Int32 #=zkU4oz0g=, Operator #=z6d$LxOI=)
at #=zBjWHJDYzyNGAFLfXhBRIa2dqiTGsBebVk8c9fvjQQ6A_FMERTrBDLJY=.#=zKKBBBB4=(Page #=z1R1INMY=)
at #=z1id5WZKEpXEW2gLDhbXgQe4ceclTdI2YhjrM_$2nUOB7gFFswA64MC62ClmI.#=zLX88p6oGv805(BaseOperatorCollection #=zkqIIH0Y=, Resources #=zpaCdI2M=, Page #=z1R1INMY=, Rectangle #=zXpUQ1qFjcW3B)
at #=z1id5WZKEpXEW2gLDhbXgQe4ceclTdI2YhjrM_$2nUOB7gFFswA64MC62ClmI.#=zLX88p6oGv805(BaseOperatorCollection #=zkqIIH0Y=, Resources #=zpaCdI2M=, Rectangle #=zXpUQ1qFjcW3B)
at #=z1id5WZKEpXEW2gLDhbXgQe4ceclTdI2YhjrM_$2nUOB7gFFswA64MC62ClmI.#=zPSswGBc=(Boolean #=zloA2CikVvt6j)
at #=z1id5WZKEpXEW2gLDhbXgQe4ceclTdI2YhjrM_$2nUOB7gFFswA64MC62ClmI..ctor(Page #=z1R1INMY=, TextSearchOptions #=z8Qek_zpOgstm, Boolean #=zE8PzQCcJnkRW)
at #=z1id5WZKEpXEW2gLDhbXgQe4ceclTdI2YhjrM_$2nUOB7gFFswA64MC62ClmI..ctor(Page #=z1R1INMY=, TextSearchOptions #=z8Qek_zpOgstm)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Page.Accept(TextAbsorber visitor)
hobbsparker.pdf (827.4 KB)
@anguyen1
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-61214
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
Thank you for providing the file. I managed to reproduce the issue and created a task for the development team.