Problem with textAbsorber

martin.t · December 11, 2015, 10:18pm

Hi,

I’m checking whether a document contains only images:

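using System.Text.RegularExpressions; // Regex
using Aspose.Pdf.Text;                // TextAbsorber

public static bool HasOnlyImages(Aspose.Pdf.Document document)
{
    for (int page = 1; page <= document.Pages.Count; page++)
    {
        // Create a TextAbsorber object to extract the page's text
        TextAbsorber textAbsorber = new TextAbsorber();
        document.Pages[page].Accept(textAbsorber); // Exception occurs on this line

        // Get the extracted text and collapse whitespace runs, so that a page
        // holding only whitespace counts as empty
        string extractedText = textAbsorber.Text;
        extractedText = Regex.Replace(extractedText, @"[\s\r\n]+", " ").Trim();

        if (extractedText != String.Empty)
        {
            return false; // this page contains text
        }
    }

    // No page contained any text
    return true;
}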

When this runs on the attached PDF, it throws an exception inside TextAbsorber:

12-Dec-2015 11:02:44 Extracting text
12-Dec-2015 11:02:44 Rendering error: System.NullReferenceException: Object reference not set to an instance of an object.
12-Dec-2015 11:02:44 at ?.?.?(Operator )
12-Dec-2015 11:02:44 at ?.?.Parse()
12-Dec-2015 11:02:44 at ?.?.(BaseOperatorCollection , Resources , Page )
12-Dec-2015 11:02:44 at ?.?.(BaseOperatorCollection , Resources )
12-Dec-2015 11:02:44 at ?.?.()
12-Dec-2015 11:02:44 at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
12-Dec-2015 11:02:44 at T1.Rendering.Renderer.RenderUtils.HasOnlyImages(Document document)

It works for many other PDFs. I can’t see anything wrong or unusual about the PDF that fails.

Is this an Aspose bug, or is there something wrong with this PDF?

Thanks,

Martin

codewarior · December 14, 2015, 1:43am

Hi Martin,

Thanks for contacting support.

I have tested the scenario using Aspose.Pdf for .NET 11.0.0 in a Visual Studio 2010 project targeting .NET Framework 4.0 on Windows 7 (x64), and I could not reproduce the issue. Can you please try the latest release and see if it resolves your problem? If the problem persists, please share some details about your working environment. We are sorry for the inconvenience.
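
If the problem persists, it may help to find out which page triggers the exception and to collect the environment details in one go. The following is only a rough sketch, not an official Aspose sample: PageTextProbe, FindFailingPages and the extractPageText delegate are made-up names, and the Aspose call appears only in a comment (using the same calls as the code in the question) so that the snippet compiles without the library.

```csharp
using System;
using System.Collections.Generic;

public static class PageTextProbe
{
    // Calls extractPageText for pages 1..pageCount and returns the pages that threw.
    // extractPageText is a hypothetical delegate. With Aspose.Pdf it could wrap the
    // calls from the original post, for example:
    //   page => { var a = new TextAbsorber(); document.Pages[page].Accept(a); return a.Text; }
    public static List<int> FindFailingPages(int pageCount, Func<int, string> extractPageText)
    {
        var failing = new List<int>();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                extractPageText(page);
            }
            catch (Exception ex)
            {
                // Log the page number with the exception type and message
                Console.WriteLine("Page {0} failed: {1}: {2}", page, ex.GetType().Name, ex.Message);
                failing.Add(page);
            }
        }
        return failing;
    }

    public static void Main()
    {
        // Environment details worth including in a report
        Console.WriteLine("OS: {0}", Environment.OSVersion);
        Console.WriteLine("CLR: {0}", Environment.Version);
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);

        // Stand-in extractor that fails on page 2, only to show how the helper behaves
        List<int> failing = FindFailingPages(3, page =>
        {
            if (page == 2) throw new NullReferenceException();
            return "text";
        });
        Console.WriteLine("Failing pages: {0}", string.Join(", ", failing));
    }
}
```

In a real project the stand-in lambda would be replaced by the TextAbsorber call shown in the comment, with document.Pages.Count passed as pageCount. The version of the loaded Aspose.Pdf assembly can be read through standard reflection, for example typeof(Aspose.Pdf.Document).Assembly.GetName().Version.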