Free Support Forum - aspose.com

Extract Text Error

I am using the code below to extract text from each page of a PDF file. An exception is raised when the final page (page 158) is processed. The same exception is raised if I extract text from all pages at once. The problem PDF file is attached. Thanks.

Dim doc As New Document(strFile)
Dim strPageWords As String = String.Empty
Dim intPages As Integer = doc.Pages.Count
' Page numbers are 1-based
For intPage As Integer = 1 To intPages
    Dim ta As New Text.TextAbsorber()
    doc.Pages(intPage).Accept(ta)
    strPageWords = ta.Text
Next

System.NullReferenceException: Object reference not set to an instance of an object.
at ...ctor( )
at ..( )
at ..( )
at .€.(‹ )
at .€.()
at ..(Queue , • , ‰ )
at ..(• , ‰ )
at ..()
at ...ctor( )
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Page.Accept(TextAbsorber visitor)

Hi James,

Thank you for sharing the template file and sample code.

I have tested your scenario with the latest version of Aspose.Pdf for .NET v6.5 and could not reproduce the issue you reported. Please download and try the latest version and check whether it works for you.

Thank You & Best Regards,

This continues to fail for me. I am using VB in VS 2010 and the Aspose.Pdf ver 6.5 DLL from the net4.0 folder of the DLL-only download. Please check again. Thanks.

Hi James,

Sorry for the inconvenience.

After further testing, I am able to reproduce your issue. It has been registered in our issue tracking system with issue ID PDFNEWNET-32894. You will be notified via this forum thread of any updates on this issue.

Thank You & Best Regards,

Hi James,

Thanks for your patience. I am pleased to share that issue PDFNEWNET-32894, reported earlier, has been fixed. However, I am afraid we have now encountered another issue in which the complete text of the PDF file is not extracted. I have logged this problem separately as PDFNEWNET-33225 in our issue tracking system. We will look further into the details and keep you updated on the status of the fix. Please bear with us. We are sorry for the inconvenience.

The issues you have found earlier (filed as PDFNEWNET-32894) have been fixed in this update.



Hi James,


Thanks for your patience.

We have further investigated issue PDFNEWNET-33225 ("Complete text is not being extracted from PDF file"). To extract the complete text, please try the following code snippet.

Please use the latest release, Aspose.Pdf for .NET 7.9.0. If you still face the same issue or have any further queries, please feel free to contact us.

[C#]

// Namespaces the snippet relies on
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

// Open the document
Document pdfDocument = new Document("c:/pdftest/LJP2015_use_enww.pdf");

// String to hold the extracted text
string extractedText = "";

foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // Create the text device
        TextDevice textDevice = new TextDevice();

        // Set the text extraction mode (Raw or Pure)
        Aspose.Pdf.Text.TextOptions.TextExtractionOptions textExtOptions = new Aspose.Pdf.Text.TextOptions.TextExtractionOptions(Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;

        // Convert this page and write its text to the stream
        textDevice.Process(pdfPage, textStream);

        // Close the memory stream (ToArray still works on a closed MemoryStream)
        textStream.Close();

        // Decode the stream contents and append them to the result
        extractedText += Encoding.Unicode.GetString(textStream.ToArray());
    }
}

// Save the extracted text to a file
File.WriteAllText("c:/pdftest/LJP2015_use_enww.txt", extractedText);
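Because the original exception appeared on a single page (page 158), it can help to process pages one at a time and report which page fails rather than abort the whole run. The following is only a minimal sketch of that idea, not an official sample. It assumes the same Aspose.Pdf 7.x API used in the snippet above (`Document`, `TextDevice`, `TextExtractionOptions` under `Aspose.Pdf.Text.TextOptions`) and that `TextDevice` lives in the `Aspose.Pdf.Devices` namespace. The class name and the file names `input.pdf` and `output.txt` are hypothetical placeholders.

```csharp
using System;
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

class ExtractWithDiagnostics
{
    static void Main()
    {
        // Hypothetical paths; substitute your own files.
        Document pdfDocument = new Document("input.pdf");

        // StringBuilder avoids repeated string copies on large documents.
        StringBuilder extractedText = new StringBuilder();

        // Pages are indexed from 1, as in the VB loop in the original question.
        for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
        {
            try
            {
                using (MemoryStream textStream = new MemoryStream())
                {
                    TextDevice textDevice = new TextDevice();

                    // Same extraction mode as the support snippet (Pure).
                    textDevice.ExtractionOptions =
                        new Aspose.Pdf.Text.TextOptions.TextExtractionOptions(
                            Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Pure);

                    // Write this page's text to the stream, then decode it.
                    textDevice.Process(pdfDocument.Pages[pageNumber], textStream);
                    extractedText.Append(Encoding.Unicode.GetString(textStream.ToArray()));
                }
            }
            catch (Exception ex)
            {
                // Report the failing page and continue with the remaining pages.
                Console.Error.WriteLine("Page {0} failed: {1}", pageNumber, ex.Message);
            }
        }

        File.WriteAllText("output.txt", extractedText.ToString());
    }
}
```

Catching the general `Exception` type is used here only for diagnosis; the page numbers it reports are useful detail to include when posting a problem file to the forum.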