Extract Text Error

jrapp · December 21, 2011, 7:54am

I am using the code below to extract text from each page of a PDF file. An exception is raised when the final page (page 158) is processed. The same exception is also raised if I extract text from all pages at once. The problematic PDF file is attached. Thanks.

    Dim doc As New Document(strFile)
    Dim strPageWords As String = String.Empty
    Dim intPages As Integer = doc.Pages.Count
    For intPage As Integer = 1 To intPages
        Dim ta As New Text.TextAbsorber()
        doc.Pages(intPage).Accept(ta)
        strPageWords = ta.Text
    Next

System.NullReferenceException: Object reference not set to an instance of an object.
at ...ctor( )
at ..( )
at ..( )
at ..( )
at ..()
at ..(Queue , , )
at ..( , )
at ..()
at ...ctor( )
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Page.Accept(TextAbsorber visitor)

nausherwan.aslam · December 21, 2011, 12:08pm

Hi James,

Thank you for sharing the template file and sample code.

I have tested your scenario with the latest version, Aspose.Pdf for .NET v6.5, and could not reproduce the issue you reported. Please download and try the latest version and check whether it works for you.

Thank You & Best Regards,

jrapp · December 21, 2011, 1:16pm

This continues to fail for me. I am using VB in VS 2010 with the Aspose.Pdf 6.5 DLL from the net4.0 folder of the DLL-only download. Please check again for the issue. Thanks.

nausherwan.aslam · December 22, 2011, 12:32am

Hi James,

Sorry for the inconvenience.

After further testing, I was able to reproduce your issue. It has been registered in our issue tracking system with issue ID PDFNEWNET-32894. You will be notified via this forum thread of any updates on it.

Thank You & Best Regards,

codewarior · February 9, 2012, 3:51am

Hi James,

Thanks for your patience. I am pleased to report that the issue PDFNEWNET-32894 reported earlier has been fixed. However, we have now encountered another issue: the complete text of the PDF file is not being extracted. I have logged this problem separately as PDFNEWNET-33225 in our issue tracking system. We will look into it further and keep you updated on the status of the fix. Please bear with us; we are sorry for the inconvenience.

aspose.notifier · February 10, 2012, 5:14am

The issues you have found earlier (filed as PDFNEWNET-32894) have been fixed in this update.

codewarior · April 19, 2013, 6:47am

Hi James,

Thanks for your patience.

We have further investigated the issue “PDFNEWNET-33225: Complete text is not being extracted from PDF file”. To extract the complete text, please try the following code snippet.

Please use the latest release, Aspose.Pdf for .NET 7.9.0. If you still face the same issue or have any further questions, please feel free to contact us.

C#

    // namespaces required by this snippet
    using System.IO;
    using System.Text;
    using Aspose.Pdf;
    using Aspose.Pdf.Devices;

    // open document
    Document pdfDocument = new Document("c:/pdftest/LJP2015_use_enww.pdf");

    // string to hold extracted text
    string extractedText = "";

    foreach (Page pdfPage in pdfDocument.Pages)
    {
        using (MemoryStream textStream = new MemoryStream())
        {
            // create text device
            TextDevice textDevice = new TextDevice();

            // set text extraction options - set text extraction mode (Raw or Pure)
            Aspose.Pdf.Text.TextOptions.TextExtractionOptions textExtOptions = new Aspose.Pdf.Text.TextOptions.TextExtractionOptions(Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Pure);
            textDevice.ExtractionOptions = textExtOptions;

            // convert a particular page and save text to the stream
            textDevice.Process(pdfPage, textStream);

            // close memory stream
            textStream.Close();

            // get text from memory stream
            extractedText += Encoding.Unicode.GetString(textStream.ToArray());
        }
    }

    // write the text of all pages to a file
    File.WriteAllText("c:/pdftest/LJP2015_use_enww.txt", extractedText);
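
The decoding step in the snippet above depends on two details of the .NET base library rather than of Aspose.Pdf: `MemoryStream.ToArray()` still returns the full buffer after `Close()` has been called, and `Encoding.Unicode` decodes the bytes as UTF-16 little-endian. The following is a minimal sketch of just that step, without Aspose.Pdf. The class `StreamDecodeSketch`, its methods, and the sample page strings are hypothetical, and writing UTF-16 bytes into the stream is only a stand-in for `textDevice.Process(pdfPage, textStream)`, on the assumption that the device emits UTF-16 text. It also collects the pages in a `StringBuilder` instead of concatenating strings in the loop.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamDecodeSketch
{
    // Hypothetical helper: decodes the whole buffer of a MemoryStream as UTF-16LE.
    // ToArray() ignores the current Position and also works on a closed stream.
    public static string DecodeUtf16(MemoryStream stream)
    {
        return Encoding.Unicode.GetString(stream.ToArray());
    }

    // Builds one string from several "pages" (made-up sample text, not real PDF output).
    public static string Collect(string[] fakePages)
    {
        var all = new StringBuilder();
        foreach (string pageText in fakePages)
        {
            using (var textStream = new MemoryStream())
            {
                // Stand-in for textDevice.Process(pdfPage, textStream):
                // write the page text into the stream as UTF-16LE bytes.
                byte[] bytes = Encoding.Unicode.GetBytes(pageText);
                textStream.Write(bytes, 0, bytes.Length);

                // Closing first, as in the snippet above, does not lose the data.
                textStream.Close();

                all.Append(DecodeUtf16(textStream));
            }
        }
        return all.ToString();
    }

    public static void Main()
    {
        Console.Write(Collect(new[] { "page one\n", "page two\n" }));
    }
}
```

If the text were decoded with a different encoding than the one used to write the stream, the result would be garbled, so the `Encoding` passed to `GetString` has to match what was written.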