System.OutOfRangeException in getting text from a PDF

Hi, I just need to get text from a set of PDF files.


I attach a sample file and the code which raised the exception:

Dim asposeDoc As Aspose.Pdf.Document = New Aspose.Pdf.Document(f.FullName)

Dim textAbsorber As New Text.TextAbsorber()
'accept the absorber for all the pages
asposeDoc.Pages.Accept(textAbsorber)
'get the extracted text
Dim extractedText As String = textAbsorber.Text


Here the exception raised:

System.OutOfRangeException… in System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length, Boolean fAlwaysCopy)
in Aspose.Pdf.Text.TextAbsorber.( , Boolean )
in Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
in Aspose.Pdf.Page.Accept(TextAbsorber visitor)
in Aspose.Pdf.PageCollection.Accept(TextAbsorber visitor)
in …

Thank you
Federico

Hi Federico,

Thanks for using our products.

I have tested the scenario using Aspose.Pdf for .NET 7.6.0 on Windows 7 (x64) in a Visual Studio 2010 application with the target platform set to .NET Framework 4.0 Client Profile, and I was unable to reproduce the issue. Can you please try the latest release version? If the problem persists, please share some details about your working environment.

We are sorry for this inconvenience.

PS: I have observed that the source PDF file contains scanned images. You may consider extracting the images from the PDF file and performing OCR on them with Aspose.OCR. Please visit "Extract Images from the PDF File" and "Performing OCR on an Image".
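To illustrate the image-extraction suggestion above, here is a minimal sketch rather than a definitive implementation. It assumes that `Page.Resources.Images` is a 1-based collection of `XImage` objects and that `XImage.Save(Stream)` writes the image data to a stream, as described in the Aspose.Pdf documentation for this era of the product. The input path, the output folder `D:\Upload\images` and the `.jpg` extension are hypothetical. The OCR step is left out because it depends on the Aspose.OCR API.

```vbnet
Imports System
Imports System.IO
Imports Aspose.Pdf

Module ExtractImagesSketch

    Sub Main()
        ' Hypothetical input file and output folder; adjust for your machine.
        Dim pdfPath As String = "D:\Upload\sample.pdf"
        Dim outputDir As String = "D:\Upload\images"
        Directory.CreateDirectory(outputDir)

        Dim doc As New Aspose.Pdf.Document(pdfPath)
        Dim pageNumber As Integer = 0

        For Each page As Aspose.Pdf.Page In doc.Pages
            pageNumber += 1

            ' Image collections in Aspose.Pdf are assumed to be 1-based.
            For i As Integer = 1 To page.Resources.Images.Count
                Dim target As String = Path.Combine(outputDir, _
                    String.Format("page{0}_image{1}.jpg", pageNumber, i))

                ' Write the image to disk so it can be passed to an OCR engine later.
                Using output As New FileStream(target, FileMode.Create)
                    page.Resources.Images(i).Save(output)
                End Using
            Next
        Next
    End Sub

End Module
```

Each saved file could then be handed to an OCR component. Scanned pages usually contain one full-page image, so the number of output files should be close to the number of pages.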

Thank you for your quick reply!


My working environment is Windows 7 (x64) with VS 2008. I also use Aspose.PDF 7.6.0, but targeting the .NET Framework 3.5 Client Profile.

In the meantime, I will try both Extract Images from PDF and Aspose.OCR, and I will let you know how it goes.

Another question: with the same code, I succeeded in reading some files, but the extracted data looks like this: ‘Evaluation Only. Created with Aspose.Pdf. Copyright 2002-2012 Aspose Pty Ltd…’ I get only a few characters at most (around 100). Is there a limitation because I am using the trial version?

Regards,
Federico

Hi Federico,

Thanks for sharing the details.

Can you please tell us which version of Aspose.Pdf for .NET you have referenced in your project, i.e. which folder under the bin directory of the Aspose.Pdf for .NET installation path the Aspose.Pdf.dll file comes from? We do not provide any variant of Aspose.Pdf.dll for the .NET Framework 3.5 Client Profile. During my testing, I set the target platform of my application to .NET Framework 3.5 Client Profile, and when I tried referencing Aspose.Pdf.dll from the net3.5 folder, I got errors during compilation.

When I set the target platform of my application to .NET Framework 3.5 and used Aspose.Pdf.dll from the net3.5 folder, however, I was unable to reproduce the issue in a Visual Studio 2010 application running on Windows 7 (x64). I am not entirely certain that the problem is related to the operating system version, but I will look into it further.

Another question: with the same code, I succeeded in reading some files, but the extracted data looks like this: ‘Evaluation Only. Created with Aspose.Pdf. Copyright 2002-2012 Aspose Pty Ltd…’ I get only a few characters at most (around 100). Is there a limitation because I am using the trial version?

It’s a limitation of trial mode. Please request a 30-day temporary license to test the product without any limitations. Follow the instructions at the following link: Get a temporary license

We are sorry for your inconvenience.

Hi, I attach Aspose.PDF.dll that I reference in my project. Please also notice that I am using VS 2008 not VS 2010 as a working environment.


If you need more details, please tell me.

Ciao
Federico


I have also tried building a simple console application in VS 2010 myself (using the .NET 4.0 Client Profile dll), even though my official working environment is VS 2008, and I still get the same exception with the file I sent you in my first post.


I am having trouble uploading the whole solution as an attachment here, but the code is quite simple:

Imports Aspose.Pdf
Imports System.Collections.ObjectModel
Imports System.IO

Module Module1

    Sub Main()

        For Each file As String In Directory.GetFiles("D:\Upload")

            Try

                Dim asposeDoc As Aspose.Pdf.Document = New Aspose.Pdf.Document(file)

                Dim textAbsorber As New Text.TextAbsorber()
                'accept the absorber for all the pages
                asposeDoc.Pages.Accept(textAbsorber)
                'get the extracted text
                Dim extractedText As String = textAbsorber.Text
                ' "access denied" message: checking if the file can be read doesn't help.

            ' "\n" is not an escape sequence in VB, so vbCrLf ends each log line.
            Catch generatedExceptionName As FileNotFoundException
                My.Computer.FileSystem.WriteAllText("D:\log.txt", String.Format("Filenotfound : {0} {1} | ", file, DateTime.Now) & vbCrLf, True)

            ' Handle any access-denied errors that occur while reading the file.
            Catch generatedExceptionName As UnauthorizedAccessException
                My.Computer.FileSystem.WriteAllText("D:\log.txt", String.Format("unauthorizedaccess : {0} {1} | ", file, DateTime.Now) & vbCrLf, True)

            ' Generic handler for any IO-related exceptions that occur.
            Catch generatedExceptionName As IOException
                My.Computer.FileSystem.WriteAllText("D:\log.txt", String.Format("ioexception : {0} {1} | ", file, DateTime.Now) & vbCrLf, True)

            ' Anything else, including the exception raised by TextAbsorber.
            Catch genericEx As Exception
                My.Computer.FileSystem.WriteAllText("D:\log.txt", String.Format("genericexception : {0} {1} {2} | ", file, DateTime.Now, genericEx.Message) & vbCrLf, True)

            End Try

        Next

    End Sub

End Module


In the meantime, I have requested a temporary license (all the tests I’ve done so far were with the trial version).

Hi Federico,


Thanks for sharing the details.

I have tested the scenario again in a Visual Studio 2010 application with the target platform set to .NET Framework 3.5, and I was not able to reproduce the issue when using a valid license file. However, when I used the same environment without a license file, I did get this exception. So the problem occurs when the product is used in trial mode. Once you have received the license file, please try extracting the text again. If you still face a similar issue or have any further queries, please feel free to contact us.

We are really sorry for this inconvenience.

Hi, I have now tried with the temporary license and I no longer get the exception. Great!


By the way, I am benchmarking Aspose.PDF.dll against my legacy component: for the same set of PDF files, Aspose takes 50 minutes to finish the work, against just over 20 minutes for my legacy DLL.

Could it be a performance issue that I do the following for each file?


Dim _pdfLic As New Aspose.Pdf.License
_pdfLic.SetLicense("C:\Program Files (x86)\Aspose\Aspose.Pdf for .NET\Bin\net3.5\Aspose.Total.lic")

Thank you
Federico


Hi Federico,


I am glad to hear that your text extraction problem is resolved by using a valid license file. Please note that the time our component takes to complete an operation depends on the structure and complexity of the input file and on the operation being performed. Can you please share some sample PDF files and the code snippet that produce the large performance difference? We will do our best to improve the performance and reduce the processing time. We are sorry for the inconvenience.
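On the question of calling SetLicense for each file: the following is a minimal sketch, not a confirmed fix for the timing difference. It assumes the license only needs to be applied once per process, before the first document is opened, which is my understanding of how Aspose licensing is meant to be used. The license path and the D:\Upload folder are the ones quoted in this thread. The "*.pdf" filter and the Stopwatch timing are additions for illustration.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO
Imports Aspose.Pdf

Module LicenseOnceSketch

    Sub Main()
        ' Apply the license once, before any document is opened.
        Dim pdfLic As New Aspose.Pdf.License()
        pdfLic.SetLicense("C:\Program Files (x86)\Aspose\Aspose.Pdf for .NET\Bin\net3.5\Aspose.Total.lic")

        ' Time the whole extraction loop so runs can be compared.
        Dim timer As Stopwatch = Stopwatch.StartNew()

        For Each pdfPath As String In Directory.GetFiles("D:\Upload", "*.pdf")
            Dim doc As New Aspose.Pdf.Document(pdfPath)
            Dim absorber As New Aspose.Pdf.Text.TextAbsorber()

            ' Accept the absorber for all pages and read the extracted text.
            doc.Pages.Accept(absorber)
            Dim extractedText As String = absorber.Text
        Next

        timer.Stop()
        Console.WriteLine("Elapsed: {0}", timer.Elapsed)
    End Sub

End Module
```

Running this once with the license set outside the loop and once with it set inside the loop would show whether the per-file SetLicense call accounts for a meaningful share of the 50 minutes.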