Extract Text from PDF document using Aspose.PDF for .NET - index out of range exception

CoastalCIU · July 20, 2020, 9:01pm

I’m getting an index out of range exception when calling the ExtractText method on the following file.

http://www2.cybercom-intl.com/transfer/project_manual.pdf

Aspose.PDF.dll version 20.6.0.0 (6/1/2020)

asad.ali · July 21, 2020, 1:31pm

Could you please share the sample code snippet that you are using to extract the text from this PDF? We will test the scenario in our environment and address it accordingly.

CoastalCIU · July 21, 2020, 2:16pm

Imports System.IO
Imports Aspose.Pdf.Facades

' Load the PDF into memory and bind it to the extractor
Dim byteDoc As Byte() = File.ReadAllBytes("c:\temp\project_manual.pdf")
Dim msPDF As New MemoryStream(byteDoc)
Dim extractor As New PdfExtractor()
extractor.BindPdf(msPDF)

Dim msPDFText As New MemoryStream()

' Extract the text, then write it to the output stream
extractor.ExtractText()
extractor.GetText(msPDFText)

asad.ali · July 21, 2020, 9:35pm

@CoastalCIU

We were able to reproduce the issue in our environment using Aspose.PDF for .NET 20.7. We also tried the following code snippet and got a similar exception:

using Aspose.Pdf;
using Aspose.Pdf.Text;
// dataDir: path to the folder that contains the PDF
TextAbsorber ta = new TextAbsorber();
Document pdfDocument = new Document(dataDir + "project_manual.pdf");
pdfDocument.Pages.Accept(ta);

We have therefore logged the issue as PDFNET-48571 in our issue tracking system. We will look into it further and keep you posted on the status of the fix. Please be patient and allow us some time.

We are sorry for the inconvenience.

CoastalCIU · July 21, 2020, 10:56pm

Thank you very much for your help! This is a production issue and any assistance is appreciated.

asad.ali · July 22, 2020, 3:13pm

@CoastalCIU

We will certainly resolve the recently logged issue. However, it will be investigated and resolved on a first-come, first-served basis. We will let you know as soon as we have further updates on the ticket.

aspose.notifier · February 15, 2024, 9:05pm

The issue you reported earlier (filed as PDFNET-48571) has been fixed in Aspose.PDF for .NET 24.2.