Unable to read text from PDF document

kamalkishore2014 · June 7, 2016, 4:55am

I am trying to read the contents of one of my PDFs, which is a multi-language document. I tried both PdfExtractor and TextAbsorber, but each gives the error “Index was outside the bounds of the array” at the line “extractor.ExtractText(Encoding.ASCII)”. Please find the attached document and, below, the code I used to extract the PDF content:

using System.IO;
using System.Reflection;
using System.Text;
using Aspose.Pdf.Facades;

private static string GetPdfFileContents(string fileName)
{
    PdfExtractor extractor = new PdfExtractor();

    // bind the PDF file to the extractor object
    extractor.BindPdf(fileName);

    return GetPdfFileContents(extractor);
}

private static string GetPdfFileContents(PdfExtractor extractor)
{
    string contents = string.Empty;

    // temporary output location next to the executing assembly
    string tempPath = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "Temp");
    string outFilePath = Path.Combine(tempPath, "temptOut.txt");

    if (!Directory.Exists(tempPath))
    {
        Directory.CreateDirectory(tempPath);
    }

    // extract all text from the PDF (the reported exception is thrown here)
    extractor.ExtractText(Encoding.ASCII);

    // save the extracted text to a text file
    extractor.GetText(outFilePath);

    // read the text back and clean up the temporary file
    contents = System.IO.File.ReadAllText(outFilePath);
    System.IO.File.Delete(outFilePath);

    return contents;
}

tilal.ahmad · June 7, 2016, 11:54pm

Hi Kamal,

Thanks for your inquiry. I tested your scenario with the shared document using Aspose.Pdf for .NET 11.7.0 and was able to reproduce the reported exception. For further investigation, I have logged an issue in our issue tracking system as PDFNEWNET-40903 and linked your request to it. We will keep you updated on the issue status via this thread.

We are sorry for the inconvenience caused.

aspose.notifier · April 5, 2019, 9:39pm

The issue you found earlier (filed as PDFNET-40903) has been fixed in Aspose.PDF for .NET 19.4.