Aspose.Pdf for .NET: convert a PDF to XLS

athnetix · May 21, 2014, 11:08am

Here’s my question. I would like to convert a PDF document to XLS format with Aspose.Pdf for .NET. I have used the following code:

Dim doc As Aspose.Pdf.Document = New Aspose.Pdf.Document("C:\HelloWorld.pdf")
' Instantiate the ExcelSaveOptions object
Dim excelsave As Aspose.Pdf.ExcelSaveOptions = New ExcelSaveOptions()
' Save the output in XLS format
doc.Save("c:/HelloWorld.xls", excelsave)

This code converts SOME PDF documents to XLS.

Can you tell me why other PDF documents don’t convert?

Is it due to security on the PDF document that doesn’t allow access to the data in the PDF document?

If so, is there a way around this?

Also, can the PDF document data be changed or manipulated prior to saving as XLS?

I have attached the PDF document that I couldn’t convert to XLS format.

Thanks

tilal.ahmad · May 21, 2014, 11:34pm

Hi David,

We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 9.2.1, we have managed to reproduce the reported issue and logged it in our bug tracking system as PDFNEWNET-36948 for further investigation and resolution. Moreover, I am afraid there is no workaround at the moment. We will notify you via this thread as soon as it is resolved.

Please feel free to contact us for any further assistance.

Best Regards,

tilal.ahmad · June 5, 2014, 11:46am

Hi David,

Thanks for your patience. We have investigated further and found that your document is a non-searchable PDF. It doesn’t contain text, so it can’t be converted to XLS directly. To make a PDF document searchable, some kind of OCR (optical character recognition) software is needed. The searchable PDF can then be converted to XLS.
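
One quick way to check whether a PDF is searchable before attempting the conversion is to extract its text and see whether anything comes back. The following is only a minimal sketch: it assumes the Aspose.Pdf assembly is referenced, and the class name TextLayerCheck, the helper HasText and the fallback file name input.pdf are hypothetical names chosen for illustration. An empty result is a heuristic hint that the pages hold only images, not a guarantee.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TextLayerCheck
{
    // Hypothetical helper: true when the extracted text holds
    // at least one non-whitespace character.
    public static bool HasText(string extracted)
    {
        return !string.IsNullOrWhiteSpace(extracted);
    }

    public static void Main(string[] args)
    {
        // "input.pdf" is a placeholder; pass the real path as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        Document doc = new Document(path);

        // Collect the text of every page in the document.
        TextAbsorber absorber = new TextAbsorber();
        doc.Pages.Accept(absorber);

        Console.WriteLine(HasText(absorber.Text)
            ? "Text found: the PDF appears to be searchable."
            : "No text found: the PDF is probably image-only and needs OCR first.");
    }
}
```

If the check reports no text, the document most likely needs an OCR step before it can be saved as XLS.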

As a workaround, you can use the free Google Tesseract OCR application to convert your non-searchable PDF document to a searchable one, and then convert it to XLS using Aspose.Pdf. Please check the following code snippet.

Please install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; after that you will have the tesseract.exe console application.

Below is a usage example:

// Required namespaces
using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

private void DoWork()
{
    Document doc = new Document(@"TestPDF.pdf");
    // Run OCR on the page images; the callback returns the recognized text as hOCR
    doc.Convert(CallBackGetHocr);
    doc.Save("output.xls", new ExcelSaveOptions());
}

private string CallBackGetHocr(System.Drawing.Image img)
{
    // Working directory for the temporary image and the OCR output; it must already exist
    string dir = @"c:\tmp\";
    img.Save(dir + "test.jpg");

    // tesseract.exe must be on the PATH or next to the application
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = dir + "test.jpg " + dir + "out hocr";

    using (Process p = new Process())
    {
        p.StartInfo = info;
        p.Start();
        p.WaitForExit();
    }

    // Some Tesseract versions write out.hocr instead of out.html; adjust the name if needed
    return File.ReadAllText(dir + "out.html");
}

Please feel free to contact us for any further assistance.

Best Regards,

aspose.notifier · July 9, 2014, 7:59am

The issues you have found earlier (filed as PDFNEWNET-36948) have been fixed in Aspose.Pdf for .NET 9.4.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.