PDF to Excel converting

s.paul4 · July 19, 2017, 10:39am

Hi,

I just downloaded the Aspose.Total package with a temporary license. I am trying to convert a PDF to Excel. I tried doing it directly, but it did not work, and after going through the forum I found out I may have to save it as an XML file first. I tried that too, but the file does not get converted. I am using the code below:

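// Load the source PDF from the path entered in the input text field
com.aspose.pdf.Document oPDFDoc = new com.aspose.pdf.Document(txtInputText.getText());
//System.out.println(oPDFDoc.getFileName());
// Default Excel save options; the output file is given an .xml extension
com.aspose.pdf.ExcelSaveOptions excelsave = new com.aspose.pdf.ExcelSaveOptions();
oPDFDoc.save("C:\\Users\\Sam_paul\\Documents\\Sam.xml", excelsave);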

Is there something I am doing wrong?

Thank you,
Sam

codewarior · July 19, 2017, 1:08pm

@s.paul4,

Thanks for contacting support.

The reason we recommend saving the output with an .xml extension is that MS Excel then does not show a prompt message when opening the document. Can you please share the input PDF file so that we can test the scenario in our environment? We are sorry for the inconvenience.
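To keep the .xml naming consistent, the output path can be derived from the input path rather than typed by hand. The following is only a minimal sketch using the standard java.nio.file classes; the class name ExcelOutputPath and the method forInput are hypothetical helpers and not part of any Aspose API.

```java
import java.nio.file.Path;

/** Hypothetical helper (not an Aspose class): derives an .xml output path next to the input PDF. */
final class ExcelOutputPath {

    private ExcelOutputPath() {}

    /** Replaces the last extension of the file name with ".xml", or appends ".xml" if there is none. */
    static Path forInput(Path inputPdf) {
        String name = inputPdf.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // A dot at index 0 marks a hidden file such as ".profile", not an extension.
        String base = dot > 0 ? name.substring(0, dot) : name;
        // Keep the output in the same directory as the input.
        return inputPdf.resolveSibling(base + ".xml");
    }
}
```

The resulting path, converted with toString(), is what would be passed to the save call together with the ExcelSaveOptions instance.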

s.paul4 · July 19, 2017, 2:55pm

I did a bit of tinkering and found that the conversion only fails when I try to convert a scanned PDF to Excel. If I try to convert a normal PDF, it works just fine. Is there a way to extract the data from a scanned PDF, or at least convert it to a normal PDF?

I have attached a copy of the PDF and the scanned PDF I used.

Statement_Sam_Scanned_red.pdf (43.3 KB)
Statement_Sam_red.pdf (298.5 KB)

asad.ali · July 20, 2017, 9:14am

@s.paul4

Thanks for sharing the input documents.

We have also observed that the Aspose.Pdf API generates a blank XLS document for a scanned PDF. We have therefore logged an issue as PDFNET-43066 in our issue tracking system so that it can be corrected. As a workaround, you can extract the images from the PDF with Aspose.Pdf and run OCR on those images with Aspose.OCR to extract the text.

For further information, please visit the following helpful links.

We will look further into the details of the logged issue and keep you updated on the status of its resolution in this forum thread. Please be patient and spare us a little time.

We are sorry for the inconvenience.
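The workaround described above (render each page to an image, then run OCR on it) could be structured roughly as follows. This is only a sketch of the control flow: the class ScannedPdfTextSketch and its parameters renderPage and recognize are hypothetical placeholders, and the actual Aspose.Pdf rendering and Aspose.OCR recognition calls would have to be supplied by the caller for the versions in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.IntFunction;

/** Hypothetical sketch of the "render page, then OCR" workaround; no Aspose API is called here. */
final class ScannedPdfTextSketch {

    private ScannedPdfTextSketch() {}

    /**
     * @param pageCount  number of pages in the scanned PDF
     * @param renderPage placeholder for the rendering step: 1-based page number to image bytes
     * @param recognize  placeholder for the OCR step: image bytes to recognized text
     * @return the recognized text, one entry per page, in page order
     */
    static List<String> extractText(int pageCount,
                                    IntFunction<byte[]> renderPage,
                                    Function<byte[], String> recognize) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            byte[] image = renderPage.apply(page);   // step 1: page to image
            pages.add(recognize.apply(image));       // step 2: image to text
        }
        return pages;
    }
}
```

Keeping the two steps as separate functions makes it possible to test each one on its own, for example by saving the rendered image to disk and inspecting it before any OCR is attempted.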

s.paul4 · July 20, 2017, 2:30pm

I tried to do what is explained in the documentation you provided, but the output is junk values. The PDF is converted to JPEG (300 dpi), but when I try to read it using the OCREngine it does not work. I have attached a screenshot of the code I used and the console output.

OCR conversion_Aspose.png (342.4 KB)

Am I doing something wrong here?

ikram.haq · July 20, 2017, 9:09pm

@s.paul4,

We have evaluated the sample and found that the OCR operation returns invalid results. The issue has been logged in our system with ID OCRJAVA-776. As soon as an update is available on the issue, we will share it with you in this forum thread.

You can try again with different filters applied to improve accuracy. Furthermore, the complete feature set can be tested with a 30-day temporary license. Please follow the link for details on how to get a temporary license.
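As an illustration of the kind of preprocessing that may help, the sketch below converts an image to pure black and white with a fixed threshold before it is handed to OCR. Please note that this is generic preprocessing written with the standard java.awt.image classes and is not the Aspose.OCR filter API; the class name OcrPreprocessSketch and the threshold value are assumptions chosen for illustration.

```java
import java.awt.image.BufferedImage;

/** Hypothetical preprocessing helper using only the JDK; not part of Aspose.OCR. */
final class OcrPreprocessSketch {

    private OcrPreprocessSketch() {}

    /**
     * Returns a 1-bit copy of the image: pixels whose luma is below the threshold
     * become black, all others become white.
     *
     * @param threshold luma cut-off in the range 0..255 (128 is a common starting point)
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int width = src.getWidth();
        int height = src.getHeight();
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual 0.299/0.587/0.114 luma weights.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

The image can be loaded and saved with javax.imageio.ImageIO, and it may be worth trying a few threshold values on a sample page, since the best value depends on the scan.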

s.paul4 · July 24, 2017, 3:44pm

Do you have an update on any of these issues?

asad.ali · July 24, 2017, 7:32pm

@s.paul4

Thanks for your inquiry.

As we were only recently notified about the issue, I am afraid it is still pending review. We are sure that the relevant team will plan to investigate the issue as per their development schedule.

Please note that the product team is currently busy resolving other issues in the queue as well as adding new features and enhancements to the API, so I am sorry that we cannot share a reliable ETA for now.

However, as soon as we have definite updates regarding the resolution of the issue(s), we will let you know in this forum thread. Please be patient and spare us a little time.

We are sorry for the inconvenience.

s.paul4 · August 3, 2017, 2:00pm

Hi,

Is there any update on this?

imran.rafique · August 4, 2017, 3:17am

@s.paul4,
Thank you for the inquiry. Unfortunately, there are no updates on either of the linked ticket IDs, PDFNET-43066 and OCRJAVA-776. We have logged the ETA requests under the same ticket IDs and will let you know once significant progress has been made.

Best Regards,
Imran Rafique

aspose.notifier · December 5, 2019, 10:21pm

The issues you have found earlier (filed as PDFNET-43066) have been fixed in Aspose.PDF for .NET 19.12.