20.10 PDF to EXCEL conversion very poor column detection compared to 20.1

michael.sommers · October 28, 2020, 4:34pm

Version 20.1 locates columns pretty well. Version 20.10 is not nearly as good. Attached are the code, a 20.1 output, and a 20.10 output.
AsposeCodeSample.png (60.9 KB)
AsposeRev20.1.png (91.5 KB)
AsposeRev20.10.png (46.1 KB)

asad.ali · October 28, 2020, 9:58pm

@michael.sommers

Would you please also share the sample PDF file for our reference? We will log an issue in our issue tracking system and share the ID with you.

michael.sommers · October 29, 2020, 6:54pm

Here is a PDF that shows that the 20.1 column detection works much better than the 20.10 column detection. If you can’t change it back to be the way it used to be, can you give us a property to set to get it to work the better way?
Example.pdf (59.5 KB)

asad.ali · October 30, 2020, 7:58am

@michael.sommers

In newer versions, we have introduced a new conversion engine. Please try the following code snippet to obtain better conversion results, and let us know if you still face any issues:

Document document = new Document(dataDir + "Example.pdf");
ExcelSaveOptions saveOptions = new ExcelSaveOptions();
saveOptions.MinimizeTheNumberOfWorksheets = true;
saveOptions.ConversionEngine = ExcelSaveOptions.ConversionEngines.NewEngine;
document.Save(dataDir + "output.xls", saveOptions);

michael.sommers · October 30, 2020, 3:49pm

ExcelSaveOptions.ConversionEngines.NewEngine is the default, and is the cause of the problem. The new engine does a poor job of detecting columns because it gets distracted by report headers above the column formatting. Column detection is the most important function of PDF-to-Excel conversion; we chose Aspose over the competition because of the good column detection of the legacy engine. The poor performance of the new engine will cost you sales.

asad.ali · October 30, 2020, 8:26pm

@michael.sommers

The attached output file was generated using the legacy engine:

saveOptions.ConversionEngine = ExcelSaveOptions.ConversionEngines.LegacyEngine;

output.zip (7.3 KB)

Would you please check it and let us know if it also does not suit your requirements? We will work on improving the conversion to make it suitable for you.
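
For a side-by-side check, the two engines can be run against the same file in one pass. The following is only a sketch, assuming Aspose.PDF for .NET 20.10 as discussed in this thread; the dataDir value, the class name, and the output file names are placeholders, not part of the original sample:

```csharp
using System;
using System.IO;
using Aspose.Pdf;

class EngineComparison
{
    static void Main()
    {
        // Placeholder folder (assumption): point this at the directory holding Example.pdf.
        string dataDir = @"C:\Temp\";

        var engines = new[]
        {
            ExcelSaveOptions.ConversionEngines.LegacyEngine,
            ExcelSaveOptions.ConversionEngines.NewEngine
        };

        foreach (var engine in engines)
        {
            // Reload the document for each run so the two conversions are independent.
            using (var document = new Document(Path.Combine(dataDir, "Example.pdf")))
            {
                var saveOptions = new ExcelSaveOptions
                {
                    MinimizeTheNumberOfWorksheets = true,
                    ConversionEngine = engine
                };

                // Writes output_LegacyEngine.xls and output_NewEngine.xls.
                string outputPath = Path.Combine(dataDir, $"output_{engine}.xls");
                document.Save(outputPath, saveOptions);
                Console.WriteLine($"Saved {outputPath}");
            }
        }
    }
}
```

Opening the two output files next to each other should make it easier to see which columns each engine merges or splits.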

michael.sommers · October 30, 2020, 11:11pm

Thank you! While this is much better than 20.10, there are two columns that are not separated. We have to manually correct all of these. I’ve highlighted the cells in question. If we could fix it without messing up anything else it would be great.
OutputHighlighted.png (22.7 KB)

asad.ali · November 1, 2020, 8:08pm

@michael.sommers

Thanks for your feedback.

We have logged an issue as PDFNET-48975 in our issue tracking system for the conversion problems in the latest version of the API. We will work on resolving it and let you know as soon as a fix is available. Please be patient and spare us some time.

We are sorry for the inconvenience.