The output has tables without proper borders. When editing the tables, text overlaps with other columns.
I am attaching input pdf and output docx.
I am using aspose-pdf version 22.11 and Java 8.
Attachments: LetterTemplate_83807.pdf (29.5 KB), output.docx (29.9 KB)
Your issue is currently in the queue, pending analysis. Once our product team completes the analysis, we will be able to provide you with an estimate.
Aspose.PDF converters in Flow and TextBox recognition modes are unable to recognize tables. They are rendered just as text over images with borders.
We are now actively working on a new engine (currently in beta) that is activated when you use the EnhancedFlow recognition mode. In this mode, tables are converted to real Word tables. Please see the code snippet below:
import com.aspose.pdf.DocSaveOptions;
import com.aspose.pdf.Document;

// MyDir is a placeholder for the folder that holds the input and output files.
Document convertPDFDocumentToWord = new Document(MyDir + "LetterTemplate_83807.pdf");
DocSaveOptions docSaveOptions = new DocSaveOptions();
// Save as DOCX and select the EnhancedFlow engine so that tables become real Word tables.
docSaveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
docSaveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);
// docSaveOptions.setRelativeHorizontalProximity(2.5f); - not used in EnhancedFlow mode
// docSaveOptions.setRecognizeBullets(true); - always true in EnhancedFlow mode
convertPDFDocumentToWord.save(MyDir + "output.docx", docSaveOptions);
Please note that in this mode the output document has a minor issue: the first and second text lines are merged. We plan to fix this in the January 2023 release. The ticket ID for this issue is PDFJAVA-42361.
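If you want to confirm that a converted file really contains Word tables rather than text drawn over images, one rough check is to look inside the DOCX itself. The sketch below is not part of the Aspose API. It relies only on the fact that a DOCX file is a ZIP archive whose main part is word/document.xml, and it assumes the producer uses the conventional w: namespace prefix for table elements. The class name DocxTableCheck and its method names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not an Aspose class: counts <w:tbl> elements in a DOCX file.
final class DocxTableCheck {
    // Matches the start of a <w:tbl> element, but not <w:tblPr>, <w:tblGrid>, or </w:tbl>.
    // Assumption: the document uses the usual "w:" prefix for WordprocessingML elements.
    private static final Pattern TABLE_START = Pattern.compile("<w:tbl[\\s>]");

    /** Counts table elements (nested tables included) in the main document XML. */
    static int countTables(String documentXml) {
        Matcher matcher = TABLE_START.matcher(documentXml);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** Reads word/document.xml from a DOCX file, which is an ordinary ZIP archive. */
    static String readDocumentXml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Manual copy loop keeps the sketch compatible with Java 8.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int read;
                while ((read = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java DocxTableCheck <file.docx>");
            return;
        }
        int tables = countTables(readDocumentXml(args[0]));
        System.out.println(tables + " table element(s) found in " + args[0]);
    }
}
```

Running it against the output of the Flow or TextBox modes and then against the EnhancedFlow output should make the difference visible: a count of zero suggests the tables were rendered as text and graphics only. A regular expression is used here only to keep the sketch short; a proper XML parser would be more robust.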
@tahir.manzoor I have tried the EnhancedFlow recognition mode in Aspose.PDF for Java version 22.9, and it is detecting tables now, but there still seem to be a few issues:
Images are removed from the document
Cells that contain multiple lines are converted to multiple rows
Some paragraphs, especially numbered list items, are falsely interpreted as tables
Will these issues be fixed in the January release?
Also, your online PDF to Word converter (which uses Aspose.Words for .NET) is much more accurate than the Java version. It handles multiple lines per cell correctly, and even detects numbered and bulleted list item paragraphs. Would it be possible to port the .NET implementation to Java so that it has the same accuracy?
Could you please share the problematic output DOCX along with screenshots of the issues? We will then investigate and provide you with more information.
One example contains a table with multiple lines per cell, and the other example has numbered items incorrectly converted to tables. I have the output for both the Aspose .NET/online conversion and the Java conversion. The Java conversion was performed with Aspose.PDF for Java version 22.12.
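For anyone who wants to put numbers on the differences between two such outputs, a rough comparison is possible without opening Word. The sketch below is an illustration only and not part of any Aspose API. It counts table rows in word/document.xml and files under word/media/ in each DOCX, assuming the conventional w: prefix and assuming that embedded images are stored under word/media/. The class name DocxCompare and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not an Aspose class: summarizes rows and media files in DOCX outputs.
final class DocxCompare {
    // Matches the start of a <w:tr> element, but not <w:trPr> or <w:trHeight>.
    private static final Pattern ROW_START = Pattern.compile("<w:tr[\\s>]");

    /** Counts table rows across all tables in the main document XML. */
    static int countRows(String documentXml) {
        Matcher matcher = ROW_START.matcher(documentXml);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** Counts archive entries stored under word/media/ (assumed to be embedded images). */
    static int countImages(List<String> entryNames) {
        int count = 0;
        for (String name : entryNames) {
            if (name.startsWith("word/media/") && !name.endsWith("/")) {
                count++;
            }
        }
        return count;
    }

    /** Builds a one-line summary for a single DOCX file. */
    static String summarize(String docxPath) throws IOException {
        List<String> names = new ArrayList<>();
        String xml = "";
        try (ZipFile zip = new ZipFile(docxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                names.add(entry.getName());
                if (entry.getName().equals("word/document.xml")) {
                    xml = readAll(zip.getInputStream(entry));
                }
            }
        }
        return docxPath + ": rows=" + countRows(xml) + ", images=" + countImages(names);
    }

    private static String readAll(InputStream in) throws IOException {
        try (InputStream stream = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more DOCX paths, for example the .NET output and the Java output.
        for (String path : args) {
            System.out.println(summarize(path));
        }
    }
}
```

If the Java output reports noticeably more rows and fewer images than the .NET output for the same PDF, that would be consistent with the multi-line cell and missing image issues described earlier in this thread. The counts are only a coarse signal and do not replace a visual comparison.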
Thanks for sharing the sample files and further details. We observed the same issue in our environment while testing the scenario. Hence, a separate ticket, PDFJAVA-42410, has been logged in our issue tracking system. We will look into the details and keep you posted on the status of the fix. Please be patient and allow us some time.
I have been notified that PDFJAVA-42361 (text lines are merged) has been resolved. I tested this in Aspose.PDF for Java 23.1 by converting a PDF to a Word document using the EnhancedFlow recognition mode, but it still puts each line of a paragraph or bulleted list inside its own row. Please see the input PDF and output Word document attached.
Input: Example 1.pdf (126.8 KB)
Output: Example 1.docx (14.0 KB)
Another ticket, PDFJAVA-42456, has been logged in our issue management system to check the recently shared file and address the issue accordingly. We will let you know as soon as we have further updates. We apologize for the inconvenience.
Our converters in Flow and TextBox recognition modes are unable to recognize tables: they are rendered just as text over images with borders.
We are now actively working on a new engine (currently in beta) that is activated when you use the EnhancedFlow recognition mode. In this mode, tables are converted to real Word tables. Please see the code snippet below:
Document convertPDFDocumentToWord = new Document(MyDir + "LetterTemplate_83807.pdf");
DocSaveOptions docSaveOptions = new DocSaveOptions();
docSaveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
docSaveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);
// docSaveOptions.setRelativeHorizontalProximity(2.5f); - not used in EnhancedFlow mode
// docSaveOptions.setRecognizeBullets(true); - always true in EnhancedFlow mode
convertPDFDocumentToWord.save(MyDir + "output.docx", docSaveOptions);
Please try version 23.1 with the shared code snippet and let us know if the issue still persists.