PDF to DOCX conversion: tables are not converted properly (Java)

Hi Support Team,

I am currently using a temporary license for Aspose.PDF for Java before purchasing a full license.

This is my code:
Document convertPDFDocumentToWord = new Document("LetterTemplate_83807.pdf");
DocSaveOptions docSaveOptions = new DocSaveOptions();
docSaveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
docSaveOptions.setMode(DocSaveOptions.RecognitionMode.Flow);
docSaveOptions.setRelativeHorizontalProximity(2.5f);
docSaveOptions.setRecognizeBullets(true);

convertPDFDocumentToWord.save("output.docx", docSaveOptions);

The tables in the output do not have proper borders, and when I edit a table, the text overlaps other columns.
I am attaching the input PDF and the output DOCX.
I am using aspose-pdf version 22.11 and Java 8.
LetterTemplate_83807.pdf (29.5 KB)
output.docx (29.9 KB)

@mithiit007

We have logged this problem in our issue tracking system as PDFJAVA-42345. You will be notified via this forum thread once this issue is resolved.

We apologize for the inconvenience.

Do you have a timeline for analyzing or fixing the issue, so that we can discuss licensing and pricing with our clients?

@mithiit007

Your issue is currently in the queue, pending analysis. Once our product team completes the analysis, we will be able to provide you with an estimate.

@tahir.manzoor, please let me know once the analysis is complete.

@mithiit007

We will be sure to inform you once there is an update available on it. Thanks for your patience and understanding.

Hi @tahir.manzoor, the status of issue PDFJAVA-42345 is shown as resolved. Please update me.

@mithiit007

Aspose.PDF converters in Flow and TextBox recognition modes are unable to recognize tables. They are rendered just as text over images with borders.

We are actively working on a new engine (currently in beta) that is activated by the EnhancedFlow recognition mode. In this mode, tables are converted to real Word tables. Please see the code snippet below:

Document convertPDFDocumentToWord = new Document(MyDir + "LetterTemplate_83807.pdf");
DocSaveOptions docSaveOptions = new DocSaveOptions();
docSaveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
docSaveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);

// docSaveOptions.setRelativeHorizontalProximity(2.5f); - not used in EnhancedFlow mode
// docSaveOptions.setRecognizeBullets(true);            - always true in EnhancedFlow mode  

convertPDFDocumentToWord.save(MyDir + "output.docx", docSaveOptions);

Please note that in this mode the output document has a minor issue: the first and second text lines are merged. We plan to fix this in the January 2023 release. The ticket ID for this issue is PDFJAVA-42361.
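To check whether a converted file really contains Word tables rather than text drawn over border images, one option is to inspect the DOCX package directly. The following is only a minimal sketch using the JDK, not part of Aspose.PDF: the class name `DocxTableCheck` and its method are hypothetical, and it assumes the main document part is stored at `word/document.xml`, as Word normally writes it.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper (not an Aspose API): counts table elements in a DOCX file.
final class DocxTableCheck {
    // WordprocessingML main namespace; tables are <w:tbl> elements in this namespace.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the number of w:tbl elements in word/document.xml, or -1 if that part is absent. */
    static int countTables(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // The zip stream reports end-of-file at the end of the current entry.
                Document xml = factory.newDocumentBuilder().parse(zip);
                return xml.getElementsByTagNameNS(W_NS, "tbl").getLength();
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxTableCheck <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("tables: " + countTables(in));
        }
    }
}
```

Running it against the Flow output and the EnhancedFlow output of the same PDF gives a quick way to compare the two modes. Nested tables are counted as well.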

@tahir.manzoor I have tried the EnhancedFlow recognition mode in Aspose.PDF for Java version 22.9, and it is detecting tables now, but there still seem to be a few issues:

  • Images are removed from the document
  • Cells that contain multiple lines are converted to multiple rows
  • Some paragraphs, especially numbered list items, are falsely interpreted as tables

Will these issues be fixed in the January release?
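For anyone who wants to confirm the missing-images symptom without opening the file in Word, here is a rough sketch that lists the image parts in a DOCX. It uses only the JDK; the class name `DocxImageCheck` is made up for illustration, and it assumes images are stored under `word/media/`, which is the usual convention but not something the format strictly requires.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper (not an Aspose API): lists embedded media parts of a DOCX file.
final class DocxImageCheck {
    /** Returns the names of all zip entries under word/media/ (assumed image location). */
    static List<String> listMediaParts(InputStream docx) throws IOException {
        List<String> media = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().startsWith("word/media/")) {
                    media.add(entry.getName());
                }
            }
        }
        return media;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxImageCheck <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("media parts: " + listMediaParts(in));
        }
    }
}
```

An empty list for a PDF that clearly contains pictures would be consistent with the images having been dropped during conversion.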

Also, your online PDF to Word converter (which uses Aspose.Words for .NET) is much more accurate than the Java version. It handles multiple lines per cell correctly, and even detects numbered and bulleted list item paragraphs. Would it be possible to port the .NET implementation to Java so that it has the same accuracy?

@jacogericke

Could you please share the problematic output DOCX along with screenshots of issues? We will then investigate the issue and provide you more information on it.

@tahir.manzoor please see the examples attached: Examples.zip (234.6 KB)

One example contains a table with multiple lines per cell, and the other has numbered items incorrectly converted to tables. I have included the output of both the Aspose .NET/online conversion and the Java conversion. The Java conversion was performed with Aspose.PDF for Java version 22.12.

@jacogericke

Thanks for sharing the sample files and further details. We observed the same issue in our environment while testing the scenario, so we have logged a separate ticket, PDFJAVA-42410, in our issue tracking system. We will look into the details and keep you posted on the status of the fix. Please bear with us in the meantime.

We are sorry for the inconvenience.

The issues you have found earlier (filed as PDFJAVA-42361) have been fixed in Aspose.PDF for Java 23.1.

I have been notified that PDFJAVA-42361 (text lines are merged) has been resolved. I tested this in Aspose.PDF for Java 23.1 by converting a PDF to a Word document using the EnhancedFlow recognition mode, but it still puts each line of a paragraph or bulleted list inside its own table row. Please see the attached input PDF and output Word document.
Input: Example 1.pdf (126.8 KB)
Output: Example 1.docx (14.0 KB)
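The "one line per row" symptom can be made measurable by counting the rows of each table in the output. This is just a sketch under assumptions: the class `DocxRowCount` is hypothetical, it uses only the JDK, it assumes the main part is `word/document.xml`, and it only counts `w:tr` elements that are direct children of a `w:tbl`.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not an Aspose API): reports the row count of every table in a DOCX.
final class DocxRowCount {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one entry per w:tbl (document order) holding its number of direct w:tr children. */
    static List<Integer> rowsPerTable(InputStream docx) throws Exception {
        List<Integer> counts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document xml = factory.newDocumentBuilder().parse(zip);
                NodeList tables = xml.getElementsByTagNameNS(W_NS, "tbl");
                for (int i = 0; i < tables.getLength(); i++) {
                    int rows = 0;
                    // Walk direct children only, so rows of nested tables are not added here.
                    for (Node c = tables.item(i).getFirstChild(); c != null; c = c.getNextSibling()) {
                        if (W_NS.equals(c.getNamespaceURI()) && "tr".equals(c.getLocalName())) {
                            rows++;
                        }
                    }
                    counts.add(rows);
                }
                break; // the main part has been processed; stop reading the archive
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxRowCount <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("rows per table: " + rowsPerTable(in));
        }
    }
}
```

If a table in the source PDF has, say, three visible rows but the reported count is much higher, that points to multi-line cells being split into separate rows.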

@jacogericke

We have logged another ticket, PDFJAVA-42456, in our issue management system to investigate the recently shared file and address the issue accordingly. We will let you know as soon as we have further updates. We apologize for the inconvenience.

@mithiit007

Our converters in Flow and TextBox recognition modes are unable to recognize tables: they are rendered just as text over images with borders.

We are actively working on a new engine (currently in beta) that is activated by the EnhancedFlow recognition mode. In this mode, tables are converted to real Word tables. Please see the code snippet below:

Document convertPDFDocumentToWord = new Document(MyDir + "LetterTemplate_83807.pdf");
DocSaveOptions docSaveOptions = new DocSaveOptions();
docSaveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
docSaveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);

// docSaveOptions.setRelativeHorizontalProximity(2.5f); - not used in EnhancedFlow mode
// docSaveOptions.setRecognizeBullets(true);            - always true in EnhancedFlow mode  

convertPDFDocumentToWord.save(MyDir + "output.docx", docSaveOptions); 

Please try version 23.1 with the code snippet above and let us know if the issue still persists.