Saving PDF as Word or HTML Fails to Maintain Tables

I’ve noticed some strange, inconsistent behavior when trying to save a PDF containing tables as a Word or HTML document.

I create a simple Word document with tables and then save it as a PDF. I then read the PDF using Aspose and save it as either Word or HTML. In neither case are the tables maintained in the saved output: each cell becomes a paragraph. Note that Aspose.PDF does correctly identify the tables when I call the Aspose APIs to get all the tables from the PDF; they are just not preserved when saving to Word or HTML. Additionally, I used Acrobat Pro to verify that the PDF does contain the table tags used for accessibility. So it appears that Aspose is capable of reading the tables but does not write them when saving to Word or HTML.

Is Aspose working on a fix for this?

@juane3729,

Kindly send us the complete details of the scenario, including the source PDF (saved by Microsoft Office Word), the code, and a description of the problematic behavior. We will investigate and share our findings with you.

Here are the input file, the output file, and the code. Also, note that I opened input.pdf in Adobe Acrobat DC and verified that it contains accessibility table tags.

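    // Load the PDF that was saved from Word.
    Document pdfDocument = new Document(new FileInputStream("input.pdf"));

    // Convert to DOCX using flow-based recognition.
    DocSaveOptions saveOptions = new DocSaveOptions();
    saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOptions.setMode(DocSaveOptions.RecognitionMode.Flow);

    // In output.docx, each table cell comes out as a separate paragraph.
    pdfDocument.save("output.docx", saveOptions);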

input.pdf (11.3 KB)
output.docx.zip (13.2 KB)
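
For anyone reproducing this, a quick way to confirm that the converted file really has no tables (rather than relying on how Word renders it) is to look inside the DOCX package. The sketch below is not part of the original post and does not use Aspose. It assumes the main document part is stored at `word/document.xml` and uses the transitional WordprocessingML namespace, which is what Word writes by default. The class name `DocxTableCounter` and its methods are made up for illustration. It counts `w:tbl` elements, so a result of 0 for output.docx would match the "each cell becomes a paragraph" behavior described above.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DocxTableCounter {
    // Assumption: transitional WordprocessingML namespace (Word's default).
    public static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts every <w:tbl> element in the given document XML (nested tables included).
    public static int countTablesInXml(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(xml);
        return doc.getElementsByTagNameNS(W_NS, "tbl").getLength();
    }

    // Opens the DOCX as a ZIP archive and counts tables in word/document.xml.
    public static int countTablesInDocx(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // Buffer the entry so the XML parser cannot close the ZIP stream.
                    byte[] xml = zip.readAllBytes();
                    return countTablesInXml(new ByteArrayInputStream(xml));
                }
            }
        }
        throw new IOException("word/document.xml not found in package");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java DocxTableCounter <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("Tables found: " + countTablesInDocx(in));
        }
    }
}
```

Run it against the extracted output.docx (the attachment is zipped) with `java DocxTableCounter.java output.docx` on Java 11 or later.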

@juane3729,

We were able to reproduce these issues in our environment and have logged the following tickets in our bug tracking system:

PDFNET-43855: PDF to DOCX - the table is not being maintained
PDFNET-43856: PDF to HTML - the table is not being maintained

We have linked your post to these tickets and will keep you informed regarding any available updates.

The issue you reported earlier (filed as PDFNET-43855) has been fixed in Aspose.PDF for .NET 23.6.