PDF to DOCX Conversion Does Not Create Editable Tables and Outlines

Compared with conversion through the Microsoft.Office.Interop.Word app, the Aspose.Pdf Save method does not appear to render tables in their original format. The tables come out as picture frames with embedded text, so they cannot be edited as tables in the converted DOCX file. Numbered outlines also differ in indentation and spacing from the Word app conversion. Is this a limitation of the conversion, or are there other options available?

In the code below, I am using the RecognitionMode.Flow option:

		// Path to the source PDF
		string strSrcFilePath = @"C:\temp\InputDocument.pdf";

		using (Aspose.Pdf.Document theDoc = new Aspose.Pdf.Document(strSrcFilePath))
		{
			Aspose.Pdf.DocSaveOptions docSaveOptions = new Aspose.Pdf.DocSaveOptions();
			// Write DOCX rather than the legacy DOC format
			docSaveOptions.Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX;
			// Flow is the full recognition mode, intended to produce editable output
			docSaveOptions.Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow;
			// Turn on recognition of bulleted lists
			docSaveOptions.RecognizeBullets = true;
			theDoc.Save(@"C:\temp\AsposePDF_ConvertedDocument.docx", docSaveOptions);
		}

@cpdev,

Please send us your source PDF document so that we can investigate and share our findings with you.

AsposeTestDoc2.pdf (279.3 KB)

PDF document attached.

@cpdev,

We have converted your source PDF with the latest version 17.11 of Aspose.Pdf for .NET API. All text items are editable and numbered outlines are aligned. This is the output DOCX: AsposePDF_ConvertedDocument.zip (102.2 KB). Kindly review and let us know if you find any problematic behavior.

WordApp_ConvertedDocument.zip (32.2 KB)

All text items are editable, but there are two observations to note.

  1. The converted tables are not true Word tables. The Aspose version has a picture of the grid lines overlaid on the text cells, whereas the WordApp version contains a true table, which can be selected by hovering over its upper left corner.

  2. The numbered outline text has some artifacts where some text lines are not rendered properly on separate lines. This can be discovered by selecting a block of text and copy/pasting into WordPad. The WordApp version shows a more expected rendering.

Attached is the WordApp converted document for comparison. I am mostly concerned about the table formatting.
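
The first observation can also be checked without opening Word. A DOCX file is a ZIP package, and a true Word table is stored as a `w:tbl` element in `word/document.xml`. The following is a minimal sketch, not part of the original test code: the class name `DocxTableCheck` and the command-line argument are illustrative assumptions, and on .NET Framework it assumes references to System.IO.Compression and System.IO.Compression.FileSystem. A count of zero suggests the main document part contains no true tables.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: counts WordprocessingML table elements in a DOCX file.
static class DocxTableCheck
{
    // Main WordprocessingML namespace, the one bound to the "w" prefix.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static int CountTables(string docxPath)
    {
        // A DOCX file is a ZIP archive; the body lives in word/document.xml.
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in " + docxPath);

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);
                // Counts every w:tbl in the part, including nested tables.
                return xml.Descendants(W + "tbl").Count();
            }
        }
    }

    static void Main(string[] args)
    {
        // Usage: DocxTableCheck.exe <path to the extracted .docx>
        Console.WriteLine(CountTables(args[0]));
    }
}
```

Running this against both converted documents (after extracting them from the attached ZIP files) should show whether either output contains `w:tbl` elements at all.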

@cpdev,

We were able to reproduce both issues in our environment and have logged the following tickets in our issue tracking system:

PDFNET-43817: PDF to DOCX procedure do not render true Word tables
PDFNET-43818: PDF to DOCX - the incorrect rendering of the numbered outline text

We have linked your post to these tickets and will keep you informed regarding any available updates.

We are having the same issue. The overlaid table border image doesn't always cover the content properly, so an actual Word table with borders would be much better. I can't see the issues attached to this ticket. Do they have a status? Is this something being actively worked on? Is there an expected timeline for resolution? Many thanks.

@brotheredward,

Both tickets (PDFNET-43817 and PDFNET-43818) were logged recently and are pending analysis. Our product team will investigate them according to their development schedule. We recommend that you create a separate thread and share your problematic documents with complete details, including your code, so that we can investigate and share our findings with you. Even if the problems turn out to be the same, your scenario will be verified once the root cause is fixed.