Conversion of PDF to Word, table support?

drcs · September 3, 2017, 4:32pm

Hi there,

The PDF to Word conversion gives inferior results compared to, say, the PDF to Word output from Adobe Acrobat. We’ve tried using Flow mode, but the results were not much different. Specifically, we’re looking at converting documents that contain a mix of headings, paragraphs and tables, and retaining that structure in the converted Word document. At the moment it appears that tables are converted to images.

We have been testing with the trial version, as we have a very specific use case: converting a large number of PDFs to Word for content editing purposes.

I’ve attached the sample PDF and the DOCX converted using Aspose.PDF for Java.

test.pdf (509.3 KB)
samples.zip (586.7 KB)

Many thanks,
Dave

imran.rafique · September 4, 2017, 5:38am

@drcs,
We have converted your source PDF to DOCX, and we can find tables in the output DOCX file Test_out_flowDOCX.zip (68.0 KB). The look and feel of the output DOCX file is the same as that of the source PDF.

However, when we select an image inside a table cell, we notice that the image is too large and also picks up the style of the table rows. We have logged an investigation under the ticket ID PDFJAVA-37039 in our issue tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.

drcs · September 4, 2017, 7:09am

Hi Imran,

Thanks for the response. The converted document shows the same results. The table is not a table, it’s a picture, at least in the document sent over (see the screenshot - I removed the inserted image from the cell to test, but the result was the same):

Screen Shot 2017-09-04 at 09.24.03.png (28.0 KB)

I know they look the same, but I was hoping there was a way to convert tables from the PDF to proper tables in the Word document for editing purposes.
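For anyone who wants to check this without opening Word, here is a rough sketch that counts table elements and picture elements in the converted file. It uses only the Java standard library, not Aspose. The class and method names are my own, and it assumes the main document part is stored at word/document.xml and uses the conventional "w:" namespace prefix, which is typical for DOCX files but not guaranteed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of Aspose.PDF): inspects the raw XML of a DOCX.
public class DocxTableCheck {

    // Counts non-overlapping occurrences of a literal marker in the text.
    public static int count(String text, String marker) {
        int n = 0;
        int i = text.indexOf(marker);
        while (i >= 0) {
            n++;
            i = text.indexOf(marker, i + marker.length());
        }
        return n;
    }

    // Counts table elements. Matching "<w:tbl>" and "<w:tbl " avoids
    // counting related elements such as <w:tblPr> or <w:tblGrid>.
    public static int countTables(String documentXml) {
        return count(documentXml, "<w:tbl>") + count(documentXml, "<w:tbl ");
    }

    // Counts picture containers: DrawingML (<w:drawing>) and legacy VML (<w:pict>).
    public static int countDrawings(String documentXml) {
        return count(documentXml, "<w:drawing>") + count(documentXml, "<w:drawing ")
                + count(documentXml, "<w:pict>") + count(documentXml, "<w:pict ");
    }

    // Reads the main document part from the DOCX (a ZIP archive).
    // Assumption: the part is named word/document.xml.
    public static String readDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path docx = Paths.get(args.length > 0 ? args[0] : "testdocs/test.docx");
        String xml = readDocumentXml(docx);
        System.out.println("tables:   " + countTables(xml));
        System.out.println("drawings: " + countDrawings(xml));
    }
}
```

If the table count is zero while the drawing count is not, the tables were most likely written out as pictures. This is only a text scan, so treat the numbers as a quick indication rather than a full parse of the document.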

The conversion code I am using is pretty basic (taken from the examples):

import com.aspose.pdf.DocSaveOptions;
import com.aspose.pdf.Document;
import com.aspose.pdf.SaveFormat;

public class ConvertPDFToDOCOrDOCXFormat {

    private static final String dataDir = "testdocs/";

    public static void main(String[] args) {
        usingTheDocSaveOptionsClass();
    }

    public static void usingTheDocSaveOptionsClass() {
        // Path of the input PDF document
        String filePath = dataDir + "test.pdf";
        // Open the PDF by instantiating the Document object
        Document document = new Document(filePath);
        // Create the DocSaveOptions object
        DocSaveOptions saveOption = new DocSaveOptions();
        // Set the output format to DOCX
        saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
        // Set the recognition mode to Flow
        saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
        // Set the relative horizontal proximity to 2.5
        saveOption.setRelativeHorizontalProximity(2.5f);
        // Enable bullet recognition during the conversion process
        saveOption.setRecognizeBullets(true);
        // Save the resulting DOCX file
        document.save(dataDir + "test.docx", saveOption);
    }

}

Maybe I’ve missed something in the configuration of DocSaveOptions?

Thanks,
Dave.

imran.rafique · September 4, 2017, 8:00am

@drcs,
We were able to reproduce the issue by running your code. We have recorded your concern under the same ticket ID PDFJAVA-37039 in our bug tracking system. We will let you know once significant progress has been made in this regard.

Lemjid · March 8, 2019, 2:17pm

@imran.rafique,
Is there a solution to the table problem yet?
Please, I need an answer quickly.

asad.ali · March 8, 2019, 7:20pm

@Lemjid

We regret to share that the issue logged earlier is not yet resolved. Please note that this is a known issue in the API, and we definitely plan to resolve it in the future. However, due to other high-priority issues in the queue, we cannot promise to resolve it any time soon. As soon as we make significant progress towards a resolution, we will let you know. Please spare us a little time.

We are sorry for the inconvenience.