Converting PDF to DOC with 4.5.0 Java: Issues and Concerns

Hello. I’m posting about PDF-to-DOC conversion in the latest 4.5.0 version of Aspose PDF.

In another thread, I mentioned that .docx conversion does not seem to be supported (out of range error on the SaveFormat), so for these examples I used conversion to .doc.

I started with two PDF files. They are attached. One is a shorter example, and one is longer. They are named accordingly.

For reference and comparison, I converted the shorter file using several desktop utilities. They all performed equivalently, so I chose one of those converted files and included it as well.

The following are some issues we have with the Aspose conversion process:

  • Basic paragraph handling: paragraphs in the .pdf are not converted into proper paragraphs. Text is converted into floating text boxes or frames in the .doc version. Every line has a hard line break at the end. This makes it impossible to reformat text. For example, take a paragraph in the converted doc and choose a different font size; the line breaks stay fixed and do not allow the text to wrap to fit the new size. Nor do paragraphs below flow up or down to accommodate the new font size.
    • Contrast this behavior with the version converted with desktop software. The paragraphs are in proper format and do not have hard-coded line breaks. This allows the user to apply styles and rearrange the text. Resizing text causes the remainder of the document to flow as expected.
  • Basic list handling: bulleted lists use strange characters and bizarre overlapping frames to simulate the ‘look’ of a bulleted list, but the lists cannot be manipulated. For example, if I want to insert a bullet between other bullets using e.g. MS Word, it cannot be done. The text moves down but the bullets stay fixed in a separate, horizontally overlapping frame.
    • Contrast this behavior with the version converted with desktop software. Bulleted lists convert to actual bulleted lists that can be manipulated and edited.
  • More advanced list handling: In the longer document (testLongerTemplateConvertedViaAspose.doc), sometimes lists are broken across multiple frames/text boxes, causing more misformatting. For example, look at how the list is broken into multiple frames between list items 4.8 and 4.9.
  • Simple graphics: A simple graphic did come over, but it is sized incorrectly and is floating in an absolutely positioned frame. This makes wrapping text or other functions affecting the flow of the document impossible to perform.
    • Contrast this with the desktop software conversion which inserts the image inline with the text as one would expect; adding lines above this image moves it down the page. The image is a proper element of the document flow.
  • Random characters: Turn on paragraph markings and look in the blue headers to see strange inserted characters in the converted version; they are indicated in MS Word with a (?).
  • Simple tables: The simple table with the figures did not translate at all; it is entirely misformatted in the resulting document.
    • Contrast this with the desktop software conversion which preserves all aspects of the table correctly, including font, merged cells, and simple borders.
  • Underline: In the longer document, notice how the headers (for example, 4.0) are only partially underlined in the converted version.
    • Contrast this with the desktop version which preserves underline as expected.
  • Performance: while the shorter document took only a few seconds, the longer document took over 90 seconds to convert. This is under ColdFusion 10. CPU usage remained high throughout the duration of the operation.
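
For anyone who wants to reproduce the timing figures, here is a minimal sketch of how the conversion could be timed from Java. ConversionTimer and timeMillis are hypothetical names I made up for illustration; the actual Aspose conversion call is left as a placeholder comment because the exact call is not shown in this thread. It measures wall-clock time only, not CPU usage.

```java
public class ConversionTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();   // monotonic clock, suitable for measuring intervals
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Placeholder (assumption): put the PDF-to-DOC conversion call here,
            // i.e. load the PDF and save it in DOC format.
        });
        System.out.println("Conversion took " + elapsed + " ms");
    }
}
```

Running this once for the shorter file and once for the longer file gives directly comparable numbers; a single run includes JVM and library warm-up, so repeating the measurement a few times would give a fairer picture.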

In short, most users presumably need PDF to DOCX conversion in order to change a document into an editable format. The Aspose Java PDF conversion attempts to make the resulting document “look” like it was converted, but the result is not readily editable, nor is it particularly accurate to the original version.

Does Aspose PDF for .NET perform a better conversion than this? If so, would it be possible to share the conversion of testLongerTemplate.pdf so that we can view the differences?

Obviously there are technical challenges converting PDF to DOCX; however, desktop software exists to do so, and it does the job very competently. I would expect Aspose PDF to Word conversion to offer similar fidelity, with performance optimized for server-based conversion.

This feature has been under development for several years. While we are glad that it has reached the point of being shared with the public, we are disappointed in the results because it does not accomplish the task of creating an editable DOCX document from the PDF version. I ask that the team consider giving improvements to this feature high priority, which would make Aspose.PDF a much stronger product.

Thank you.


Hi,


Thanks for contacting support.

We are working on this query and will get back to you soon. We are sorry for the delay and inconvenience.

Hi Chris,


Thanks for your detailed overview; it will definitely help us improve the PDF to DOC feature. We have reproduced the reported issues and logged them in our issue tracking system as follows. We will keep you updated on the progress of their resolution. Moreover, please find attached the PDF to DOC conversion produced with the Aspose.Pdf for .NET API.

PDFNEWJAVA-33929: Paragraph formatting issue
PDFNEWJAVA-33930: List handling issue
PDFNEWJAVA-33931: Extra space between list items
PDFNEWJAVA-33932: Graphic rendering issue
PDFNEWJAVA-33933: Random Characters in heading
PDFNEWJAVA-33934: Table rendering issue
PDFNEWJAVA-33935: Underline rendering issue
PDFNEWJAVA-33936: Performance issue

We are sorry for the inconvenience caused.

Best Regards,


Thank you for the .NET example, Tilal. It looks like the .NET version of Aspose PDF is equivalent to the Java version, with similar conversion issues, at least with this limited example.

For reference, there are several cloud-based services, such as http://www.pdfonline.com/pdf-to-word-converter/, that perform a highly accurate PDF to Word conversion. I used this one on the same document as above and the results were as good as those from the desktop software.

I hope that Aspose can one day provide a similar level of rendition by preserving document flow during conversion.

Thank you.

Hi,


Please note that Aspose.Pdf for Java 4.5.0 is the first release in which we have introduced the PDF to DOC conversion feature. Our development team is working hard to make this feature robust and mature enough to produce output files identical to the input PDF files.

Hello,

I have two comments/questions.

First, I have noticed another issue in converting PDF to Word with Aspose.PDF. We are using the latest version of Aspose.PDF as of today. Attached is a simple PDF and the converted Word doc. The image in the document is converted partially, but as you can see it gets cut off along its height. For wider images, this occurs along the width as well.

Second, are you able to provide an indication of ETA, or what the priority is on some of the other PDF-to-Word issues? These include:

PDFNEWJAVA-33929
PDFNEWJAVA-33930
PDFNEWJAVA-33932
PDFNEWJAVA-33931
PDFNEWJAVA-33933
PDFNEWJAVA-33934
PDFNEWJAVA-33935
PDFNEWJAVA-33936

I would appreciate knowing when we might hope to see some of them addressed, or whether they are currently being worked on.

Thank you.


Hi Chris,


Thanks for your inquiry. I am afraid your reported issues have not been completely resolved yet. Most of them will be resolved in the upcoming release, Aspose.Pdf for Java 9.0.0. Please note that it will be a ported version of Aspose.Pdf for .NET 9.0.0. Please find attached sample output generated using Aspose.Pdf for .NET 9.0.0; it is much improved.

Best Regards,

Hi Chris,

I have tested the scenario and was able to reproduce the problem described above. I have logged it separately as PDFNEWJAVA-34101 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the fix. Please be patient and allow us a little time. We are sorry for the inconvenience.

Hello. I have downloaded the new Aspose PDF Java 9.0 and attempted to convert the attached PDF to Word format.

The resulting document is clearly ‘wrong’ but I’m not sure if the underlying problem has been logged yet in the Aspose system. I’m not even sure how to describe it, other than to say the text is “all messed up”. Would you kindly review the before and after documents and let me know if a new issue should be logged, or if it’s already addressed by an outstanding tracking number?

Thank you.

Hi Chris,


Thanks for your inquiry. We have tested the PDF to DOC conversion scenario with the latest version of Aspose.Pdf for Java, noticed some issues, and logged them in our issue tracking system as follows. We will notify you as soon as they are resolved.

PDFNEWJAVA-34166: Split words
PDFNEWJAVA-34167: Bullet list rendering issue.
PDFNEWJAVA-34168: URL rendering issue.
PDFNEWJAVA-34169: Table data rendering issue

We are sorry for the inconvenience caused.

Best Regards,

The issues you have found earlier (filed as PDFNEWJAVA-33929) have been fixed in Aspose.Pdf for Java 9.3.0.


This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

The issues you have found earlier (filed as PDFNEWJAVA-33931) have been fixed in this update.



The issues you have found earlier (filed as PDFNEWJAVA-34166;PDFNEWJAVA-33933) have been fixed in Aspose.Pdf for Java 9.5.0.



The issues you have found earlier (filed as PDFNEWJAVA-34168) have been fixed in Aspose.Pdf for Java 9.5.2.



The issues you have found earlier (filed as PDFNEWJAVA-33934;PDFNEWJAVA-33932;PDFNEWJAVA-34167) have been fixed in Aspose.Pdf for Java 9.7.1.



The issues you have found earlier (filed as PDFNEWJAVA-34101) have been fixed in Aspose.Pdf for Java 10.4.0.


