Problem with DrawingML (conversion from RTF to DOCX)

Hello.

I have an .rtf document that I need to save as .docx. After that, I use Apache Tika to extract XHTML from the .docx. I am not using Aspose for this last step because I was not able to get clean HTML (without formatting) from it.

The code is as follows:

import com.aspose.words.*;

// Load the RTF source and strip any macros.
Document doc = new Document("D:\\Teste\\10932983.rtf");
doc.removeMacros();

OoxmlSaveOptions options = new OoxmlSaveOptions(SaveFormat.DOCX);
// Raw int values of the DmlEffectsRenderingMode and DmlRenderingMode constants.
options.setDmlEffectsRenderingMode(0);
options.setDmlRenderingMode(1);
options.setCompliance(OoxmlCompliance.ISO_29500_2008_TRANSITIONAL);
// Compatibility options are set on the document itself, not on the save options.
doc.getCompatibilityOptions().setDisableOpenTypeFontFormattingFeatures(true);
doc.getCompatibilityOptions().optimizeFor(MsWordVersion.WORD_2013);
doc.save("D:\\Teste\\serah3.docx", options);

The problem is that the .docx generated by Aspose does not follow the DrawingML convention of storing pictures in the pic:pic element (<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">). As a result, Tika does not recognize the pictures when reading the .docx generated by Aspose.

Aspose writes the picture as a v:shape element instead. I guess this v:shape element is Microsoft proprietary?
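
To confirm which markup a given .docx uses, its main document part can be inspected directly. The following is only a rough sketch using the Java standard library, not anything from Aspose or Tika: the class name DocxPictureCheck and its helper methods are made up for illustration, and it assumes the pictures live in word/document.xml (headers and footers are not checked). It counts pic:pic elements in the DrawingML picture namespace and v:shape elements in the VML namespace (urn:schemas-microsoft-com:vml).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: counts DrawingML and VML picture elements in a .docx.
public class DocxPictureCheck {
    public static final String PIC_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/picture";
    public static final String VML_NS = "urn:schemas-microsoft-com:vml";

    // Parse XML with namespace awareness so lookups match on namespace URI,
    // not on whatever prefix the producer happened to choose.
    public static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(in);
    }

    // Count all elements with the given namespace URI and local name.
    public static int count(Document doc, String namespace, String localName) {
        return doc.getElementsByTagNameNS(namespace, localName).getLength();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the .docx file (a ZIP archive).
        try (ZipFile zip = new ZipFile(args[0])) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + args[0]);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = parse(in);
                System.out.println("pic:pic elements: " + count(doc, PIC_NS, "pic"));
                System.out.println("v:shape elements: " + count(doc, VML_NS, "shape"));
            }
        }
    }
}
```

Running it against both the Aspose output and the Word output should show whether the pictures are stored as pic:pic or as v:shape in each file.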

When I save the document with Microsoft Word, it works fine.

I’ve tried different configurations of setDmlEffectsRenderingMode and setDmlRenderingMode without success.

I need help. Thanks in advance.

Hi Alessandra,


Thanks for your inquiry. To ensure a timely and accurate response, please attach the following resources here for testing:

  • Your input RTF document
  • Aspose.Words generated output DOCX document showing the undesired behavior
  • MS Word generated DOCX document showing the correct behavior
  • XHTML files extracted from MS Word and Aspose.Words generated DOCX files.

As soon as you have these resources ready, we’ll start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip them and click the ‘Reply’ button, which will bring you to the reply page. At the bottom of that page, you can include attachments with your post by clicking the ‘Add/Update’ button.

Best regards,

Thank you for the reply.

The files are attached.

Check line 408 of xhtml_generated_from_word_doc.html. It shows:
Localização do Pará no Brasil

Tika identifies this image from the following element of the .docx:
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
(Tika searches the namespace http://schemas.openxmlformats.org/drawingml/2006/picture and the pic:pic element)

In the XHTML generated from Aspose’s .docx, there is no img element.

Thank you for your attention.

Hi Alessandra,


Thanks for your inquiry. We tested the scenario and managed to reproduce the same problem on our end. We have logged this problem in our issue tracking system as WORDSNET-13719. Our product team will look further into the details of this problem, and we will keep you updated on the status of the fix. We apologize for the inconvenience.

Best regards,

What is the status of this issue? Has it been fixed?


Thanks in advance.

Hi Alessandra,


Thanks for your inquiry. Unfortunately, this issue is not resolved yet. It is currently in the queue, pending analysis. We will let you know once it is resolved. Sorry for the inconvenience.

Best regards,

Hi Alessandra,


Thanks for your inquiry. After an initial test with Aspose.Words for Java 17.1.0, we were unable to reproduce this issue on our end (see attached DOCX output).

We suggest that you upgrade to the latest version of Aspose.Words. You can download it from the following link:

Hope this helps.

Best regards,

The issues you found earlier (filed as WORDSNET-13719) have been fixed in the Aspose.Words for .NET 17.2.0 update and the Aspose.Words for Java 17.2.0 update.

