Hello.
I have an .rtf document that I have to save as .docx. After that, I use Apache Tika to extract XHTML from the .docx. I am not using Aspose for this last step because I could not get it to produce clean HTML (without formatting).
The code is as follows:
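import com.aspose.words.*;

// Load the RTF source and strip any macros.
Document doc = new Document("D:\\Teste\\10932983.rtf");
doc.removeMacros();

// Save as DOCX with ISO 29500:2008 Transitional compliance.
OoxmlSaveOptions options = new OoxmlSaveOptions(SaveFormat.DOCX);
// Rendering-mode settings I experimented with (raw int constants).
options.setDmlEffectsRenderingMode(0);
options.setDmlRenderingMode(1);
options.setCompliance(OoxmlCompliance.ISO_29500_2008_TRANSITIONAL);

// Compatibility options applied to the document before saving.
doc.getCompatibilityOptions().setDisableOpenTypeFontFormattingFeatures(true);
doc.getCompatibilityOptions().optimizeFor(MsWordVersion.WORD_2013);
doc.save("D:\\Teste\\serah3.docx", options);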
The problem is that the .docx generated by Aspose does not follow the DrawingML convention of storing pictures in the pic:pic element (<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">). As a result, Tika does not recognize the pictures when it reads the .docx generated by Aspose.
Instead, Aspose writes the picture in a v:shape element. I guess this v:shape element is Microsoft-proprietary?
When I save the document with Microsoft Word, it works fine.
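To make the difference between the two files concrete, here is a minimal sketch of a check that counts pic:pic and v:shape elements in a .docx. It uses only the JDK. The class name DocxPictureCheck is hypothetical. It assumes the main document part is stored at word/document.xml (the usual location, although the package format does not strictly require it), and it does not look at headers, footers, or other parts. The VML namespace urn:schemas-microsoft-com:vml is the one v:shape normally belongs to.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: reports how pictures are encoded in a .docx file.
public class DocxPictureCheck {
    // Namespace of the DrawingML picture element (pic:pic).
    static final String PIC_NS = "http://schemas.openxmlformats.org/drawingml/2006/picture";
    // Namespace of VML elements such as v:shape.
    static final String VML_NS = "urn:schemas-microsoft-com:vml";

    // Counts elements with the given namespace and local name in the main part.
    // Assumption: the main document part lives at word/document.xml.
    static int count(Path docx, String namespace, String localName) throws Exception {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No word/document.xml in " + docx);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // match by namespace, not by prefix
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document xml = factory.newDocumentBuilder().parse(in);
                return xml.getElementsByTagNameNS(namespace, localName).getLength();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Print one summary line per .docx path given on the command line.
        for (String arg : args) {
            Path docx = Paths.get(arg);
            System.out.println(arg
                    + ": pic:pic=" + count(docx, PIC_NS, "pic")
                    + ", v:shape=" + count(docx, VML_NS, "shape"));
        }
    }
}
```

Running it on both the Aspose output and the copy saved by Microsoft Word should show which encoding each file uses. Matching on the namespace rather than the prefix also keeps v:shapetype from being counted as v:shape.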
I’ve tried different configurations of setDmlEffectsRenderingMode and setDmlRenderingMode without success.
I need help. Thanks in advance.