I have an .rtf document that I need to save as .docx. After that, I use Apache Tika to extract XHTML from the .docx. I am not using Aspose for this last step because I was not able to get clean HTML (without formatting) out of it.
The code is as follows:
Document doc = new Document("D:\\Teste\\10932983.rtf");
doc.removeMacros();

OoxmlSaveOptions options = new OoxmlSaveOptions(SaveFormat.DOCX);
options.setDmlEffectsRenderingMode(0);
options.setDmlRenderingMode(1);
options.setCompliance(OoxmlCompliance.ISO_29500_2008_TRANSITIONAL);

doc.getCompatibilityOptions().setDisableOpenTypeFontFormattingFeatures(true);
doc.getCompatibilityOptions().optimizeFor(MsWordVersion.WORD_2013);

doc.save("D:\\Teste\\serah3.docx", options);
The problem is that the .docx generated by Aspose does not follow the DrawingML convention of storing pictures in the pic:pic element (<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">), so Tika does not recognize the pictures when it reads the .docx generated by Aspose.
Aspose writes the picture in a v:shape element instead. I guess this v:shape element is Microsoft-proprietary?
When I save the document with Microsoft Word, it works fine.
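To compare the two files, a small check like the one below could count how many pic:pic and v:shape elements appear in word/document.xml of each .docx. This is only a rough sketch using the JDK's ZIP classes: the class and method names (DocxPictureCheck, countElements, readMainPart, report) are made up for illustration, and it assumes the conventional pic: and v: namespace prefixes, which a producer is not strictly required to use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Aspose or Tika.
final class DocxPictureCheck {

    // Counts opening tags with exactly this qualified name, e.g. "v:shape"
    // but not "v:shapetype". Assumes the usual namespace prefixes.
    static int countElements(String xml, String qname) {
        String open = "<" + qname;
        int n = 0;
        int i = xml.indexOf(open);
        while (i >= 0) {
            int end = i + open.length();
            if (end < xml.length()) {
                char c = xml.charAt(end);
                if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                    n++;
                }
            }
            i = xml.indexOf(open, end);
        }
        return n;
    }

    // A .docx is a ZIP package; the main body lives in word/document.xml.
    static String readMainPart(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals("word/document.xml")) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int r;
                    while ((r = zip.read(buf)) != -1) {
                        out.write(buf, 0, r);
                    }
                    return out.toString("UTF-8");
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    // Summarises both counts for the file at the given path.
    static String report(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            String xml = readMainPart(in);
            return "pic:pic=" + countElements(xml, "pic:pic")
                    + ", v:shape=" + countElements(xml, "v:shape");
        }
    }
}
```

Calling DocxPictureCheck.report("D:\\Teste\\serah3.docx") and then the same method on the copy saved by Microsoft Word should show whether the difference really is pic:pic versus v:shape.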
I’ve tried different configurations of setDmlEffectsRenderingMode and setDmlRenderingMode without success.
I need help. Thanks in advance.