TOC's end is not \u0015

I have to use POI.
I want to convert a Word document generated by Aspose to HTML with POI. However, in POI, the end of each TOC entry in the Aspose document is a single run, "\u0015\u0015". As a result, the HTML has only one <span>.
In a document saved by Microsoft Word, the end is two runs: "\u0015" and "\u0015". What can I do?
TOC entry in the Aspose document:
8.18一 一一一一一一一一 一一一:一一一一一一一一一一(苹果) 1
HTML generated by POI from the Aspose document:
<p class="p10"> <span> HYPERLINK \l "_Toc256000002" </span><span class="s4">8.18一 一一一一一一一一 一一一:一一一一一一一一一一(苹果)</span><a href="#_Toc256000002"><span> 1</span></a> </p>
The end of the entry:
image.png (12.2 KB)

Microsoft Word document:
HTML generated by POI:
<p class="p6"> <a href="#_Toc65677669"><span class="s4">8.18</span><span class="s4">一 一一一一一一一一 一一一:一一一一一一一一一一(苹果)</span><span> </span></a><a href="#_Toc65677669"><span>1</span></a> </p>
The end of the entry:
image.png (5.6 KB)
image.png (7.0 KB)
I need two end runs, but the Aspose document has only one.
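
The difference described above can be modelled on plain strings. The following is only a sketch: it does not use the POI or Aspose APIs, the class and method names (FieldMarkSplitter, splitFieldMarks) are made up for illustration, and whether splitting the runs this way would change POI's HTML output is an assumption that has not been tested. It relies only on the fact that Word marks a field's begin, separator and end with the characters \u0013, \u0014 and \u0015.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: models run texts as plain strings, not POI CharacterRun objects.
final class FieldMarkSplitter {
    // Word field control characters in the document text stream.
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    static boolean isFieldMark(char c) {
        return c == FIELD_BEGIN || c == FIELD_SEPARATOR || c == FIELD_END;
    }

    /** Splits run texts so that every field mark ends up alone in its own run. */
    static List<String> splitFieldMarks(List<String> runTexts) {
        List<String> out = new ArrayList<>();
        for (String text : runTexts) {
            StringBuilder plain = new StringBuilder();
            for (char c : text.toCharArray()) {
                if (isFieldMark(c)) {
                    // Flush any ordinary text collected before the mark.
                    if (plain.length() > 0) {
                        out.add(plain.toString());
                        plain.setLength(0);
                    }
                    out.add(String.valueOf(c));
                } else {
                    plain.append(c);
                }
            }
            if (plain.length() > 0) {
                out.add(plain.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One run that carries two field end marks, as in the Aspose document.
        List<String> runs = splitFieldMarks(Arrays.asList("" + FIELD_END + FIELD_END));
        System.out.println(runs.size() + " runs after splitting");
    }
}
```

A run list that already has one mark per run, as in the Microsoft Word document, passes through unchanged.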

@xl1,

Please ZIP and attach the following resources here for testing:

  • Your simplified Word document
  • Aspose.Words for Java 21.2 generated output HTML file showing the undesired behavior
  • Your expected HTML file showing the desired output. You can create this document manually by using MS Word.
  • A standalone, simple Java application (source code without compilation errors) that reproduces your current problem on our end. Please do not include the Aspose.Words JAR files in it, to reduce the file size.

As soon as you have these resources ready, we will start investigating your scenario and provide you with more information.

The Word document I generated with Aspose: aspose generate.zip (59.0 KB)
poigenerate/.html: the Word document converted to HTML by POI (changepoi2html)
openoffice.html: the Word document converted to HTML by OpenOffice 4.1.3
The Word document with its TOC updated by MS Office, which is my expected document:
aspose generate and MS word update toc.zip (108.0 KB)
The POI tool, which converts the Aspose document to HTML with POI and removes the page numbers in the TOC:

changepoi2html.zip (121.3 KB)
OpenOffice 4.1.3: Apache OpenOffice - Official Download
image.png (15.9 KB)
The page numbers are removed from the OpenOffice HTML by: deleteTOCpagenumformOpenOfficeTOC.zip (1.1 KB)

The POI problem is described above.
The OpenOffice problem is: How to set lang of Run?
The HTML from the Aspose document has one <span></span>, and the HTML from the MS Word document has two.
I need the Word document generated by Aspose to work well in both applications.
I want to convert the Aspose document to HTML and remove the page numbers.
I don't know what the difference between the two Word documents is. What can I do?
Thank you very much!
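
For the page-number removal step, a minimal sketch that works on the HTML text itself might look like the following. It is not the code inside changepoi2html.zip. The class name TocPageNumberStripper is hypothetical, and the pattern assumes the two shapes shown in this thread: the page number sits in a plain <span> inside its own <a href="#_Toc..."> link, with or without a leading tab or space.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing step for HTML produced by POI.
final class TocPageNumberStripper {
    // A TOC link whose only content is one plain span holding
    // optional whitespace followed by digits (the page number).
    private static final Pattern PAGE_NUMBER_LINK = Pattern.compile(
            "<a href=\"#_Toc\\d+\">\\s*<span>\\s*\\d+\\s*</span>\\s*</a>");

    /** Returns the HTML with every page-number link removed. */
    static String stripPageNumbers(String html) {
        return PAGE_NUMBER_LINK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String entry = "<span class=\"s4\">8.18</span>"
                + "<a href=\"#_Toc256000002\"><span>\t1</span></a>";
        System.out.println(stripPageNumbers(entry));
    }
}
```

Because the pattern accepts optional whitespace before the digits, it handles both the Aspose shape (tab and page number in one span) and the MS Word shape (page number alone). A heading that consists only of digits in a plain span inside its own link would also be removed, so a DOM-based filter may be safer for real documents.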

@xl1,

We are checking this scenario and will get back to you soon.

@xl1,

You shared a DOC file inside “aspose generate.zip” (see source.zip (42.6 KB)). It produced the undesired behavior when I converted it to HTML with the following simple Java code using POI 5.0.0:

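import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

// Load the binary DOC file with POI's HWPF module.
HWPFDocument wordDocument = new HWPFDocument(
        new FileInputStream("C:\\Temp\\226429\\in.doc"));

// The converter writes its result into an empty W3C DOM document.
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());

wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();

OutputStream outStream =
        new FileOutputStream("C:\\Temp\\226429\\poi out.html");

DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(outStream);

// Serialize the DOM as indented UTF-8 HTML.
TransformerFactory factory = TransformerFactory.newInstance();
Transformer serializer = factory.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");

serializer.transform(domSource, streamResult);

outStream.close();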

The problematic HTML produced by the above code is as follows:

<p class="p10">
<span> HYPERLINK \l "_Toc256000002" </span><span class="s4">8.18一 一一一一一一一一 一一一:一一一一一一一一一一(苹果)</span><a href="#_Toc256000002"><span>	1</span></a>
</p>

I then resaved this DOC with Aspose.Words after updating its fields:

// Document here is com.aspose.words.Document, not org.w3c.dom.Document.
Document doc = new Document("C:\\temp\\226429\\in.doc");
doc.updateFields();
doc.save("C:\\temp\\226429\\awjava-21.3 UpdateFields.doc");

I then ran the above POI code on the resaved file, which produced the following correct HTML:

<p class="p10">
<span> HYPERLINK \l "_Toc256000002" </span><span class="s4">8.18一 一一一一一一一一 一一一:一一一一一一一一一一(苹果)</span><span>	</span><a href="#_Toc256000002"><span>1</span></a>
</p>

So, please try updating the TOC fields with the latest 21.3 version of Aspose.Words for Java, and then pass the resaved document to POI to convert it to HTML.
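
To confirm that a given file needs this resave, the two HTML outputs above can be told apart with a simple text check. This is only a sketch: the class name TocShapeCheck is hypothetical, and it assumes the POI output keeps the shapes shown above, where the problematic file puts the tab and the page number in the same span inside the TOC link.

```java
import java.util.regex.Pattern;

// Hypothetical check on the HTML text produced by POI.
final class TocShapeCheck {
    // A TOC link whose span starts with whitespace (the tab) and then
    // holds the page number, as in the problematic output.
    private static final Pattern FUSED_PAGE_NUMBER = Pattern.compile(
            "<a href=\"#_Toc\\d+\"><span>\\s+\\d+</span></a>");

    /** Returns true if any TOC link still has the tab fused with the page number. */
    static boolean hasFusedPageNumber(String html) {
        return FUSED_PAGE_NUMBER.matcher(html).find();
    }

    public static void main(String[] args) {
        String problematic = "<a href=\"#_Toc256000002\"><span>\t1</span></a>";
        System.out.println(hasFusedPageNumber(problematic));
    }
}
```

If the check returns true for the POI output, the source DOC has most likely not had its fields updated yet.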