HTML to Word conversion: some styles differ

ZhonghaoSun · May 21, 2024, 10:56am

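Version: 23.8
Programming language: Java

Style problems:
1. Some headings turn blue
2. Headings gain an extra 【一】
3. A black rectangle appears on some headings

Screenshot of the problem:
eteams_2024-05-21_18-53-12.jpg (75.0 KB)

Converted Word file:
文件转换样式测试-convert.zip (12.0 KB)

Conversion code: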

public void htmlToWord(String in, String out) throws Exception {
    Document html = new Document(in);
    html.save(out, SaveFormat.DOCX);
}

Can this be solved by changing the conversion settings?

vyacheslav.deryushev · May 21, 2024, 3:53pm

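@ZhonghaoSun The file you provided is consistent with the settings in the HTML file. For the list, you have:

-aw-list-number-styles:'chineseCountingThousand decimal'

This is what produces the 【一】.

<span style="font-family:'Times New Roman'; font-size:0pt; font-weight:normal; background-color:#000000">1.1.1.3</span>

This is what causes the black rectangle on some headings.

There is no Aspose.Words problem here; everything works as expected. You need to either change the HTML document or use the following code: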

Document doc = new Document("input.html");

for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (para.getListFormat().isListItem()) {
        ListLevel listLevel = para.getListFormat().getListLevel();
        listLevel.getFont().setColor(Color.BLACK);
        if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_2) {
            listLevel.setNumberStyle(NumberStyle.ARABIC);
            listLevel.setNumberFormat("\u0001.\u0001、");
        }
        if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_3) {
            listLevel.setNumberStyle(NumberStyle.ARABIC);
            listLevel.setNumberFormat("\u0001.\u0001.\u0002、");
        }
        if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_4) {
            listLevel.setNumberStyle(NumberStyle.ARABIC);
            listLevel.setNumberFormat("\u0001.\u0001.\u0002.\u0003、");
            listLevel.getFont().setSize(10.5);
            listLevel.getFont().setBold(true);
            listLevel.getFont().getShading().clearFormatting();
        }
    }
}
doc.updateListLabels();

doc.save("output.docx");
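To check whether an HTML file carries the settings quoted in this reply before converting it, a plain text scan may be enough. The following is only a sketch using the Java standard library; the class name `ListStyleHints`, the method `findHints`, and the hint strings are hypothetical and are not part of Aspose.Words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ListStyleHints {
    // Hypothetical hint labels used only by this sketch.
    static final String CHINESE_NUMBERING = "chineseCountingThousand list numbering";
    static final String ZERO_FONT_SIZE = "font-size:0pt";
    static final String BLACK_BACKGROUND = "background-color:#000000";

    // The list attribute quoted in the reply, with chineseCountingThousand inside the quotes.
    private static final Pattern CHINESE_NUMBERING_PATTERN = Pattern.compile(
            "-aw-list-number-styles\\s*:\\s*'[^']*chineseCountingThousand");
    // A font size of exactly 0pt (does not match 10pt or 0.5pt).
    private static final Pattern ZERO_FONT_SIZE_PATTERN =
            Pattern.compile("font-size\\s*:\\s*0pt");
    // A solid black background, as in the quoted span.
    private static final Pattern BLACK_BACKGROUND_PATTERN =
            Pattern.compile("background-color\\s*:\\s*#000000");

    /** Returns one hint per pattern that occurs anywhere in the HTML text. */
    public static List<String> findHints(String html) {
        List<String> hints = new ArrayList<>();
        if (CHINESE_NUMBERING_PATTERN.matcher(html).find()) {
            hints.add(CHINESE_NUMBERING);
        }
        if (ZERO_FONT_SIZE_PATTERN.matcher(html).find()) {
            hints.add(ZERO_FONT_SIZE);
        }
        if (BLACK_BACKGROUND_PATTERN.matcher(html).find()) {
            hints.add(BLACK_BACKGROUND);
        }
        return hints;
    }

    public static void main(String[] args) {
        // The span quoted in the reply above.
        String sample = "<span style=\"font-family:'Times New Roman'; font-size:0pt; "
                + "font-weight:normal; background-color:#000000\">1.1.1.3</span>";
        for (String hint : findHints(sample)) {
            System.out.println(hint);
        }
    }
}
```

The scan only reports that the patterns occur somewhere in the file; it does not tell which heading they belong to, so the HTML still has to be inspected by hand.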

ZhonghaoSun · May 22, 2024, 1:39am

@vyacheslav.deryushev
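We found on our side that the HTML file used here was itself produced by Aspose from a Word file.
Here are the complete steps:

Original Word file:
文件转换样式测试-word原文件.zip (30.6 KB)

1. Convert the Word file to HTML
Code used: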

String wordPath = "D:\\XXXXX\\文件转换样式测试.docx";
LoadOptions loadOptions = new LoadOptions();
Document doc = new Document(wordPath, loadOptions);
HtmlSaveOptions saveOptions = new HtmlSaveOptions();
saveOptions.setExportImagesAsBase64(true);
saveOptions.setScaleImageToShapeSize(false);
doc.save("文件转换样式测试.html", saveOptions);

Converted HTML file:
文件转换样式测试.zip (3.1 KB)

2. Convert the HTML back to a Word file
Code used:

Document html = new Document(in);
html.save(out, SaveFormat.DOCX);

Converted Word file:
文件转换样式测试-convert.zip (12.0 KB)

After these two conversion steps, the 【一】, the black rectangle, and the blue headings appear.

ZhonghaoSun · May 22, 2024, 1:44am

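I tested the code you provided, and it does fix all three problems.
However, our conversion flow is: 1. Word to HTML, 2. HTML back to Word.
The original Word file used in step 1 has no 1.1.1.1~1.1.1.3 headings.
image.png (69.9 KB)
Could you take another look with the complete conversion flow in mind?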

vyacheslav.deryushev · May 22, 2024, 7:10am

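@ZhonghaoSun Currently, in your original DOCX file, the 1.1.1.1~1.1.1.3 list numbers have a font size of 0 and a background color of 0, as shown in the screenshot.

I have created an issue about preserving the list numbering.
We have opened the following new ticket in our internal issue tracking system and will deliver a fix according to the terms mentioned in the Free Support Policies.

Issue ID(s): WORDSNET-26996

If you need priority support and direct access to our Paid Support management team, you can obtain Paid Support Services.

Please note that a DOCX->HTML->DOCX round trip cannot always provide 100% fidelity, because the HTML and MS Word document object models differ significantly.

To get the correct output, please use the following code: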

Document doc = new Document("orig.docx");

HtmlSaveOptions saveOptions = new HtmlSaveOptions();
saveOptions.setExportImagesAsBase64(true);
saveOptions.setScaleImageToShapeSize(false);

doc.save("output.html", saveOptions);
doc = new Document("output.html");

for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (para.getListFormat().isListItem()) {
        ListLevel listLevel = para.getListFormat().getListLevel();
        listLevel.getFont().setColor(Color.BLACK);
        if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_2) {
            listLevel.setNumberStyle(NumberStyle.ARABIC);
            listLevel.setNumberFormat("\u0001.\u0001、");
        }
        if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_3) {
            listLevel.setNumberStyle(NumberStyle.ARABIC);
            listLevel.setNumberFormat("\u0001.\u0001.\u0002、");
        }
        if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_4) {
            listLevel.setTrailingCharacter(ListTrailingCharacter.NOTHING);
        }
    }
}
doc.updateListLabels();

doc.save("output.docx");

ZhonghaoSun · May 28, 2024, 9:33am

Hello, I ran into some errors when converting HTML to Word. Could you take a look?

Problem 1:

Error screenshot:
image.png (109.2 KB)

Exception:

com.aspose.words.FileCorruptedException: The document appears to be corrupted and cannot be loaded.
	at com.aspose.words.FileFormatUtil.zzV3(Unknown Source)
	at com.aspose.words.Document.zzWsW(Unknown Source)
	at com.aspose.words.Document.zzVSm(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at net.qiyuesuo.common.word.Html2WordUtils.convert2Word(Html2WordUtils.java:53)
	at net.qiyuesuo.common.word.Html2WordUtils.convert2Word(Html2WordUtils.java:41)
	at net.qiyuesuo.common.word.Html2WordUtils.main(Html2WordUtils.java:99)
Caused by: java.lang.IllegalStateException: XMLStreamException: Illegal to have multiple roots (start tag in epilog?).
 at [row,col {unknown-source}]: [1,68]
	at com.aspose.words.internal.zzYD9.zzVSm(Unknown Source)
	at com.aspose.words.internal.zzYD9.read(Unknown Source)
	at com.aspose.words.zzZoj.zzXoK(Unknown Source)
	at com.aspose.words.Document.zzWsW(Unknown Source)
	... 7 more
Caused by: com.aspose.words.internal.zzYob: Illegal to have multiple roots (start tag in epilog?).
 at [row,col {unknown-source}]: [1,68]
	at com.aspose.words.internal.zzWYp.zzZc5(Unknown Source)
	at com.aspose.words.internal.zzWYp.zzZ0Y(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzZSH(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzXQ4(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzWS7(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzZ52(Unknown Source)
	at com.aspose.words.internal.zzVZG.next(Unknown Source)
	at com.aspose.words.internal.zzYD9.read(Unknown Source)
	... 9 more

Process finished with exit code 1

HTML source file:
test.zip (478 bytes)

Problem 2:
Exception:
Exception in thread "main" com.aspose.words.FileCorruptedException: The document appears to be corrupted and cannot be loaded.
	at com.aspose.words.FileFormatUtil.zzV3(Unknown Source)
	at com.aspose.words.Document.zzWsW(Unknown Source)
	at com.aspose.words.Document.zzVSm(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at net.qiyuesuo.common.word.Html2WordUtils.convert2Word(Html2WordUtils.java:53)
	at net.qiyuesuo.common.word.Html2WordUtils.convert2Word(Html2WordUtils.java:41)
	at net.qiyuesuo.common.word.Html2WordUtils.main(Html2WordUtils.java:99)
Caused by: java.lang.IllegalStateException: XMLStreamException: Unexpected character '=' (code 61); expected a semi-colon after the reference for entity 'version'
 at [row,col {unknown-source}]: [1,178]
	at com.aspose.words.internal.zzYD9.zzVSm(Unknown Source)
	at com.aspose.words.internal.zzYD9.read(Unknown Source)
	at com.aspose.words.zzZoj.zzXoK(Unknown Source)
	at com.aspose.words.Document.zzWsW(Unknown Source)
	... 7 more
Caused by: com.aspose.words.internal.zzXPO: Unexpected character '=' (code 61); expected a semi-colon after the reference for entity 'version'
 at [row,col {unknown-source}]: [1,178]
	at com.aspose.words.internal.zzWYp.zzYk1(Unknown Source)
	at com.aspose.words.internal.zzWYp.zzW3k(Unknown Source)
	at com.aspose.words.internal.zzWYp.zzY6h(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzVSm(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzZdW(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzK4(Unknown Source)
	at com.aspose.words.internal.zzVZG.zzZ52(Unknown Source)
	at com.aspose.words.internal.zzVZG.next(Unknown Source)
	at com.aspose.words.internal.zzYD9.read(Unknown Source)
	... 9 more

Process finished with exit code 1

HTML source file:
test2.zip (437 bytes)

The code used in both cases:

		Document doc = new Document(in);
		try {
			NodeCollection childNodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
			if (childNodes != null) {
				for (Paragraph para : (Iterable<Paragraph>) childNodes) {
					if (para.getListFormat().isListItem()) {
						ListLevel listLevel = para.getListFormat().getListLevel();
						listLevel.getFont().setColor(Color.BLACK);
						if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_2) {
							listLevel.setNumberStyle(NumberStyle.ARABIC);
							listLevel.setNumberFormat("\u0001.\u0001、");
						}
						if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_3) {
							listLevel.setNumberStyle(NumberStyle.ARABIC);
							listLevel.setNumberFormat("\u0001.\u0001.\u0002、");
						}
						if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_4) {
							listLevel.setNumberStyle(NumberStyle.ARABIC);
							listLevel.setNumberFormat("\u0001.\u0001.\u0002.\u0003、");
							listLevel.getFont().setSize(10.5);
							listLevel.getFont().setBold(true);
							listLevel.getFont().getShading().clearFormatting();
						}
					}
				}
			}
		} catch (Exception e) {
			logger.warn("HTML to Word conversion: error while adjusting styles", e);
		}
		doc.updateListLabels();
		doc.save(out, SaveFormat.DOCX);

vyacheslav.deryushev · May 28, 2024, 11:59am

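@ZhonghaoSun You need to wrap the file content in <html>...content...</html> tags. Without these tags, the Aspose.Words file format detector cannot recognize the file as HTML. To Aspose.Words, such a file is just a txt file with a different extension, and it blocks such files.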

vyacheslav.deryushev · May 28, 2024, 12:05pm

@ZhonghaoSun In this case, if you do not want to add <html> tags to the document, only builder.insertHtml can help.
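For the first option, the content can be wrapped before it is handed to Aspose.Words. The following is a minimal sketch using only the Java standard library; the class name `HtmlFragmentWrapper` and the method `ensureHtmlRoot` are hypothetical, and the sketch assumes the content is available as a string and contains no XML declaration or doctype that would need to stay in front of the <html> tag.

```java
import java.util.regex.Pattern;

public class HtmlFragmentWrapper {
    // Matches an opening <html> tag, with or without attributes, in any letter case.
    private static final Pattern HTML_ROOT =
            Pattern.compile("<html[\\s>]", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the content unchanged if it already has an <html> tag,
     * otherwise wraps it in a minimal <html><body>...</body></html> shell.
     */
    public static String ensureHtmlRoot(String content) {
        if (HTML_ROOT.matcher(content).find()) {
            return content;
        }
        return "<html><body>" + content + "</body></html>";
    }

    public static void main(String[] args) {
        String fragment = "<p>first</p><p>second</p>";
        String wrapped = ensureHtmlRoot(fragment);
        System.out.println(wrapped);
        // The wrapped text could then be loaded as in the thread, for example through
        // new Document(InputStream) on a ByteArrayInputStream of its UTF-8 bytes.
    }
}
```

Content that already has an <html> tag passes through untouched, so the helper can be applied to every input before loading.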

ZhonghaoSun · May 29, 2024, 8:18am

OK. I would also like to ask whether this code can specify the page size of the converted DOCX.

	public static void convert2Word(InputStream in, OutputStream out) throws Exception {
		Document doc = new Document(in);
		try {
			NodeCollection childNodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
			if (childNodes != null) {
				for (Paragraph para : (Iterable<Paragraph>) childNodes) {
					if (para.getListFormat().isListItem()) {
						ListLevel listLevel = para.getListFormat().getListLevel();
						listLevel.getFont().setColor(Color.BLACK);
						if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_2) {
							listLevel.setNumberStyle(NumberStyle.ARABIC);
							listLevel.setNumberFormat("\u0001.\u0001、");
						}
						if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_3) {
							listLevel.setNumberStyle(NumberStyle.ARABIC);
							listLevel.setNumberFormat("\u0001.\u0001.\u0002、");
						}
						if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_4) {
							listLevel.setNumberStyle(NumberStyle.ARABIC);
							listLevel.setNumberFormat("\u0001.\u0001.\u0002.\u0003、");
							listLevel.getFont().setSize(10.5);
							listLevel.getFont().setBold(true);
							listLevel.getFont().getShading().clearFormatting();
						}
					}
				}
			}
		} catch (Exception e) {
			logger.warn("HTML to Word conversion: error while adjusting styles", e);
		}
		doc.updateListLabels();
		doc.save(out, SaveFormat.DOCX);
	}

Currently, when this code converts HTML to Word, the page size of the resulting Word file is 8 1/2 x 11.
image.png (49.5 KB)

Can it be set to A4?

vyacheslav.deryushev · May 29, 2024, 8:22am

@ZhonghaoSun Please use:

doc.getFirstSection().getPageSetup().setPaperSize(PaperSize.A4);

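If the document has multiple sections, you need to do this for every section (for example by iterating over doc.getSections()).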