When I convert Word to HTML, the result loses the HTML I inserted into the Word document

baifang · August 11, 2022, 7:15am

I use HTML to replace the text I find at a specific location, and then I convert the Word document to HTML. I hoped the HTML I inserted into the Word document would be preserved, but the conversion result seems to lose it. I do not know whether my approach to this issue is right.
My goal is to replace specific text in the Word document with new text and mark the new text with a unique id, so that after converting to HTML I can use the id to find the new text in the HTML.
Looking forward to your reply. Thank you very much!

    @Test
    public void shubao_test1() throws Exception {
        String resultPath = "D:\\opt\\compare\\测试文档.docx";
        String savePath = "D:\\opt\\test\\测试文档.html";
        Document document = new Document(resultPath);
        AsposeWordUtils.replaceWithHtml(document, "十一五", "shiyiwu", 123456L, 0);
        HtmlFixedSaveOptions htmlFixedSaveOptions = new HtmlFixedSaveOptions();
        htmlFixedSaveOptions.setExportEmbeddedCss(true);
        htmlFixedSaveOptions.setExportEmbeddedFonts(true);
        htmlFixedSaveOptions.setExportEmbeddedImages(true);
        htmlFixedSaveOptions.setShowPageBorder(true);
        document.save(savePath, htmlFixedSaveOptions);
    }

    public static void replaceWithHtml(Document document, String oldTxt, String newTxt, Long markId, int count) throws Exception {
        // Replace an occurrence of oldTxt with an HTML <mark> element that carries markId.
        // The callback performs the actual insertion; count is the zero-based index of the match to replace.
        FindReplaceOptions options = new FindReplaceOptions();
        options.setDirection(FindReplaceDirection.FORWARD);
        options.setReplacingCallback(new ReplaceWithHtmlEvaluator(options, count));
        document.getRange().replace(oldTxt, "<mark id=\"" + markId + "\">" + newTxt + "</mark>", options);
    }

public class ReplaceWithHtmlEvaluator implements IReplacingCallback {
    private FindReplaceOptions findReplaceOptions;
    private int replaceCount = 0;
    private int target = 0;

    public ReplaceWithHtmlEvaluator(FindReplaceOptions findReplaceOptions, int target) {
        this.findReplaceOptions = findReplaceOptions;
        this.target = target;
    }


    @Override
    public int replacing(ReplacingArgs replacingArgs) throws Exception {
        if (replaceCount < target) {
            replaceCount++;
            return ReplaceAction.SKIP;
        }
        if (replaceCount > target) {
            return ReplaceAction.STOP;
        }
        Node currentNode = replacingArgs.getMatchNode();
        if (replacingArgs.getMatchOffset() > 0) {
            // Split the run so that the new run starts at the matched text
            currentNode = SplitRun((Run) currentNode, replacingArgs.getMatchOffset());
        }
        ArrayList<Run> runs = new ArrayList<>();
        int remainingLength = replacingArgs.getMatch().group().length();
        if ((currentNode != null) && (remainingLength > 0)) {
            SplitRun((Run) currentNode, remainingLength);
            runs.add(currentNode);
        }
        DocumentBuilder builder = new DocumentBuilder((Document) replacingArgs.getMatchNode().getDocument());
        builder.moveTo((Run) runs.get(0));
        builder.insertHtml(replacingArgs.getReplacement());
        // Now remove the matched node
        ((Run) runs.get(0)).remove();
        replaceCount++;
        // Signal to the replace engine to do nothing because we have already done all what we wanted.
        return ReplaceAction.SKIP;
    }

    private static Run SplitRun(Run run, int position) {
        if (position + 1 > run.getText().length()) {
            return run;
        }
        Run afterRun = (Run) run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}

测试文档.docx (11.2 KB)

alexey.noskov · August 11, 2022, 7:28am

@baifang When you insert HTML into the document using the DocumentBuilder.insertHtml method, Aspose.Words interprets the inserted HTML snippet and converts it to the appropriate nodes in the Aspose.Words Document Object Model. There is no way to preserve the original HTML tags after inserting HTML into the document and then converting the document to HTML.

baifang · August 11, 2022, 7:55am

OK, thank you very much. Could you give me some suggestions on how to tag the new text in the HTML? Or should I use Aspose.HTML directly to replace the text? It does not seem to have a find-and-replace method, though.

alexey.noskov · August 11, 2022, 8:33am

@baifang You can consider marking the inserted HTML content with a bookmark. For example, see the following code:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.write("This is content before inserted HTML snippet. ");
// Start the bookmark before the inserted HTML.
builder.startBookmark("mybk");
builder.insertHtml("<b>bold text from HTML</b>");
// End the bookmark after the inserted HTML.
builder.endBookmark("mybk");
builder.write(" This is content after inserted HTML snippet.");

HtmlSaveOptions options = new HtmlSaveOptions();
options.setPrettyFormat(true);
doc.save("C:\\Temp\\out.html", options);

As a result, the inserted HTML content will be marked with a bookmark in the output HTML:

<a name="mybk"><span style="font-weight:bold">bold text from HTML</span></a>
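
To find the marked text again in the saved HTML, something like the following sketch might work. It uses only the Java standard library and a regular expression rather than a real HTML parser. The class BookmarkLookup and its methods are made-up names for illustration, not part of Aspose.Words, and the sketch assumes the bookmark is written as an <a name="..."> element like the one above, with no nested <a> elements inside it.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class BookmarkLookup {

    /**
     * Returns the inner HTML of the first <a name="..."> element whose name
     * equals bookmarkName, or an empty Optional when no such anchor exists.
     * Assumption: the anchor does not contain nested <a> elements.
     */
    static Optional<String> innerHtml(String html, String bookmarkName) {
        Pattern anchor = Pattern.compile(
                "<a\\s+name=\"" + Pattern.quote(bookmarkName) + "\"[^>]*>(.*?)</a>",
                Pattern.DOTALL);
        Matcher m = anchor.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Strips tags so that only the visible text of a fragment remains. */
    static String textOnly(String fragment) {
        return fragment.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<p><span>before </span>"
                + "<a name=\"mybk\"><span style=\"font-weight:bold\">bold text from HTML</span></a>"
                + "<span> after</span></p>";
        String inner = innerHtml(html, "mybk").orElse("");
        System.out.println(textOnly(inner)); // prints: bold text from HTML
    }
}
```

For anything beyond a quick check, a proper HTML parser would be a safer choice than a regular expression.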

baifang · August 12, 2022, 2:50am

Thanks! It's really a good idea to use a bookmark to solve this. I gave it a try, but I do not know why the text I inserted is not surrounded by the bookmark tag. The demo code you gave me works as expected.

@Override
public int replacing(ReplacingArgs replacingArgs) throws Exception {
    if (replaceCount < target) {
        replaceCount++;
        return ReplaceAction.SKIP;
    }
    if (replaceCount > target) {
        return ReplaceAction.STOP;
    }
    Node currentNode = replacingArgs.getMatchNode();
    if (replacingArgs.getMatchOffset() > 0) {
        // Split the run so that the new run starts at the matched text
        currentNode = SplitRun((Run) currentNode, replacingArgs.getMatchOffset());
    }
    ArrayList<Run> runs = new ArrayList<>();
    int remainingLength = replacingArgs.getMatch().group().length();
    if ((currentNode != null) && (remainingLength > 0)) {
        SplitRun((Run) currentNode, remainingLength);
        runs.add(currentNode);
    }
    DocumentBuilder builder = new DocumentBuilder((Document) replacingArgs.getMatchNode().getDocument());
    builder.moveTo((Run) runs.get(0));
    builder.startBookmark("1112222");
    builder.insertHtml("<mark>"+replacingArgs.getReplacement()+"</mark>");
    builder.endBookmark("1112222");
    // Now remove the matched node
    ((Run) runs.get(0)).remove();
    replaceCount++;
    // Signal to the replace engine to do nothing because we have already done all what we wanted.
    return ReplaceAction.SKIP;
}

image.png (202.1 KB)
image.png (92.1 KB)

alexey.noskov · August 12, 2022, 5:05am

@baifang Could you please also provide the HTML snippet that is used as the replacement? I will check on my side and provide you with more information.

baifang · August 12, 2022, 6:10am

The HTML snippet I used as a replacement: "<mark>shierwu</mark>".

Then I tried inserting a plain string instead of HTML, but the bookmark tag still does not surround the string I inserted. Details are as follows:

DocumentBuilder builder = new DocumentBuilder((Document) replacingArgs.getMatchNode().getDocument());
((Run)runs.get(0)).getFont().setHighlightColor(Color.YELLOW);
builder.moveTo((Run) runs.get(0));
builder.startBookmark("1112222");
builder.write("shierwu");
builder.endBookmark("1112222");
((Run) runs.get(0)).remove();

alexey.noskov · August 12, 2022, 7:37am

@baifang Thank you for the additional information. Unfortunately, I cannot reproduce the problem on my side. Here is the code I used for testing:

Document doc = new Document("C:\\Temp\\in.docx");
FindReplaceOptions options = new FindReplaceOptions(FindReplaceDirection.BACKWARD);
options.setReplacingCallback(new ReplaceWithHtmlCallback());
doc.getRange().replace("<%placeholder%>", "<mark>shierwu</mark>", options);
HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions();
htmlSaveOptions.setPrettyFormat(true);
doc.save("C:\\Temp\\out.html", htmlSaveOptions);

import com.aspose.words.*;
import java.util.ArrayList;

public class ReplaceWithHtmlCallback implements IReplacingCallback {

    /**
     * This method is called by the Aspose.Words find and replace engine for each match.
     */
    @Override
    public int replacing(ReplacingArgs e) throws Exception {

        Document doc = (Document)e.getMatchNode().getDocument();

        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();

        // The first (and maybe the only) run can contain text before the match;
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run)currentNode, e.getMatchOffset());

        // This array is used to store all nodes of the match for further deleting.
        ArrayList<Run> runs = new ArrayList<Run>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while (
                remainingLength > 0 &&
                        currentNode != null &&
                        currentNode.getText().length() <= remainingLength)
        {
            runs.add((Run)currentNode);
            remainingLength -= currentNode.getText().length();

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            } while (currentNode != null && currentNode.getNodeType() != NodeType.RUN);
        }

        // Split the last run that contains the match if there is any text left.
        if (currentNode != null && remainingLength > 0)
        {
            splitRun((Run)currentNode, remainingLength);
            runs.add((Run)currentNode);
        }

        // Create DocumentBuilder to insert HTML.
        DocumentBuilder builder = new DocumentBuilder(doc);
        // Move builder to the first run.
        builder.moveTo(runs.get(0));
        // Insert the HTML wrapped in a uniquely named bookmark.
        String bkName = "bookmark_" + mBookmarkIndex;
        mBookmarkIndex++;
        builder.startBookmark(bkName);
        builder.insertHtml(e.getReplacement());
        builder.endBookmark(bkName);

        // Delete matched runs
        for (Run run : runs)
            run.remove();

        // Signal to the replace engine to do nothing because we have already done all what we wanted.
        return ReplaceAction.SKIP;
    }

    private static Run splitRun(Run run, int position)
    {
        Run afterRun = (Run)run.deepClone(true);
        run.getParentNode().insertAfter(afterRun, run);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        return afterRun;
    }

    private int mBookmarkIndex = 0;
}
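
The run-splitting steps in the callback above can be traced without Aspose.Words by modelling each Run as a plain string. The sketch below is only an illustration under that assumption: RunSplitSketch, splitRun and isolateMatch are hypothetical names that operate on a List<String>, and unlike the real callback they never have to skip non-Run siblings such as BookmarkStart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the callback's splitting logic; each String stands in for one Run.
final class RunSplitSketch {

    /**
     * Splits runs.get(index) at position: the original entry keeps the text
     * before position and a new entry with the rest is inserted right after it.
     * Returns the index of the new entry.
     */
    static int splitRun(List<String> runs, int index, int position) {
        String text = runs.get(index);
        runs.set(index, text.substring(0, position));
        runs.add(index + 1, text.substring(position));
        return index + 1;
    }

    /**
     * Isolates a match that starts at matchOffset inside runs.get(matchIndex)
     * and is matchLength characters long. Returns the indices of the entries
     * that together hold exactly the matched text.
     */
    static List<Integer> isolateMatch(List<String> runs, int matchIndex,
                                      int matchOffset, int matchLength) {
        int current = matchIndex;
        // Cut off any text that precedes the match in the first run.
        if (matchOffset > 0) {
            current = splitRun(runs, current, matchOffset);
        }
        List<Integer> matched = new ArrayList<>();
        int remaining = matchLength;
        // Collect runs that lie entirely inside the match.
        while (remaining > 0 && current < runs.size()
                && runs.get(current).length() <= remaining) {
            matched.add(current);
            remaining -= runs.get(current).length();
            current++;
        }
        // Cut off any text that follows the match in the last run.
        if (current < runs.size() && remaining > 0) {
            splitRun(runs, current, remaining);
            matched.add(current);
        }
        return matched;
    }

    public static void main(String[] args) {
        // The placeholder "<%placeholder%>" (15 characters) spans two runs.
        List<String> runs = new ArrayList<>(List.of("Hello <%place", "holder%> world"));
        List<Integer> matched = isolateMatch(runs, 0, 6, 15);
        System.out.println(runs);    // four entries: "Hello ", "<%place", "holder%>", " world"
        System.out.println(matched); // prints [1, 2]
    }
}
```

The entries listed in the result correspond to the runs that the callback removes after inserting the replacement.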

And here is the output HTML produced on my side:

<span style="line-height:108%; font-size:11pt">This is a document </span><a name="bookmark_1"><span style="font-family:'Times New Roman'; background-color:#ffff00">shierwu</span></a><span style="line-height:108%; font-size:11pt"> and more </span><a name="bookmark_0"><span style="font-family:'Times New Roman'; background-color:#ffff00">shierwu</span></a>

As you can see, the replacement is surrounded by a bookmark.

baifang · August 12, 2022, 8:50am

Thank you for your reply! I think I have found the reason. I compared your code with mine: you use HtmlSaveOptions when saving the document to HTML, whereas I use HtmlFixedSaveOptions. Although I still cannot understand the essential distinction between them, I am glad to have found what causes the problem. Thank you again for your help!

alexey.noskov · August 12, 2022, 8:57am

@baifang When you save a document in HtmlFixed format, the document is rendered the same way as it is rendered to PDF or any other fixed-page format. This format is perfect for viewing the document, since the document looks exactly the same as it looks in MS Word. Html, on the other hand, is a flow format, which is good for editing.
Unfortunately, the bookmark trick will not work for the HtmlFixed format, because the document layout model has no bookmarked range, only a bookmark point. As an option, you can insert two bookmarks, a start bookmark and an end bookmark, and then search for the content between the two bookmarks.

baifang · August 12, 2022, 9:02am

Ok, I get it! Thank you for your answer!