HTML image convert and removing TOC

branislav.cavlin · January 31, 2013, 9:36am

Hi,

I am evaluating your total product for Java and I have found two issues that you might help me with:

1. When inserting HTML into Word, it seems that an image from an HTTP source cannot be imported. I had to extract the image, convert its bytes to Base64, and then replace the image src to make it work. Can you convert images from HTML? You can use the HTML text from http://ckeditor.com/demo to test it. (I was using builder.insertHtml(…).)
2. When merging two documents, I had an issue trying to remove the TOC from the second document. I tried two things. First, I used a DocumentVisitor, overriding the field start, separator, and end methods and returning the skip-node action, but that did not work. Second, I iterated through the nodes and removed them, but that did not work either. A code sample for the second approach is here: http://paste.ubuntu.com/1593347/.

Any help is appreciated.

Thanks.

awais.hafeez · February 1, 2013, 11:01am

Hi Branislav,

Thanks for your inquiry.

Please save the HTML string that causes this problem to a text file and attach the file here for testing.
Could you please also attach your input Word documents (.doc files) here for testing? I will investigate the issues on my side and provide you with more information.

Best regards,

branislav.cavlin · February 1, 2013, 1:47pm

Hi,

Files are attached.

Thanks

awais.hafeez · February 4, 2013, 10:18am

Hi Branislav,

Thanks for your inquiry.

*Branislav:

When inserting HTML into Word, it seems that an image from an HTTP source cannot be imported. I had to extract the image, convert its bytes to Base64, and then replace the image src to make it work. Can you convert images from HTML? You can use the HTML text from http://ckeditor.com/demo to test it.*

Please note that Aspose.Words automatically downloads the image before inserting it into the Word document if you specify a remote URL in the src attribute of an <img> tag in the HTML. The output Word document therefore contains the embedded image, and you don’t need to extract the image, convert its bytes to Base64, and replace the image src to make it work.
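
For comparison, the manual workaround described in the question can be sketched with the Java standard library alone. This is only an illustration: the class and method names, the example URL, and the placeholder image bytes are hypothetical, and fetching the image bytes from the remote server is left out. It does not use the Aspose.Words API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class InlineImageSketch {
    // Builds a data URI ("data:<mime>;base64,<payload>") from raw image bytes.
    static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }

    // Replaces every occurrence of the remote URL in the HTML with the data URI.
    static String inlineImage(String html, String remoteUrl,
                              byte[] imageBytes, String mimeType) {
        return html.replace(remoteUrl, toDataUri(imageBytes, mimeType));
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a downloaded image (assumption).
        byte[] fakeImage = "abc".getBytes(StandardCharsets.US_ASCII);
        String url = "http://example.com/logo.png"; // hypothetical URL
        String html = "<p><img src=\"" + url + "\" /></p>";
        System.out.println(inlineImage(html, url, fakeImage, "image/png"));
    }
}
```

The resulting string could then be passed to builder.insertHtml(…) in place of the original HTML.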

Secondly, using the latest version of Aspose.Words for Java (13.1), I was unable to reproduce this issue on my side. I suggest upgrading to the latest version, which you can download from the following link:
https://releases.aspose.com/words/java

I have also attached the Word document that is generated on my side here for your reference.

Moreover, I am working on the second part of your request and will get back to you soon.

Best regards,

awais.hafeez · February 7, 2013, 11:55pm

Hi Branislav,

Thanks for your patience.

Please note that the TOC in a Word document is actually represented by a field. Every field in a Word document starts with a FieldStart node and ends with a FieldEnd node. Therefore, to remove the TOC completely, you should remove all content between these nodes, along with the nodes themselves. I suggest the following Java code to remove the TOC from your document:

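// Removes every node between start and end (both exclusive), in document order.
private static void removeSequence(Node start, Node end)
{
    Node curNode = start.nextPreOrder(start.getDocument());
    while (curNode != null && !curNode.equals(end))
    {
        // Fetch the next node first, because curNode may be removed below.
        Node nextNode = curNode.nextPreOrder(start.getDocument());
        if (curNode.isComposite())
        {
            // Remove a composite node only if neither boundary node is one of its
            // children; otherwise descend into it and remove its children one by one.
            if (!((CompositeNode)curNode).getChildNodes().contains(end) &&
                                !((CompositeNode)curNode).getChildNodes().contains(start))
            {
                nextNode = curNode.getNextSibling();
                curNode.remove();
            }
        }
        else
        {
            curNode.remove();
        }
        curNode = nextNode;
    }
}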

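Document doc = new Document("C:\\Temp\\merge-document.doc");
// Collect every FieldStart node in the document, at any depth.
NodeCollection starts = doc.getChildNodes(NodeType.FIELD_START, true);

for (FieldStart start : (Iterable<FieldStart>)starts)
{
    if (start.getFieldType() == FieldType.FIELD_TOC)
    {
        // Walk forward to the FieldEnd that closes the TOC field; FieldEnd nodes of
        // fields nested inside the TOC have a different field type and are skipped.
        Node curNode = start;
        while (curNode != null && !(curNode.getNodeType() == NodeType.FIELD_END &&
                        ((FieldEnd)curNode).getFieldType() == FieldType.FIELD_TOC))
        {
            curNode = curNode.nextPreOrder(start.getDocument());
        }
        // Stop if no matching FieldEnd was found.
        if (curNode == null)
            break;
        // Remove the field's content, then its two boundary nodes.
        removeSequence(start, curNode);
        start.remove();
        curNode.remove();
        // Only the first TOC is removed.
        break;
    }
}
doc.save("C:\\Temp\\out.doc");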
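
To illustrate why the search for the closing node checks the field type, here is a minimal, library-free sketch. It does not use the Aspose.Words API: nodes are modeled as plain strings in a hypothetical "START:…" / "END:…" / "RUN:…" notation invented for this example, and the class and method names are illustrative only. A TOC typically contains nested fields such as PAGEREF, so stopping at the first end marker of any type would cut the removal short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldSpanSketch {
    // Returns a copy of nodes without the first field of the given type,
    // including its start marker, its end marker, and everything in between.
    static List<String> removeField(List<String> nodes, String fieldType) {
        int start = nodes.indexOf("START:" + fieldType);
        if (start < 0) {
            return new ArrayList<>(nodes); // no such field
        }
        // Look for the end marker of the SAME type, skipping nested fields' ends.
        int end = -1;
        for (int i = start + 1; i < nodes.size(); i++) {
            if (nodes.get(i).equals("END:" + fieldType)) {
                end = i;
                break;
            }
        }
        if (end < 0) {
            return new ArrayList<>(nodes); // unterminated field: leave untouched
        }
        List<String> result = new ArrayList<>(nodes.subList(0, start));
        result.addAll(nodes.subList(end + 1, nodes.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList(
                "RUN:Title",
                "START:TOC", "START:PAGEREF", "RUN:1", "END:PAGEREF", "END:TOC",
                "RUN:Body");
        System.out.println(removeField(nodes, "TOC"));
    }
}
```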

I hope this helps.

Best regards,
