[Java] read HTML from Doc using Aspose.Words

amey7p · May 27, 2010, 11:07am

Hello Alexey,

Refer to the attached image (1). I inserted everything from "1. This is test line 1" up to the image using the insertHTML API. However, after looking at the paragraph formatting, I see that my image falls in a new paragraph. How do I avoid this? I would like everything I insert using insertHTML to be in the same paragraph.
Most importantly, when I use your method of creating a new document for a paragraph and saving it as HTML (SaveFormat.HTML), the resulting HTML looks something like the attached image (2). Can you tell me why there are some mysterious non-printable characters in some places?

TIA.
Amey.

alexey.noskov · May 27, 2010, 1:54pm

Hi

Thank you for additional information.

It is impossible to keep everything inserted by insertHtml in a single paragraph, because HTML can contain other paragraphs, tables etc. A Paragraph, as well as a Table, cannot be a child of a Paragraph node. Please see Aspose.Words Document Object Model for more information:
https://docs.aspose.com/words/net/aspose-words-document-object-model/
You can zip your files and attach them here. If the forum does not allow you to attach zip archives either, you can send the files to my e-mail as described here:
https://forum.aspose.com/t/aspose-words-faq/2711
Best regards.

amey7p · May 27, 2010, 11:09pm

Hi Alexey,
I have uploaded the images to the post above for the special characters in HTML issue.

amey7p · May 28, 2010, 2:39am

Alexey, can you look into this issue? Thanks a lot.

amey7p · May 28, 2010, 5:55am

Alexey,

For the paragraph issue:

What I am thinking is that I will put all the HTML for a single column into a single table cell. Then I can read that cell (which will contain the HTML) from the table and use the convertHTML API on it. I tried this, but I am getting the error below:
java.lang.IllegalArgumentException: Cannot insert a node of this type at this location.
com.aspose.words.CompositeNode.a(CompositeNode.java:825)

here is my code:

NodeCollection w_rows = doc.getChildNodes(NodeType.ROW, true);
for (Object row : w_rows)
{
    Row w_row = (Row)row;
    if (w_row != null)
    {
        CellCollection w_cells = w_row.getCells();
        if (w_cells != null)
        {
            for (Object cell : w_cells)
            {
                Cell w_cell = (Cell)cell;
                if ("EditableFieldValueStyle".equals(w_cell.getFirstParagraph().getParagraphFormat().getStyleName()))
                {
                    Document temp = new Document();
                    temp.getFirstSection().getBody().appendChild(temp.importNode((Node)w_cell, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));
                    String html = ConvertDocumentToHtml(temp);
                    System.out.println("html=" + html);
                }
            }
        }

    }
}

alexey.noskov · May 28, 2010, 2:02pm

Hi

Thanks for your request. The problem occurs because you are trying to insert Cell into the document’s Body. Only Paragraph and Table nodes can be direct children of Body. Please see DOM for more information:
https://docs.aspose.com/words/net/aspose-words-document-object-model/
In your case, you need to insert child nodes of the cell into the Body.
Best regards.

amey7p · May 30, 2010, 10:56am

Hey Alexey, thanks a lot. I will try this out and let you know. Basically, I want to import a cell with HTML content into another document in order to read the doc file. Have you figured out anything regarding the special character issue?

amey7p · May 31, 2010, 2:29am

Hey Alexey,
Please let me know about the special character issue, since it is very critical and urgent. Thanks. Also, a small doubt: why are URL images in doc files (an InsertPicture type of field holding an image URL) downloaded into HtmlExportImagesFolder? There is no need to download them; only local images, which are either pasted into the doc or inserted from a file, need to be saved there.

alexey.noskov · May 31, 2010, 2:41am

Hi

Thanks for your request. Could you please provide me your HTML and sample code, which will allow me to reproduce the problem with special characters on my side? I will check the issue on my side and provide you more information.
Best regards.

amey7p · May 31, 2010, 2:52am

I will mail you the file and doc since I can't upload them. Please let me know your email ID. Also, can you comment on the image behaviour? Thanks.

amey7p · May 31, 2010, 3:10am

Alexey, I figured out the issue.
It was on my side: in the convertHTMLtoDoc API I was passing an ObjectOutputStream to the document.save API instead of a plain OutputStream. Changing this solved the issue. It still gives 2-3 junk chars at the start, but I will take care of them. Can you comment on my image issues?
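For anyone wondering why the stream type matters, here is a minimal sketch, using only the JDK (no Aspose calls), of what an ObjectOutputStream adds to raw bytes written through it. The class and method names are made up for illustration; the point is that the wrapper emits a serialization stream header and block markers, which then show up as non-printable characters in front of the HTML.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class StreamHeaderDemo {
    // Writes the text through a plain OutputStream: the bytes arrive unchanged.
    static byte[] writePlain(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream out = sink;
        out.write(text.getBytes(StandardCharsets.UTF_8));
        out.flush();
        return sink.toByteArray();
    }

    // Writes the same text through an ObjectOutputStream: the wrapper first
    // emits the serialization stream header and then frames the data in blocks.
    static byte[] writeWrapped(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(sink);
        out.write(text.getBytes(StandardCharsets.UTF_8));
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html></html>";
        byte[] plain = writePlain(html);
        byte[] wrapped = writeWrapped(html);
        System.out.println("plain length:   " + plain.length);
        System.out.println("wrapped length: " + wrapped.length);
        // Print the first two bytes of the wrapped output in hex.
        System.out.printf("wrapped starts with: %02X %02X%n", wrapped[0], wrapped[1]);
    }
}
```

Passing a plain OutputStream (for example a FileOutputStream or ByteArrayOutputStream) to the save call avoids these extra bytes.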
I have stored some pure HTML in the database, containing image links as URLs (e.g. http://www.abc.com/pqr.jpg) that the user fills in via a form (local image links like C:\myimages\abc.jpg are not allowed). When I read this data from the database and give it to Word via Aspose, Word renders these images properly in the doc. But when I read this doc with Aspose again using the ConvertDocumentToHtml API, it gives me HTML in which the image link is replaced by the HtmlExportImagesFolderAlias prefix plus a random image name generated by Aspose (e.g. abc\Aspose.Words.e1ed8ef3-e25f-498f-b082-a3a308d89377.001.jpeg). I think this should be done only for images pasted into Word that have no web reference. This image replacement causes me a usability issue and a server-side disk space issue, since I have to store the images on the server.

alexey.noskov · May 31, 2010, 3:19am

Hi

It is great that you resolved the problem. The 2-3 junk chars at the start are the BOM (byte order mark): http://en.wikipedia.org/wiki/Byte-order_mark
I think, in your case, you can just remove all characters before the first “<” character.
Regarding images, when you insert HTML (with images) into a Word document, the images are inserted as embedded images. So no information about the original image URLs is stored in the document.
Best regards.
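
The "remove everything before the first <" suggestion can be sketched in plain Java as below. This is only an illustration: the class and method names are hypothetical, and it assumes the HTML has already been read into a String.

```java
public class BomStripper {
    // Drops any characters (such as a BOM) that precede the first '<'.
    // Returns the input unchanged if it already starts with '<' or has no '<' at all.
    static String stripBeforeFirstTag(String html) {
        int firstTag = html.indexOf('<');
        return firstTag > 0 ? html.substring(firstTag) : html;
    }

    public static void main(String[] args) {
        // \uFEFF is the BOM character as it appears after decoding UTF-8 text.
        String withBom = "\uFEFF<html><body>test</body></html>";
        System.out.println(stripBeforeFirstTag(withBom));
    }
}
```

Because it cuts at the first tag rather than looking for one specific character, the same helper also works when the BOM bytes were decoded with the wrong charset and appear as several junk characters.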

amey7p · May 31, 2010, 3:27am

Hi Alexey,
Thanks a lot. I think you are clear about my requirement, so is any other workaround possible? I don't want the image links to change, and I want to avoid the server-side disk space issue :( Please suggest any other workaround (if possible) using Aspose.

amey7p · May 31, 2010, 6:12am

Hi Alexey,
Is there any way on the Aspose side to crack this usability and server-side disk space issue?

alexey.noskov · June 1, 2010, 2:20am

Hi

Thanks for your request. Unfortunately, I cannot suggest any simple way to achieve this. As a possible solution, you can try replacing the IMG tags in your HTML with placeholders (simple text). Then, after inserting the HTML, you can find these placeholders in your document and replace them with INCLUDEPICTURE fields. In this case, when you convert a document with INCLUDEPICTURE fields to HTML, the image links from these fields will be left untouched.
You can use ReplaceEvaluator to replace text with fields:
https://reference.aspose.com/words/net/aspose.words/range/replace/
You can use DocumentBuilder.InsertField method to insert fields into the document:
https://reference.aspose.com/words/net/aspose.words/documentbuilder/insertfield/
Hope this helps.
Best regards.
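
The first step (IMG tags to text placeholders) needs nothing from Aspose, so it can be sketched with the JDK alone. This is a rough illustration under stated assumptions: the class name, the regular expression, and the {start}...{end} marker format are choices made for the example, and the pattern only handles IMG tags whose src attribute is quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgPlaceholders {
    // Assumed pattern: an IMG tag with a quoted src attribute.
    // Group 1 captures the image URL.
    private static final Pattern IMG_TAG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Replaces every matching IMG tag with {start}URL{end}.
    static String toPlaceholders(String html) {
        Matcher matcher = IMG_TAG.matcher(html);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String placeholder = "{start}" + matcher.group(1) + "{end}";
            // quoteReplacement keeps '$' and '\' in the URL from being
            // interpreted as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(placeholder));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Before <img src=\"http://www.abc.com/pqr.jpg\" /> after</p>";
        System.out.println(toPlaceholders(html));
    }
}
```

The placeholder text survives the HTML import as ordinary text, which is what makes it findable later with a find-and-replace pass over the document.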

amey7p · June 1, 2010, 8:27am

Hi Alexey,

The Doc file contains the modified data, so that is my input; I don't have any HTML file as input. This Doc will also already have images that are linked locally. So how can I proceed with this? Also, can you give me an example to understand this better? Thanks a lot.

alexey.noskov · June 1, 2010, 11:06am

Hi

Thanks for your request. Here is a simple code example that shows the technique I suggested:

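// Imports needed by this snippet and by the helper class below.
import com.aspose.words.*;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Open your input document and create DocumentBuilder, which will help you to insert HTML.
Document doc = new Document("C:\\Temp\\in.doc");
DocumentBuilder builder = new DocumentBuilder(doc);
// This is the HTML string which you need to insert into the document.
// In my case, it is hardcoded, in your case you will get it from DB.
// NOTE: the markup in this string was stripped from the post; the tags and the
// image URL here are an illustrative reconstruction, not the original values.
String html = "<p>This is sample HTML with images <img src=\"http://www.abc.com/pqr.jpg\" /> this is some text after image</p>";
// Here you can move the DocumentBuilder cursor wherever you need.
// ......................................................
// Using a regular expression, replace each IMG tag in your HTML with a placeholder.
// NOTE: the pattern was also stripped from the post; this one is an assumed
// equivalent that captures the src value of an IMG tag in group 1.
Pattern regex = Pattern.compile("<img[^>]*src=[\"']([^\"']+)[\"'][^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = regex.matcher(html);
while (matcher.find())
{
    // Now, we can replace our IMG tag with a text placeholder.
    // For example we can use something like this:
    // {start}image URL{end}
    html = html.replace(matcher.group(0), "{start}" + matcher.group(1) + "{end}");
}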
// After replacing IMG tags with placeholders, insert HTML into the document.
builder.insertHtml(html);
// Now, we need to replace our placeholders with INCLUDEPICTURE field.
// We will use ReplaceEvaluator to achieve this.
// But first we need to create a regular expression, which will allow us to find placeholders in the document.
Pattern placeholderSearcher = Pattern.compile("\\{start\\}([^\"'']+)\\{end\\}");
// Find and replace placeholders with INCLUDEPICTURE field.
doc.getRange().replace(placeholderSearcher, new ReplaceEvaluatorIncludePicture(), false);
// Save as DOC. You should note INCLUDEPICTURE fields are not updated automatically.
// You need to update fields in the document manually (ctrl+A and press F9).
doc.save("C:\\Temp\\out.doc");

========================================================================

private static class ReplaceEvaluatorIncludePicture implements ReplaceEvaluator
{
    /**
     * This method is called by the Aspose.Words find and replace engine for each match.
     */
    public int replace(Object sender, ReplaceEvaluatorArgs e) throws Exception
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();
        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = SplitRun((Run)currentNode, e.getMatchOffset());
        // This array is used to store all nodes of the match for further deleting.
        ArrayList<Run> runs = new ArrayList<Run>();
        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().end() - e.getMatch().start();
        while (
                (remainingLength > 0) &&
                        (currentNode != null) &&
                        (currentNode.getText().length() <= remainingLength))
        {
            runs.add((Run)currentNode);
            remainingLength = remainingLength - currentNode.getText().length();
            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            }
            while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }
        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.add((Run)currentNode);
        }
        // Create DocumentBuilder, it will allow us to insert the field.
        DocumentBuilder builder = new DocumentBuilder((Document)e.getMatchNode().getDocument());
        // Move builder to the matched node.
        builder.moveTo(runs.get(0));
        // Insert INCLUDEPICTURE field.
        builder.insertField("INCLUDEPICTURE \""+e.getMatch().group(1)+"\" \\d", "");
        // Now remove all runs in the sequence.
        for (Run run : runs)
            run.remove();
        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.SKIP;
    }
    /**
     * Splits text of the specified run into two runs.
     * Inserts the new run just after the specified run.
     */
    private Run SplitRun(Run run, int position) throws Exception
    {
        Run afterRun = (Run)run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}

You are free to change the code.
Best regards.

amey7p · June 1, 2010, 11:17am

Hi Alexey,

Thanks for your help, but the Doc file should be the standalone input for any conversion, since the data in the database can't always be consistent with what is in the Doc file, and that would create issues. I don't want to add any additional HTML from the DB or from any HTML file; everything will come from the Doc file. I hope you get my point.

alexey.noskov · June 1, 2010, 12:20pm

Hi

Thank you for the additional information. Unfortunately, I cannot suggest any other workaround for this issue.
Best regards.

amey7p · June 2, 2010, 10:40am

Hi Alexey, thanks a lot for your help. Just a small doubt: would my requirement be solved if I use the DocumentVisitor model?