[Java] read HTML from Doc using Aspose.Words

amey7p · May 20, 2010, 8:23am

Hi All,
I am new to Aspose. I went through the API documentation, and it is good :). My requirement is to write values into a Word document, and those values can contain HTML, so for writing I used the DocumentBuilder.insertHtml API. I also want to read Word documents whose nodes can contain HTML. To extract the node values I select paragraphs by style; here is my code.

public static ArrayList<Paragraph> paragraphsByStyleName(Document doc, String styleName) throws Exception
{
    // Note: styleName is not used yet; the two style names below are hard-coded.
    ArrayList<Paragraph> paragraphsWithStyle = new ArrayList<Paragraph>();
    // Collect every paragraph in the document, at any depth.
    NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
    for (Object node : paragraphs)
    {
        Paragraph paragraph = (Paragraph)node;
        String name = paragraph.getParagraphFormat().getStyle().getName();
        if ("Style1".equals(name) || "Style2".equals(name))
            paragraphsWithStyle.add(paragraph);
    }
    return paragraphsWithStyle;
}

for (Object node : paragraphs)
{
    Paragraph paragraph = (Paragraph)node;
    System.out.println(paragraph.toTxt());
}

When I run this, the node's text value does not include the image I added :(, and I cannot find any API that returns my HTML.
Can somebody help me?

Thanks & Regards,
- Amey

alexey.noskov · May 20, 2010, 11:29am

Hi

Thanks for your inquiry. I think you can try an approach similar to the one suggested here to achieve what you need:
https://forum.aspose.com/t/extract-text-with-formatting/73561
Hope this helps.
Best regards.

amey7p · May 21, 2010, 12:57am

Hi Alexey, thanks for your support, but I think that will give me the whole doc file as HTML, not an individual node's value as HTML, right?

alexey.noskov · May 21, 2010, 2:57am

Hi

Thanks for your inquiry. As I mentioned in the above thread, there is no direct way to get HTML from a particular node of the document. To achieve this, you should copy the node into an empty document and then convert that document to HTML. As a result you will get the HTML of a document that contains only the node you are interested in.
Best regards.

amey7p · May 21, 2010, 3:15am

Thanks again. So, as per my code above…

Paragraph paragraph = (Paragraph)node;
System.out.println(paragraph.toTxt());

This has to be replaced so that I get the child nodes of each Paragraph and then append them to a new document, right? So I will have to do this for each node, right?

adam.skelton · May 21, 2010, 4:30am

Hi Amey,
That is partly correct. You don't need to get the child nodes of each paragraph; you can just call importNode with the second parameter set to true so that it automatically imports all of the paragraph's children as well. You can use the code from the link that Alexey posted above, like this:
You should first create a temporary document to import the nodes into.

Document tempDoc = new Document();

Then take each node (and its children) while keeping its formatting by using this line of code:

tempDoc.getFirstSection().getBody().appendChild(tempDoc.importNode(node, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));

This will import the node into the temporary document that you created. The true parameter means the node's children are copied as well, and ImportFormatMode.KEEP_SOURCE_FORMATTING means the node keeps its formatting from the original document.
You can then pass tempDoc to Alexey's ConvertDocumentToHtml method, which will return the HTML of that node and its children.
Please ask if you need any further help.
Thanks,

amey7p · May 26, 2010, 1:34am

For the line

tempDoc.getFirstSection().getBody().appendChild(tempDoc.importNode(node, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));

how should the node variable be initialized?

adam.skelton · May 26, 2010, 2:15am

Hi Amey,
The node is what you need to import into a new temporary document so that you can save the output as HTML. Remember that a paragraph is a node, so using the code included in your first post it would look like this:

NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Object node : paragraphs)
{
    // Cast to Node, then import each paragraph together with its children.
    tempDoc.getFirstSection().getBody().appendChild(tempDoc.importNode((Node)node, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));
}

Thanks,

amey7p · May 26, 2010, 4:39am

Thanks a lot…
I did it, and it gives me the output of all nodes as HTML, with some HTML head and body tags.
But I have a slightly different requirement: I want to extract the HTML of an individual node, not of the full document, and that HTML should not contain any additional HTML tags.
I changed my code slightly…
For each paragraph I am calling the following code:

Document temp = new Document();
temp.getFirstSection().getBody().appendChild(temp.importNode((Node)node, true, 
ImportFormatMode.KEEP_SOURCE_FORMATTING));
String html = ConvertDocumentToHtml(temp);
System.out.println("html="+html);

But this is also not giving me exactly what I want :(…
Also, it copies images into the setHtmlExportImagesFolder folder and modifies the image path as well. Is there some other workaround for copying images?
My document has some cells; the cell names have one style, and the values against them have different styles.
In the value against a cell somebody can add HTML as well, i.e. text plus an image.
My requirement is to extract these cell names and their corresponding values (using paragraph styles) and save them in a Map, so that I can iterate through the map and save the values in a database; when rendering on a page, this HTML will be displayed :).
I am attaching a sample file as well, which I want to read by styles.

alexey.noskov · May 26, 2010, 9:28am

Hi

Thank you for the additional information. Unfortunately, there is no other way to get the HTML of a particular node.
Also, as I can see, your document contains FormFields. I suppose you just need to get the values of these FormFields. If so, please see the following link to learn how to work with FormFields:
https://docs.aspose.com/words/net/working-with-form-fields/
Best regards.

amey7p · May 26, 2010, 9:43am

Thanks, but I am interested in extracting
Sdsdsdsd
Sdssssss & image & rest info
from Description (along with proper fonts & text size + image), and

sdddd2323 jhjhjhjhhh
sd23235444444444
sdsd
from Basic Course of Events (along with proper fonts & text size + image).
Both are present on the 2nd page.

amey7p · May 26, 2010, 11:02pm

Hey Alexey/Adam, can you help me?

adam.skelton · May 26, 2010, 11:25pm

Hi Amey,
If you are looking for a way to save certain nodes with their formatting to a database, there is a way to achieve this without converting them to HTML. Please take a look at the documentation here. It shows how you can save a document to a database and then retrieve it. In your situation, where you want to save just specific nodes, you can use the same method as above: import the specific nodes into a temporary document and then save that document to the database.
If you really require the nodes to be in pure HTML, you could keep only the tags you need from the converted node and remove the ones you don't, such as the wrapping html and body tags. This should allow you to work with individual nodes in HTML code.
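As a rough illustration of that tag-stripping idea, here is a minimal sketch in plain Java. It assumes the HTML for one node is already available as a String (for example, from the ConvertDocumentToHtml helper mentioned earlier); the class name HtmlBodyExtractor and the method innerBody are made up for this example and are not part of Aspose.Words. A regular expression is a fragile way to process HTML in general, so treat this as a starting point rather than a robust solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Aspose.Words class.
class HtmlBodyExtractor {
    // Matches everything between <body ...> and </body>,
    // case-insensitively and across line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the markup inside body, or the trimmed input if no body element is found. */
    static String innerBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : html.trim();
    }

    public static void main(String[] args) {
        String full = "<html><head><title>t</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        // prints: <p>Hello <b>world</b></p>
        System.out.println(innerBody(full));
    }
}
```

The exported body may still contain wrapper elements of its own (a div around the content, for instance), so some further trimming could be needed depending on what you want to store.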
Could you please clarify the problem you are having with setHtmlExportImagesFolder?
Thanks,

amey7p · May 26, 2010, 11:52pm

I don't want images to be stored in a local path since I am going to put this on the web; local paths like C:\temp\xyz.jpg won't work for me.
Also, I found an issue with paragraphs in my attached document: when I read the paragraphs and their child nodes,
I get 3 different paragraphs for these 3 lines:
sdddd2323 jhjhjhjhhh
sd23235444444444
sdsd
whereas those lines are under a single Basic Course of Events heading, so I want to club them together.
You can check the 2nd page of my attached doc.
I can't save the document directly to the DB. I want the output in a Java HashMap where the key is the node name and the value is the value against that node.
Ex.
{Description=<HTML for Sdsdsdsd Sdssssss+image>,Basic Course of Events=}

alexey.noskov · May 27, 2010, 1:58am

Hi

Thank you for the additional information. You can specify ImagesFolderAlias so that your images will be available from the web. Please see the following link for more information:
https://reference.aspose.com/words/net/aspose.words.saving/htmlsaveoptions/imagesfolderalias/
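To make the distinction concrete: the images folder is the physical location where the image files are written, while the alias is the name used to construct the image URIs written into the HTML. The sketch below is only a simplified model of that idea in plain Java, not how Aspose.Words implements it; the class ImageUriSketch, the method imageSrc, and the example.com URL are all hypothetical.

```java
// Simplified model of "folder on disk" versus "alias in the HTML".
// Hypothetical names; this is not Aspose.Words code.
class ImageUriSketch {
    /**
     * Builds the value that would end up in an img src attribute.
     * imagesFolder is where the file is written on disk; imagesFolderAlias,
     * when set, is used instead so the HTML does not expose a local path.
     */
    static String imageSrc(String imagesFolder, String imagesFolderAlias, String fileName) {
        String base = (imagesFolderAlias != null && !imagesFolderAlias.isEmpty())
                ? imagesFolderAlias
                : imagesFolder;
        // Normalise Windows separators and avoid a doubled slash.
        base = base.replace('\\', '/');
        return base.endsWith("/") ? base + fileName : base + "/" + fileName;
    }

    public static void main(String[] args) {
        // Without an alias the local folder leaks into the HTML.
        System.out.println(imageSrc("C:\\temp", null, "xyz.jpg"));
        // With an alias the HTML points at a web-accessible location.
        System.out.println(imageSrc("C:\\temp", "https://example.com/images", "xyz.jpg"));
    }
}
```

In this model the file is still saved under C:\temp in both cases; only the path written into the HTML changes, so the images folder has to be published at the alias location for the links to resolve.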
Also, you should note that HTML and MS Word formats are very different, and it is quite difficult and sometimes impossible to preserve all features of MS Word documents in HTML. Here you can find more information about which Word document features are supported or unsupported when exporting to HTML:
https://docs.aspose.com/words/net/convert-a-document-to-html-mhtml-or-epub/
Best regards.

amey7p · May 27, 2010, 2:18am

1: I am not able to distinguish between HtmlExportImagesFolderAlias and HtmlExportImagesFolder; both seem the same to me.
2: My doubt above is still pending: why is it giving me 3 paragraphs?

amey7p · May 27, 2010, 5:40am

OK, I got some help from the HtmlExportImagesFolderAlias property, which avoids local paths. Now the issue with paragraphs remains; please help me with that.

alexey.noskov · May 27, 2010, 8:11am

Hi

Thank you for the additional information. It is great that you have already resolved the problem with the image URL.
Regarding paragraphs: since these three items in your document are three paragraphs, Aspose.Words outputs them as three paragraphs. Please see the attached screenshot.
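Since each line is its own paragraph, one possible way to club them together is to group consecutive value paragraphs under the most recent heading paragraph. The sketch below shows only that grouping step in plain Java, under some assumptions: Para is a stand-in for a document paragraph (its style name plus its text or HTML), and "Style1" is assumed to be the style of the cell names. Neither the class ParagraphGrouper nor Para exists in Aspose.Words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the grouping idea only.
class ParagraphGrouper {
    /** Stand-in for a document paragraph: style name plus content (text or HTML). */
    static final class Para {
        final String style;
        final String content;
        Para(String style, String content) {
            this.style = style;
            this.content = content;
        }
    }

    /**
     * A paragraph whose style equals headingStyle starts a new key; every
     * following paragraph is appended to that key until the next heading.
     * Paragraphs that appear before the first heading are ignored.
     */
    static Map<String, List<String>> group(List<Para> paras, String headingStyle) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        for (Para p : paras) {
            if (headingStyle.equals(p.style)) {
                current = result.computeIfAbsent(p.content, k -> new ArrayList<>());
            } else if (current != null) {
                current.add(p.content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Para> paras = Arrays.asList(
                new Para("Style1", "Basic Course of Events"),
                new Para("Style2", "sdddd2323 jhjhjhjhhh"),
                new Para("Style2", "sd23235444444444"),
                new Para("Style2", "sdsd"));
        // The three value lines end up under one key, in document order.
        System.out.println(group(paras, "Style1"));
    }
}
```

In a real document the content of each Para would come from the per-paragraph conversion discussed earlier in the thread, and the values in each list could be concatenated into a single HTML string before being stored.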
Best regards.

amey7p · May 27, 2010, 8:50am

Alexey, those are HTML paragraphs, i.e. some HTML text assigned against that node.
So how can I deal with this?

alexey.noskov · May 27, 2010, 9:17am

Hi

Thank you for the additional information. It makes no difference how you inserted paragraphs (or other nodes) into the document; they are regular paragraphs either way.
When you insert HTML into the document, the HTML is parsed into the DOM (Document Object Model); later, upon saving, the DOM is written in the appropriate format, depending on the SaveFormat you specified.
Best regards.