Free Support Forum - aspose.com

[Java] read HTML from Doc using Aspose.Words

Hi All,
I am new to Aspose. I went through the API documentation, and it is good :). My requirement is to write some values into a Word document (the values can contain HTML), so for writing HTML I used the DocumentBuilder.insertHtml API. I also want to read Word documents whose nodes can contain HTML. I extract node values by paragraph style; here is my code.

public static ArrayList<Paragraph> paragraphsByStyleName(Document doc, String styleName) throws Exception
{
    // Note: styleName is currently unused; the two style names are hard-coded below.
    ArrayList<Paragraph> paragraphsWithStyle = new ArrayList<Paragraph>();
    // Collect every paragraph in the document, including nested ones.
    NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
    for (Object node : paragraphs)
    {
        Paragraph paragraph = (Paragraph) node;
        String name = paragraph.getParagraphFormat().getStyle().getName();
        if ("Style1".equals(name) || "Style2".equals(name))
            paragraphsWithStyle.add(paragraph);
    }
    return paragraphsWithStyle;
}

// Print the plain text of each matching paragraph.
for (Object node : paragraphs)
{
    Paragraph paragraph = (Paragraph) node;
    System.out.println(paragraph.toTxt());
}

When I run this, the node's text value does not include the image I added :(, and I cannot find any API that returns my HTML.
Can somebody help me?

Thanks & Regards,
- Amey

Hi

Thanks for your inquiry. I think you can try an approach similar to the one suggested here to achieve what you need:

http://www.aspose.com/community/forums/230130/extract-text-with-formatting/showthread.aspx#230130

Hope this helps.

Best regards.

Hi Alexey, thanks for your support, but I think that will give me the whole doc file as HTML, not the HTML of an individual node, right?

Hi

Thanks for your inquiry. As I mentioned in the thread linked above, there is no direct way to get HTML from a particular node of the document. To achieve this, copy the node into an empty document and then convert that document to HTML. The result is the HTML of a document that contains only the node you are interested in.

Best regards.

Thanks again. So, as per my code above:

    Paragraph paragraph = (Paragraph) node;
    System.out.println(paragraph.toTxt());

This has to be replaced so that I get the child nodes of each Paragraph and then append them to a new document, right? So I will have to do this for each node, right?

Hi Amey,

That is partly correct. You don't need to get the child nodes of each paragraph; you can call importNode with the second parameter set to true so that it automatically imports all of the paragraph's children as well. You can use the code from the link that Alexey posted above like this:

You should first create a temporary document to import the nodes into.

    Document tempDoc = new Document();

Then take each node (and its children), keeping its formatting, with this line of code:

    tempDoc.getFirstSection().getBody().appendChild(tempDoc.importNode(node, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));

This imports the node into the temporary document you created. The true parameter means the node's children are copied as well, and ImportFormatMode.KEEP_SOURCE_FORMATTING means the node keeps its formatting from the original document.

You can then pass tempDoc to Alexey's ConvertDocumentToHtml method, which returns the HTML of that node and its children.

Please ask if you need any further help.

Thanks,

For the line

    tempDoc.getFirstSection().getBody().appendChild(tempDoc.importNode(node, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));

how should the node variable be initialized?

Hi Amey,

The node is what you need to import into a new temp doc so you can save the output as HTML. Remember that a paragraph is a node, so using the code from your first post it would look like this:

    NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
    for (Object node : paragraphs)
    {
        // Import each paragraph (with its children) into the temporary document.
        tempDoc.getFirstSection().getBody().appendChild(
            tempDoc.importNode((Node) node, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));
    }

Thanks,

Thanks a lot…

I did it, and it gives me the output of all nodes as HTML, along with some HTML head and body tags.
But my requirement is slightly different: I want to extract the HTML of each individual node, not the full document, and that HTML should not contain any additional HTML tags.
I changed my code slightly.
For each paragraph I call the following code:

    Document temp = new Document();
    temp.getFirstSection().getBody().appendChild(temp.importNode((Node) node, true,
        ImportFormatMode.KEEP_SOURCE_FORMATTING));
    String html = ConvertDocumentToHtml(temp);
    System.out.println("html=" + html);

But this does not give me exactly what I need either :(…
Also, it copies images into the setHtmlExportImagesFolder folder and rewrites the image path accordingly. Is there some other workaround for copying images?
My document has some cells with one style, and the values against them have different styles.
Someone can also add HTML (text + image) to the values against a cell.
I want to extract these cell names and their corresponding values (using paragraph styles) and save them in a Map, so I can iterate through the map and save the values in a database; the HTML will then be displayed when the page is rendered :).
I am also attaching a sample file that I want to read by styles.


Hi

Thank you for additional information. Unfortunately, there is no other way to get HTML of a particular node.

Also, as far as I can see, your document contains FormFields. I suppose you just need to get the values of these FormFields. If so, please see the following link to learn how to work with FormFields:

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/working-with-form-fields-1.html

Best regards.

Thanks, but I am interested in extracting

Sdsdsdsd

Sdssssss & image & rest info
from Description(along with proper fonts & text size +image) &

sdddd2323 jhjhjhjhhh

sd23235444444444

sdsd

from Basic Course of Events (along with proper fonts & text size + image).
Both are present on the 2nd page.

Hey Alexey/Aske, can you help me?

Hi Amey,

If you are looking for a way to save certain nodes with their formatting to a database, there is a way to achieve this without converting them to HTML. Please take a look at the documentation here. It shows how you can save a document to a database and then retrieve it. Since you want to save only specific nodes, you can use the same method as above: import the specific nodes into a temporary document and then save that document to the database.

If you really require the nodes to be in pure HTML, you could keep just the tags you need from the converted node and remove the ones you don't (for example, the head and body wrapper tags you mentioned). This should allow you to work with individual nodes as HTML code.
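As a rough illustration of that idea, here is a minimal sketch in plain Java (no Aspose calls) that keeps only what sits between the body tags of an exported HTML string. The class and method names (HtmlFragment, bodyInnerHtml) are made up for this example, and it assumes the exported HTML contains a single body element; a regular expression is fine for stripping a known wrapper like this, but it is not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlFragment {
    // Matches <body ...>...</body>, ignoring case and spanning line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the content between the body tags, or the trimmed input if there is no body. */
    static String bodyInnerHtml(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : html.trim();
    }

    public static void main(String[] args) {
        // Hypothetical exported HTML for a single paragraph.
        String full = "<html><head><title></title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(bodyInnerHtml(full)); // prints <p>Hello <b>world</b></p>
    }
}
```

You would call bodyInnerHtml on the string returned by ConvertDocumentToHtml for each temporary document.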

Could you please clarify the problem you are having with setHtmlExportImagesFolder?

Thanks,

I don't want images to be stored in a local path since I am going to put this on the web; local paths like C:\temp\xyz.jpg won't work for me.
I also found an issue with paragraphs: in my attached document, when I read the Paragraph child nodes,
I get 3 different paragraphs for these 3 lines:
sdddd2323 jhjhjhjhhh
sd23235444444444
sdsd
whereas those lines sit under a single Basic Course of Events, so I want to club them together.
You can check the 2nd page of my attached doc.
I can't save the document directly to the DB. I want the output in a Java HashMap where the key is the node name and the value is the value against that node.
Ex.
{Description=<HTML for Sdsdsdsd Sdssssss+image>,Basic Course of Events=<HTML for sdddd2323 jhjhjhjhhh sd23235444444444 sdsd>}

Hi

Thank you for the additional information. You can specify ImageFolderAlias so that your images are available from the web. Please see the following link for more information:

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/com/aspose/words/saveoptions.html#HtmlExportImagesFolderAlias
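To illustrate how the two settings relate, here is a small model in plain Java. It is not Aspose code: it only assumes that the images folder is where the image files are written on disk, while the alias is the prefix used to build the image URLs inside the HTML. The class name, the paths, the URL, and the file name are all made up for the example.

```java
final class ImageAliasModel {
    /** Joins a folder (or URL prefix) and a file name with exactly one slash. */
    static String join(String prefix, String fileName) {
        return prefix.endsWith("/") ? prefix + fileName : prefix + "/" + fileName;
    }

    public static void main(String[] args) {
        // All three values below are hypothetical.
        String imagesFolder = "/var/www/static/doc-images";          // where files are written
        String imagesFolderAlias = "https://example.com/doc-images";  // what the HTML refers to
        String fileName = "image.001.png";

        System.out.println("file on disk: " + join(imagesFolder, fileName));
        System.out.println("<img src=\"" + join(imagesFolderAlias, fileName) + "\">");
    }
}
```

Under that assumption, the folder must be a location your web server actually serves, and the alias is the address a browser uses to reach it.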

Also, you should note that HTML and MS Word formats are very different and it is quite difficult and sometimes impossible to preserve all features of MS Word documents in HTML. Here you can find more information about Word document’s features which are supported/unsupported upon exporting to HTML:

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/save-in-the-html-format.html

Best regards.

1: I am not able to distinguish between HtmlExportImagesFolderAlias and HtmlExportImagesFolder; both seem the same to me.
2: My doubt above is still pending: why is it giving me 3 paragraphs?

OK, I got some help from the HtmlExportImagesFolderAlias property, which avoids local paths. Now the issue with paragraphs remains; please help me with that.

Hi

Thank you for the additional information. It is great that you have already resolved the problem with the image URL.

Regarding paragraphs, since these three items in your document are three paragraphs, Aspose.Words outputs them as three paragraphs. Please see the attached screenshot.
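If the three paragraphs need to end up as one value, one option is to group them yourself after extraction. The sketch below is plain Java with no Aspose calls: Para is a hypothetical stand-in for a paragraph (its style name, plain text, and already exported HTML), and it assumes that label paragraphs such as "Basic Course of Events" use one style ("Style1" here, which is only a guess) and that everything up to the next label belongs to that label.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StyleGrouping {
    /** Hypothetical stand-in for one paragraph: style name, plain text, exported HTML. */
    static final class Para {
        final String styleName;
        final String text;
        final String html;

        Para(String styleName, String text, String html) {
            this.styleName = styleName;
            this.text = text;
            this.html = html;
        }
    }

    /**
     * Walks the paragraphs in document order. A paragraph in labelStyle starts an
     * entry keyed by its text; each following paragraph's HTML is appended to that
     * entry until the next label. Paragraphs before the first label are skipped.
     */
    static Map<String, String> groupByLabel(List<Para> paragraphs, String labelStyle) {
        Map<String, String> result = new LinkedHashMap<>();
        String currentKey = null;
        for (Para p : paragraphs) {
            if (labelStyle.equals(p.styleName)) {
                currentKey = p.text.trim();
                result.putIfAbsent(currentKey, "");
            } else if (currentKey != null) {
                result.merge(currentKey, p.html, String::concat);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Para> paragraphs = Arrays.asList(
                new Para("Style1", "Description", ""),
                new Para("Style2", "Sdsdsdsd", "<p>Sdsdsdsd</p>"),
                new Para("Style1", "Basic Course of Events", ""),
                new Para("Style2", "sdddd2323 jhjhjhjhhh", "<p>sdddd2323 jhjhjhjhhh</p>"),
                new Para("Style2", "sd23235444444444", "<p>sd23235444444444</p>"),
                new Para("Style2", "sdsd", "<p>sdsd</p>"));
        // The three value paragraphs end up concatenated under one key.
        System.out.println(groupByLabel(paragraphs, "Style1"));
    }
}
```

A LinkedHashMap is used so the entries keep document order; a plain HashMap would also work if order does not matter.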

Best regards.

Alexey, those are HTML paragraphs, i.e. some HTML text assigned against that node.

So how can I deal with this?

Hi

Thank you for the additional information. It makes no difference how you inserted paragraphs (or other nodes) into the document; either way, they are regular paragraphs.

When you insert HTML into the document, the HTML is parsed into the DOM (Document Object Model). Later, upon saving, the DOM is written in the appropriate format, depending on the SaveFormat you specified.

Best regards.