Save HTML fragment

macory · July 15, 2011, 11:53am

Hi,
I have an unusual requirement. I need to extract parts of a Word document, then save these parts as an HTML fragment (as HTML, but without the html and body tags).
I can easily extract the parts I need from the Word document, import each one into a new Aspose Words Document object, then save as HTML. It works well. However, if I use Java code to remove the html and body tags then I run into a world of character conversion issues.

Any ideas how Aspose Words might make this task easier?

Thanks.
MC

AndreyN · July 15, 2011, 12:35pm

Hello
Thanks for your request. I think in this case you should save the document as HTML to a stream, read the HTML string from that stream, and then remove the unneeded parts from the string.
Best regards,
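
For readers following along, the "remove the unneeded part" step could be sketched as below. This is a minimal sketch, not an Aspose.Words feature: `HtmlFragmentExtractor` and `bodyContent` are hypothetical names, and the regular expression assumes the simple, well-formed output shown in this thread (one body element, no `</body>` text inside scripts or comments).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words: returns the markup between
// <body ...> and </body>, or the input unchanged if no body element is found.
final class HtmlFragmentExtractor {

    // DOTALL lets '.' span line breaks; CASE_INSENSITIVE also accepts <BODY>.
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String bodyContent(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : html;
    }

    public static void main(String[] args) {
        String html = "<html><head><title></title></head>"
                + "<body><div><p>\u201Ctest\u201D</p></div></body></html>";
        System.out.println(bodyContent(html));
    }
}
```

The extraction works on a Java String, so it only behaves well once the bytes from the stream have been decoded with the right charset.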

macory · July 15, 2011, 12:49pm

Andrey,
Yes, that is exactly what I have been trying to do - with limited success.
The problem is that as soon as you read the .html file into a Java stream, you are forced to do character conversion (if you want to get it into a String). Once you do this then you run into conversion issues with characters like the double left quotes from Word.

Thanks for the response. I will work it out.

MC

macory · July 15, 2011, 1:53pm

I sort of mis-spoke in my last reply. I am going straight from the Aspose Words Document to a stream (no save to the file system in between)…

ByteArrayOutputStream baos = new ByteArrayOutputStream();
doc.save(baos, SaveFormat.HTML);
String sHtml = baos.toString();

The problem here is that there is always a Charset behind the scenes (in this case UTF-8 by default) which results in characters like “ (open double quotes) being translated into something else.
At this point I am forced to map such a character back to &ldquo;. I did not want to have to do such mappings myself.

MC

alexey.noskov · July 18, 2011, 2:28am

Hi
Thanks for your request. Please try specifying the encoding explicitly when converting the stream to a string. Please see the following code:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
doc.save(baos, SaveFormat.HTML);
String sHtml = baos.toString("UTF-8");

Hope this helps.
Best regards,
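
To see why the charset used for decoding matters, here is a small self-contained sketch. It does not call Aspose.Words; the bytes written to the stream are a stand-in for `doc.save(baos, SaveFormat.HTML)`, on the assumption that the saved HTML is UTF-8 (as the `charset=utf-8` meta tag later in this thread suggests). `StreamDecodingDemo` and `decode` are illustrative names only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative class; the names are hypothetical and not from the thread.
final class StreamDecodingDemo {

    // Decodes the stream's bytes with an explicit charset. With UTF-8 this
    // gives the same result as baos.toString("UTF-8").
    static String decode(ByteArrayOutputStream baos, Charset charset) {
        return new String(baos.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Stand-in for the saved document: UTF-8 bytes of a small fragment.
        byte[] saved = "<p>\u201Ctest\u201D</p>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(saved, 0, saved.length);

        String good = decode(baos, StandardCharsets.UTF_8);      // quotes survive
        String bad = decode(baos, StandardCharsets.ISO_8859_1);  // each quote becomes 3 chars

        System.out.println(good.length());
        System.out.println(bad.length());
    }
}
```

Decoding with the charset the bytes were written in preserves the curly quotes; any other single-byte charset turns each three-byte quote into three unrelated characters.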

macory · July 18, 2011, 9:55am

Yes, I already tried that… and every other Charset I could think of such as US-ASCII, ISO-8859-1 and UTF-16BE.
If you take a stream in Java, with HTML generated by Aspose Words, and convert it to a String, there are a whole host of characters which cannot be mapped (for lack of a better term). Characters like the Unicode representation for the left double quotes get turned into strange characters.
I think what you have is the Unicode HTML escape sequence in your HTML for characters like the left and right double quotes. When you convert this to UTF-8 it becomes some other character. I believe this is because UTF-8 simply has no concept of Unicode HTML escape sequences.
The open double quotes character, for example, may be represented in HTML with the following series of ASCII characters: &ldquo;. I think in your HTML you used the Unicode number for this: 8220 decimal or 0x201C in hexadecimal.
What I have done since I sent the first email was to parse the HTML String and map these characters back to the HTML escape sequences.
It works fine, but I did not want to have to take on such a mapping; who knows how long the list of characters is that I might encounter.
Thanks.
MC
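
If a mapping step is kept, one way to avoid maintaining a list of characters is to turn every non-ASCII code point into a numeric character reference. The sketch below assumes the HTML has already been decoded correctly into a Java String; `NumericEntityEscaper` and `escapeNonAscii` are hypothetical names, not an Aspose.Words API.

```java
// Hypothetical helper: replaces every non-ASCII code point with a decimal
// numeric character reference, so no per-character mapping table is needed.
final class NumericEntityEscaper {

    static String escapeNonAscii(String html) {
        StringBuilder sb = new StringBuilder(html.length());
        for (int i = 0; i < html.length(); ) {
            int cp = html.codePointAt(i);      // handles surrogate pairs too
            if (cp < 0x80) {
                sb.append((char) cp);          // plain ASCII passes through
            } else {
                sb.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+201C and U+201D are the left and right double quotation marks.
        System.out.println(escapeNonAscii("\u201Ctest\u201D"));
    }
}
```

The result is pure ASCII, so it survives being written with any ASCII-compatible charset.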

macory · July 18, 2011, 9:56am

… somehow my last post got a bit garbled.

alexey.noskov · July 18, 2011, 10:27am

Hi
Thanks for your request and additional information. As I can see, in my case left and right double quotes are not escaped at all. Here is the HTML produced on my side using the code I mentioned:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
    <meta name="generator" content="Aspose.Words for Java 10.2.0.0" />
    <title></title>
</head>
<body>
    <div>
        <p style="font-size: 11pt; line-height: 115%; margin: 0pt 0pt 10pt">
            <span style="font-family: Calibri; font-size: 11pt">"</span><span style="font-family: Calibri;
font-size: 11pt">test</span><span style="font-family: Calibri; font-size: 11pt">"</span>
        </p>
    </div>
</body>
</html>

Could you please attach your document here for testing? Maybe I am testing with other characters.
Best regards,

macory · July 18, 2011, 10:44am

Yes, here is a test Word document I have been using.

alexey.noskov · July 18, 2011, 10:55am

Hi
Thank you for the additional information. Everything still works as expected. I saved the output string to a file (see the attachment). As you can see, the HTML looks correct. Maybe these characters are broken by further processing on your side.
Best regards,

macory · July 18, 2011, 11:11am

As you know, the devil is in the details. Could I please correspond directly with you? I don't want to write a book here in the forum.

macory · July 18, 2011, 11:36am

I went back to an earlier version of my test code and examined the HTML String before writing it to file. It looks OK - I see the double quotes in the HTML String. So the issue must be in the way I save the String to file.
Currently I am using RandomAccessFile. This must be the issue…

RandomAccessFile raf = new RandomAccessFile(outPath, "rw");
raf.writeBytes(sHtml);
raf.close();

Thanks for your help.
MC
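
A note on why `RandomAccessFile.writeBytes` is a likely culprit: it writes one byte per char and discards the high eight bits, so any character above U+00FF is damaged. The sketch below compares it with encoding the String to UTF-8 first. `WriteBytesDemo` and `roundTrip` are illustrative names, and the temp file stands in for `outPath`.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Illustrative class; the names are hypothetical and not from the thread.
final class WriteBytesDemo {

    // Writes s to a temp file and returns the bytes that ended up on disk.
    // encodeFirst == false: RandomAccessFile.writeBytes, which keeps only the
    //                       low eight bits of each char.
    // encodeFirst == true : encode to UTF-8 first, then write the raw bytes.
    static byte[] roundTrip(String s, boolean encodeFirst) {
        try {
            File f = File.createTempFile("fragment", ".html");
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                if (encodeFirst) {
                    raf.write(s.getBytes(StandardCharsets.UTF_8));
                } else {
                    raf.writeBytes(s);
                }
            }
            byte[] onDisk = Files.readAllBytes(f.toPath());
            f.delete();
            return onDisk;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // left double quotation mark
        System.out.println(roundTrip(quote, false).length); // one byte (0x1C)
        System.out.println(roundTrip(quote, true).length);  // three bytes (E2 80 9C)
    }
}
```

Writing the UTF-8 bytes keeps the file consistent with a `charset=utf-8` declaration in the HTML head.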

macory · July 18, 2011, 12:09pm

OK - I’ll try to make this my last word on the subject. If I look at the HTML String produced from your code example it looks OK. That is, I see left double quotes where there should be left double quotes. I can save the String as a .html file and I still see the left double quotes in the file.
However, if I give this file to a modern browser (Firefox or IE for example), it does not render this character as a left double quote. The character in my example is 0x93. In an ASCII character chart (such as http://www.danshort.com/ASCIImap/indexhex.htm) you will find this as left double quote - however, I believe that characters in the range of 0x80-0x9f are not valid in HTML.
So, I am back to mapping these characters after converting to a Java String using UTF-8.
Thanks for your help.
MC
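
One possible explanation for the 0x93 byte, offered as a guess rather than a diagnosis: 0x93 is how Windows-1252 encodes the left double quotation mark, so the file may have been written with that charset while the HTML head declares UTF-8. The sketch below shows the mismatch. `CharsetMismatchDemo` is a hypothetical name, and it assumes the JVM provides the `windows-1252` charset, which is common but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative class; the names are hypothetical and not from the thread.
final class CharsetMismatchDemo {

    // Assumption: this charset is available on the running JVM.
    static final Charset CP1252 = Charset.forName("windows-1252");

    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // left double quotation mark

        byte[] cp1252 = encode(quote, CP1252);               // single byte 0x93
        byte[] utf8 = encode(quote, StandardCharsets.UTF_8); // bytes E2 80 9C

        // A reader that trusts charset=utf-8 decodes 0x93 as malformed input
        // and substitutes U+FFFD, the replacement character.
        String misread = decode(cp1252, StandardCharsets.UTF_8);

        System.out.printf("%02X %d %04X%n",
                cp1252[0] & 0xFF, utf8.length, (int) misread.charAt(0));
    }
}
```

If this is what happened, writing the file as UTF-8 (or changing the declared charset to match the bytes) would remove the need for a mapping step.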

alexey.noskov · July 18, 2011, 3:03pm

Hi
Thank you for the additional information. Even when you convert the document to HTML using MS Word, these characters are not replaced with a numeric character reference or HTML entity. Since Aspose.Words mimics MS Word, I suppose the behavior is correct.
By the way, I tried opening the HTML with left and right double quotes in Firefox, IE and Google Chrome on my side, and these characters are displayed properly.
Best regards,