Wrong font writing HTML

I am writing a Word document from a string of HTML (see below). The HTML uses the Arial font, and in the code I explicitly set the document builder font to Arial. However, only the first part of the document is converted as Arial; partway through, the font changes to Times New Roman. (The font changes after the table cell containing the image.)
The code I am using is as follows:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);

// Set the paper size and margins
builder.getPageSetup().setPaperSize(PaperSize.A4);
builder.getPageSetup().setLeftMargin(71);
builder.getPageSetup().setRightMargin(71);
builder.moveToDocumentStart();
builder.getFont().setName("Arial");
builder.getFont().setSize(11.0);
builder.insertHtml(htmlText);

The HTML text is as follows:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <style type="text/css">
        body {
            font-family: Arial;
        }

        h1 {
            font-family: Arial;
            font-size: 24pt;
            font-weight: normal;
            color: #003366;
        }

        p {
            font-family: Arial;
            font-size: 11pt;
            font-weight: normal;
        }

            p.fineprint {
                font-size: 8pt;
                text-align: center;
            }

        span.comment {
            border: solid 1px #FFFF00;
            background-color: #FFFFCC;
        }
    </style>
    <meta name="generator" content="EditLive! 6.3.3.69" />
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
    <table border="0" cellpadding="0" width="100%" cellspacing="0">
        <tr>
            <td valign="middle" width="42%">
                03 May 2008<br />
                <br />
                <br />
                LPM Bohemia - The Tent Company<br />
                The Aga Buildings,<br />
                Lamberhurst Road,<br />
                Kent.<br />
                TN12 8DP<br />
                <br />
                P2P000001
            </td>
            <td width="58%"><img alt="" src="file:///C:/Select/Applications/Planned2Perfection/Images/p2pLogo.jpg" /></td>
        </tr>
    </table>
    Dear ,
    I am pleased to confirm a booking with you as follows:
    <table border="0" cellpadding="0" width="704" cellspacing="0">
        <tr>
            <td width="94"><strong>Date:</strong></td>
            <td width="612">Thursday, 17 April 2008</td>
        </tr>
        <tr>
            <td valign="top" width="94"><strong>Venue:</strong></td>
            <td width="612">
                The Dorchester<br />
                Park Lane,<br />
                London.<br />
                W1A 2HJ
            </td>
        </tr>
        <tr>
            <td valign="top" width="94"><strong>Room:</strong></td>
            <td width="612">&#160;</td>
        </tr>
        <tr>
            <td width="94">&#160;</td>
            <td width="612">&#160;</td>
        </tr>
        <tr>
            <td width="94">&#160;</td>
            <td width="612">&#160;</td>
        </tr>
        <tr>
            <td width="94">&#160;</td>
            <td width="612">&#160;</td>
        </tr>
    </table>
    &#160;
</body>
</html>

Hello Mark!
Thank you for your inquiry.
CSS styles are currently not imported by Aspose.Words. This feature is planned for development in the .NET version, and when we complete it we’ll port the implementation to Java. Currently you can use only direct formatting.
I’ll investigate what happens near the cell with the image and provide you with more information. Maybe we can suggest an easy workaround.
Regards,

Hello!
I have created issue #4998 in our defect database. It should first be fixed in the .NET version and then ported to Java. A smaller sample is enough to reproduce the issue; no images are needed at all:

In the table

After the table
Formatting is lost after the table. So as a workaround you can insert the HTML in smaller parts: each table, and the content between the tables, separately.
Regards,

Thanks for your reply. However, I’m not clear what you are suggesting as a workaround. Can you provide an example of how to make Aspose re-apply the correct font after the inclusion of the table? It is quite possible for the HTML to contain a number of tables.
Thanks
Mark

Hello again!

  1. You can split HTML into smaller parts at points where tables end. Finding particular tags in HTML is relatively easy if you are familiar with regular expressions.
  2. You can insert the whole HTML and re-apply the lost attributes directly in the document model. But in this case you need to know where and what to apply. If the whole HTML should be formatted uniformly, say Arial 11, that’s also relatively easy. To find the place where the HTML was inserted, you can insert it at a bookmark.
Regards,

Thanks for this. However, I may not always know the font that the user wants to use in the HTML. As a workaround the HTML can be changed to explicitly set the font, but the users of my application may not be too familiar with HTML, so this is far from ideal.
Can you tell me when there will be a fix for this bug? It is urgent that I have a working version for my application.
Thanks

Hello Mark!
The .NET version is our mainstream product. We first implement new features and fix defects in .NET, and after that port the changes to the Java version. The next .NET release is coming in a few days, so I cannot promise that the issue will be fixed in it. In my experience, HTML parsing changes can take a lot of development effort. Let’s aim for the release after that, which is about one month away. Porting to Java might then take considerable time. As far as I know, changes are ported in fairly large batches, not one defect at a time. In any case, you may have to wait several months for this fix.
Let’s try workarounds. I realize that end users usually cannot be expected to change anything in the HTML. Please tell us how your application looks to the end users. I expect they feed some HTML documents to your application as they are, without knowing that Aspose.Words works inside the engine. I also assume the initial formatting (Arial 11 in the sample) is taken from somewhere and is not hardcoded.
If so you can preprocess the input HTML:
- split it into parts at the points where tables end,
- set initial formatting before inserting each part,
- call insertHtml for each of those parts.
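
The splitting step above could be sketched roughly as follows. This is only an illustration in plain Java, not part of Aspose.Words: the class name HtmlTableSplitter and the method splitAfterTables are hypothetical, and the sketch assumes it is given the body content only (not the head section) and that a simple regular expression is good enough for the editor's output. It tracks nesting depth so that a table inside a table cell does not cause a split in the middle of the outer table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Aspose.Words class.
final class HtmlTableSplitter {
    // Matches opening and closing table tags, case-insensitively.
    // Group 1 is "/" for a closing tag and empty for an opening tag.
    private static final Pattern TABLE_TAG =
            Pattern.compile("<(/?)table\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Splits an HTML fragment into parts, each ending right after the
    // closing tag of a top-level table. Text after the last table is
    // returned as a final part unless it is only whitespace.
    static List<String> splitAfterTables(String html) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int depth = 0;
        Matcher m = TABLE_TAG.matcher(html);
        while (m.find()) {
            if (m.group(1).isEmpty()) {
                depth++;                      // opening <table ...>
            } else {
                depth--;                      // closing </table>
                if (depth <= 0) {             // a top-level table just ended
                    depth = 0;                // tolerate stray closing tags
                    parts.add(html.substring(start, m.end()));
                    start = m.end();
                }
            }
        }
        String rest = html.substring(start);
        if (!rest.trim().isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }
}
```

With the builder from the first post, each returned part could then be inserted in a loop, calling builder.getFont().setName("Arial") and builder.getFont().setSize(11.0) before builder.insertHtml(part) for every part. Whether this fully avoids the font change has not been verified here; it only follows the steps listed above.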
As another workaround, you can insert the HTML as is and afterwards change whatever you need in the document model. This depends on the specifics of the task. Workarounds don’t solve the general case but might be useful in some conditions.
Regards,

My application contains a WYSIWYG HTML editor that the users use to create letter templates. This editor generates the HTML, so the end user does not need to understand or use HTML. I then use Aspose.Words to convert this into Word. The entire process is seamless to the user. As such, it is not really an option to make changes to the HTML, as I have no idea what the structure of the HTML will be.
I have also spotted another issue. In the HTML there is an address in one of the table cells, and each line of the address is separated by a <br /> tag so that it appears on a new line. This works; however, the second and subsequent lines all have a leading space, which makes the lines appear indented. There is no space in the HTML, and this spoils the formatting of the address. Can this be prevented?

Hello Mark.
It is really difficult to predict what output that third-party HTML editor could produce. As you can see, Aspose.Words supports HTML import with some restrictions, not perfectly, and we don’t pretend otherwise. So we cannot guarantee that everything will be converted well. If you know of specific issues in this system, you can try programmatic workarounds. I don’t claim this is very nice, but I’d like to show possible ways of resolving it. HTML is a non-strict format, since it allows many “degrees of freedom”. But you can use software libraries such as HtmlAgilityPack to parse and preprocess it. HtmlAgilityPack in particular is designed to recognize many format deviations and dialects, and it is stable enough to recommend. I’m sorry again that any workarounds are needed at all.
The issue with accidental spaces after <br /> tags is already fixed in .NET version 5.0.2. The issue number is #4424. It will be ported shortly. I’ll ask our Java developer to review this thread and give you a time estimate for that.
Regards,

Is there any update on when the new release will be available for Java?

Hi
Thanks for your inquiry. We plan to release the next version of Aspose.Words for Java within a month.
Best regards.