Hi,
We operate an insurance system which currently creates and manages all its documents in HTML. As you might imagine these are specifically laid out and formatted as they need to be. We now need to migrate the system to use MS Word.
We are evaluating Aspose.Words to convert the HTML documents into MS Word, as this will save us from having to rewrite the hundreds of existing HTML document templates as MS Word templates.
Now, on a first test, the MS Word document doesn’t look very much like the original HTML document. I note the comment in the FAQ regarding CSS support, and I guess my question is: is it going to be practical to use Aspose.Words to accomplish what we want, or is it going to take too much work?
I have attached the resulting word document and also a copy of the HTML in PDF format (OK second attachment added in next post). The differences I note at first glance are:
Image size is currently specified in CSS which is not working.
Page margin sizes are incorrect
Fonts used are incorrect
Table borders are being displayed instead of hidden
Absolutely positioned text is not being positioned correctly nor aligned correctly (Footer on page 1).
Bold text is not coming out bold.
Some centered text is not being centered.
Page breaks are not being honoured
So is there a practical way to overcome these issues? Or alternatively, is there some kind of intermediate format I could use that would help preserve the look of the document? I don’t ever need to export it to HTML again; once it’s an MS Word document which looks as it should, it can stay that way.
Any suggestions appreciated.
Thanks,
Dale
Hello Dale.
Thank you for your interest in Aspose products.
Sorry, I don’t see the second attachment; only the PDF document is here. Please provide an HTML sample (or samples) showing the issues. To attach multiple files, you can pack them with any popular file archiver; ZIP or RAR is usually used for that purpose. The forum allows only one attachment per post. When attaching a non-archived HTML file, please change its extension to an accepted one, for instance “.txt”. This is also a forum restriction.
Most of the issues you describe probably occur because Aspose.Words doesn’t support reading CSS styles. This feature is planned for development and will hopefully be implemented in 2008. I will investigate each issue and provide you with more information on why it happens and how it can be overcome.
Regards,
Hello Dale!
Thank you for additional materials.
As I understand from your explanation, you are converting HTML to DOC, and the PDF is only intended to show how it should look. What was strange to me is that the HTML cannot be rendered in my browser: it gives an empty page. I’m using Avant, which is based on IE7. When we tried Firefox, it rendered okay. But note that it is not good if Avant and IE cannot render this document. We detected some deviations from the HTML standard.
For instance it was:
Dale Burrell-Sansha/07:Test Certificate
But should be:
Dale Burrell-Sansha/07:Test Certificate
Most of the issues you raised show up because Aspose.Words doesn’t support CSS import. This is known as issue #40 in our defect database. Once we support this, I guess paragraph styles and character styles will be implemented first. Features such as image size defined in CSS will be considered after that, since that is a relatively rare case. In theory, not all of the CSS standard can be imported into MS Word; some of its features cannot be represented.
Floating content in HTML is currently not supported. The issue is known as #4488. It is intended to be implemented in HTML export first. I’ll link this thread to the issue so that we can consider import too.
Regarding page breaks in HTML, I’m not sure what you mean. An HTML document is a single page in itself. You can try an experiment: create a document in MS Word with only two paragraphs separated by a page break. If you save it as HTML with MS Word and then open it in a browser, you will see the two paragraphs separated vertically, as if there were an empty paragraph between them.
I noted that the HTML was generated by some tool, not written by hand. Maybe we could suggest something using that original tool. You can also consider alternative solutions for parsing and converting this HTML, for instance using HtmlAgilityPack. If you provide us with more information, we’ll try to advise further.
Regards,
Hi - and thanks for looking into this for me - very much appreciated!
You are correct, it was the title that was causing the problem with displaying in IE - and that was because I edited the file by hand to ensure it worked stand alone for this demonstration, because as you note we are generating it automatically.
Just to clarify you can make a page break in HTML as we are doing with the following:
Any suggestions you have that might enable us to use our HTML with Aspose.Words would be much appreciated. Just to note, we wouldn’t want to make too many changes to the HTML because most of that is stored in the database as document templates, and if we had to modify it too much then we might as well create MS Word Templates - which we know will work.
HtmlAgilityPack looks interesting… although it’s not clear to me what I should do with it. I had also wondered if there is another tool out there which converts HTML to WordML or similar, but I guess that is pretty much what you are doing with Aspose.Words.
Please do let me know if you have any further suggestions and if you can suggest what I might try and accomplish using HtmlAgilityPack.
I am also wondering if there is an intermediate file format that might help convert the HTML to MS Word - e.g. is it possible to convert PDF to MS Word since we already have the PDF looking correct?
Hi Dale!
Of course I know about the “page-break-before” property. It’s good if you can use it this way in your system. I investigated how MS Word does page breaks in HTML. They practice their own magic. Currently Aspose.Words outputs page breaks as follows:
I cannot suggest any ready-made solution for reading documents with CSS formatting. HtmlAgilityPack is a library that can help you develop a custom HTML parser. I only mentioned it as an example, and I’m not sure this would be easier than creating MS Word templates manually. Maybe we could suggest something regarding the application that you used to generate these HTML documents. Is it still available? Can you control the generation process in any way?
HTML to WordML conversion is performed by Aspose.Words too. It really doesn’t matter what the output format is. Conversion is done in two steps: reading one format and writing another. Since we cannot read HTML with CSS styles properly, we cannot convert it to any format.
Conversion from PDF to DOC is theoretically possible. We are researching it, but this is a long-term prospect. As you know, the PDF format is “printer-oriented”, so much logical information is lost. Proper conversion from PDF to any native MS Word format is a very complex task bordering on artificial intelligence. I have never seen a good converter. That’s why we don’t promise to develop our own converter soon.
I’m sorry that I cannot advise anything practical for this case.
Regards,
Hi Klepus,
Thanks for the reply - I only just noticed it as I thought I’d get an email to notify me.
Anyway, did you see my private message? I can’t find where I would view private messages if I had any, nor where I can see the ones I’ve sent.
Since then I’ve looked at as many components as I can find that do something similar and you guys seem to be the only people considering supporting HTML styles and classes.
So I was wondering if there might be some way we could speed up the development on the parsing of HTML styles when converting into MS Word. Whether we paid you something, whether we promised you a certain number of sales or whether we wrote some of the code.
Feel free to contact me directly at dale@jatech.co.uk
Cheers,
Dale
Hello!
Thank you for your proposals.
We usually get notifications about private messages. Have I failed to answer anything? I checked them and found a message from you, but it had been answered.
How to speed up our development is a difficult question. I can only confirm that we are going to implement CSS loading. This task is relatively important for us but, as you might expect, it is also very complex. I think that when a beta is available you will be able to participate in testing. There are many “degrees of freedom” in both HTML and CSS, so it could be difficult to cover them fully. If you show us documents that are typical for you, we can fix issues in a way that more closely addresses your particular requirements.
I linked this thread with issue #40 and will probably consult with you during the implementation. Your feedback is important to us.
Thank you again.
Regards,
I think the problem is that I can’t access my private messages. Can we simple users access those, and if so, where?
The example HTML is attached in the zip file earlier in this thread.
It would make a big difference to us if the basic formatting styles were translated into MS Word: bold, italic, underline, font-family etc. Also the page-break-before. Ideally this information would be gleaned from the class rather than an inline style, but I guess that could be difficult.
I wonder if some of the code from a project such as Firebug (for Firefox) would be of help to you? Firebug allows you to inspect all the styles of an HTML document and works out which classes are in effect etc.
We would be very happy to be involved with any testing.
We’ll consider your kind advice.
To see your private messages, try navigating to this link: http://www.aspose.com/community/user/privatemessages/default.aspx
You should have access to your messages. If this doesn’t work for you, then it could be a defect in our site and I’ll contact the appropriate person to fix it.
Have a nice day!
Yes - that link works fine… but there seems to be no way to navigate to that page from my account? So at the moment the only way I can see to access my private messages is to manually enter that URL - is that as intended?
Hello!
I was also unable to find them via navigation and had to type the URL. Our site is being reconstructed gradually and maybe this was overlooked. I wrote to our admin about the private messages and they’ll address the issue. For now you can save my link somewhere and navigate to it when needed. Sending or viewing private messages is quite rare.
Regards,
Hi Guys,
OK we are now live - we have purchased Aspose.Words.Net and are running the latest version.
So we are now eagerly awaiting further support for CSS styles either inline or via class, either would be great.
As before we are happy to assist with development of this if there is a way we can speed things up.
An immediate question: how can I center things in HTML such that they are centered in Word? I’ve tried the text-align style, I’ve tried the nasty <center> tag, and I’ve tried a table with align=center, but nothing seems to make my content appear centered. I have a table that I wish to be centered on the page, and the content within the table to be centered within the table cells. Any suggestions?
Cheers,
Dale
Hi
Thanks for your inquiry. Unfortunately, there is no way to insert a centered table into the Word document using HTML. But you can set the alignment after inserting the HTML content. For example, see the following code:
    // Namespaces used by this snippet (File comes from System.IO)
    using System.IO;
    using Aspose.Words;

    // Create document and DocumentBuilder
    Document doc = new Document();
    DocumentBuilder builder = new DocumentBuilder(doc);

    // Insert HTML
    string html = File.ReadAllText(@"Test123\test.html");
    builder.InsertHtml(html);

    // Get collection of all tables in the document (including nested ones)
    NodeCollection tables = doc.GetChildNodes(NodeType.Table, true);

    // Loop through all tables
    foreach (Table tab in tables)
    {
        // Loop through all rows and center each one
        foreach (Row row in tab.Rows)
        {
            row.RowFormat.Alignment = RowAlignment.Center;
        }
    }

    // Save output document
    doc.Save(@"Test123\out.doc");
If you also need centered text inside table cells, you can use <p> tags as shown below.
<td><p style='text-align:center'>test</p></td>
I hope this helps.
Best regards.
Hello!
I would like to clarify that the problem with centering a table in HTML refers to a known issue, #4247, in our defect database: “Import/export HTML align (and possibly CSS text-align) attribute on a table”. We’ll notify you of any progress with it.
Regards,
Hello!
Thank you for your patience.
Unfortunately, table alignment is not yet round-tripped via HTML. I apologize for the inconvenience. As a workaround, you can create a table with invisible padding columns so that the semantically meaningful columns move to the right.
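A rough sketch of what the padding-column idea might look like is given here. The cell widths and output file name are made up for illustration, and it is only an assumption that the importer honours the width and border attributes in your version, so please treat it as a starting point to experiment with rather than a confirmed solution.

```csharp
using Aspose.Words;

class PaddingColumnsSketch
{
    static void Main()
    {
        // Hypothetical widths: an empty cell on each side pushes the
        // meaningful middle column toward the center of the page.
        // Whether width/border attributes survive import is an assumption.
        string html =
            "<table border='0' cellspacing='0' cellpadding='0'>" +
            "<tr>" +
            "<td width='150'>&nbsp;</td>" +
            "<td width='300'><p style='text-align:center'>test</p></td>" +
            "<td width='150'>&nbsp;</td>" +
            "</tr>" +
            "</table>";

        // Same calls as in the earlier example in this thread
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.InsertHtml(html);

        // Hypothetical output path
        doc.Save("out.doc");
    }
}
```

Using a padding cell on the left only moves the content to the right; equal padding cells on both sides approximate centering.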
Regards,