Reading HTML error

tayshuyih · January 19, 2006, 2:15am

Hi,

I am using Aspose to read an HTML document as follows:

docWord = New Aspose.Word.Document(File.FullName)
' Get strText using Document.Range.Text
strText = docWord.Range.Text()

However, Range.Text returns only the first few lines of the HTML document, up to the CSS style tag, not the full document.

Please help.

Thanks

Shu Yih

miklovan · January 19, 2006, 2:54am

CSS import is currently not supported, although we are planning to add it in the near future.

tayshuyih · January 19, 2006, 4:06am

So if an HTML document contains a tag that is not supported, Aspose will just stop "reading" at that tag?

The style tag is quite common in HTML, so shouldn't Aspose at least return whatever valid content follows it? That is, if the tag is not supported, it is fine for the style tag and its properties to come back inside the string, but all the subsequent tags, e.g. table, should still be processed correctly, right?

Does this mean I won't be able to use Aspose to read an HTML document whenever it contains a tag that is not supported?

Thanks

Shu Yih

DmitryV · January 19, 2006, 7:12am

No, Aspose.Word won’t stop HTML import when an unsupported tag is encountered. Your HTML is malformed, and that is the cause of the error. Notice, for example, that the very first table seems to have duplicated tags but is missing a matching pair. I’ve corrected this and attached the working document. I'm not sure, however, whether these were the only errors in your HTML. Please double-check it if you still find the output different from what you expect.
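
A quick way to spot this kind of problem before handing a file to an importer is to count opening and closing tags. The following is only a minimal sketch in Python using the standard library's html.parser; it is independent of Aspose, and the names TagBalanceChecker and unbalanced_tags as well as the sample markup are made up for illustration. Note that HTML allows some closing tags to be omitted (for example on tr, td, li and p), so a mismatch is a hint that the markup deserves a closer look, not proof of an error.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are ignored by the check.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: counts opening and closing tags, repairs nothing."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.opened[tag] += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.closed[tag] += 1


def unbalanced_tags(html):
    """Return {tag: (opening_count, closing_count)} for tags whose counts differ."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    tags = set(checker.opened) | set(checker.closed)
    return {
        tag: (checker.opened[tag], checker.closed[tag])
        for tag in sorted(tags)
        if checker.opened[tag] != checker.closed[tag]
    }


# Made-up sample: a duplicated <tr> with no closing </tr> at all.
sample = "<table><tr><tr><td>Email</td></table>"
for tag, (n_open, n_close) in unbalanced_tags(sample).items():
    print(f"{tag}: {n_open} opening, {n_close} closing")
```

Running the same function over the original file and over a corrected copy shows which tags changed between the two.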

tayshuyih · January 19, 2006, 6:34pm

Hi,

Thanks for the prompt reply. I have tested with the HTML you attached. It looks like the string " Email: aaaa@yahoo.com Tel: 1241242(Home) , 522353(Mobile)" is repeated several times.

In fact, I am using Aspose to read and process lots of HTML and Word documents. My expected output is whatever I would see if I opened the same HTML in an internet browser. There might be some HTML formatting errors in there, but if IE or Firefox can open the file and parse it correctly, I would expect the same result from Aspose. Aspose should be as relaxed in its parsing as most modern browsers. Aspose is meant to replace Microsoft Word or Internet Explorer, right?

Thanks again

Shu Yih

tayshuyih · January 19, 2006, 6:49pm

Hi,

I just tried renaming the extension of the HTML file to .doc and opening it with Microsoft Word. Word doesn’t return exactly the same output as a browser reading the HTML, and OpenOffice does better. But it is at least better than what Aspose returns for either the .doc or the .html version.

Thanks

Shu Yih

DmitryV · January 20, 2006, 1:21am

Agreed. HTML import is one of the highest priority things we are currently working on. I believe we will improve it in the near future, making it closer to what Word or IE return.

tayshuyih · January 22, 2006, 10:47pm

Is there an estimated date for when this will be supported? I need it for my release plan. My application depends heavily on reading Word and HTML documents correctly.

Thanks

Shu Yih

DmitryV · January 23, 2006, 5:03am

I can’t provide an estimate for making HTML import fully match Word, because it is too vast a task. We are working on getting the HTML you provided to be read properly by tolerating badly formed HTML, and I believe this will take several weeks.

alexey.noskov · November 22, 2007, 8:12am

We have released a new version of Aspose.Words that contains a fix for your issue.
The new version of Aspose.Words is available for download from here.
Best regards.