Links in HTML

rhrufftx · March 16, 2007, 10:45am

We are using Aspose.Words to extract text from HTML documents. Unfortunately, we have found that extraction can hang indefinitely if there is no internet connection. Apparently, Aspose.Words tries to download every image referenced in the HTML. This is very strange behavior, and it renders the component unusable for us, as we often run it on machines with no connectivity. In any case, we are only interested in text, not pictures. And why wouldn't the original links just be preserved, rather than the content being downloaded and embedded?

TIA.

miklovan · March 17, 2007, 9:57pm

Yes, it seems reasonable. At least we can probably provide the option not to resolve the referenced images to avoid situations like this. I shall discuss this matter with other members of the team and let you know.

miklovan · March 19, 2007, 11:46pm

I have logged this problem in our defect database as issue #1524. We will try to fix it in the next version of Aspose.Words. Thank you for drawing our attention to this matter.

Best regards,

romank · March 21, 2007, 10:31pm

I think this was discussed in a neighbouring topic, posted by you as well.

The primary design goal of Aspose.Words is not HTML to TXT conversion. Its primary goal is to support all Microsoft Word document formats (DOC, RTF, WordML) very well and to support other formats (HTML, TXT) where possible.

When Aspose.Words loads HTML, it loads it into a rich document object model that can store formatting, pictures, shapes, etc. Therefore, Aspose.Words attempts to load as much information from the HTML as possible, including images and formatting.

Aspose.Words was not designed to strip everything except text from HTML, which is why using it just to convert HTML into TXT is probably a waste of resources.

We will consider adding an option not to actually load images during HTML import, but to create "link only" shapes in the document instead. That said, HTML to TXT is best done with HtmlAgilityPack, not with Aspose.Words. If you agree, I will not give the option to skip loading images a high priority.
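
For readers who only need the text, the idea behind the suggestion above can be sketched without either library. The following is a minimal illustration in Python using only the standard-library `html.parser`; it is not HtmlAgilityPack's API or anything from Aspose.Words, and the names `TextExtractor` and `html_to_text` are hypothetical. The point it shows is that a text-only pass can ignore `<img>` tags entirely, so nothing is downloaded and a missing network connection cannot cause a hang.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text; never fetches anything the markup references."""

    SKIP = {"script", "style"}  # contents of these tags are not visible text
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")
        # <img src=...> and other tags fall through: no network access happens.

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling `html_to_text('<p>Hello <b>world</b></p><img src="http://example.invalid/a.png">')` returns `Hello world`; the image URL is never resolved. A real-world extractor would likely need more care with nested tables, `<pre>` blocks, and character encodings than this sketch provides.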