Parsing HTML documents

I would like to use Aspose mainly for parsing, and I have the following questions:


1) Parsing Word documents and PDFs works, but I couldn’t figure out how to parse HTML. I tried the same code that I use for Word documents, but it didn’t work well: the output contained all the links and a lot of blank space, for example. Is there an efficient way to do it?

2) Concerning PDF and Word text extraction, is it possible to extract the title? I know that PDFs have metadata, but the title in the metadata usually has nothing to do with the actual title of the document.

Thanks.

Hi Fabien,

1. Can you please share your HTML to reproduce the issue?

2. Please check the following links for details on reading document properties, such as the title, with Aspose.Words and Aspose.Pdf respectively.

http://www.aspose.com/docs/display/wordsjava/Working+with+Document+Properties

https://docs.aspose.com/pdf/java/get-pdf-information/

Best Regards,

The page is this one:




and here is part of the result (just a snippet, so that you can see the problem):

HYPERLINK "http://app.readspeaker.com/cgi-bin/rsent?customerid=7764&lang=fr_be&readid=js-article&url=www.rtbf.be%2Finfo%2Fmonde%2Fdetail_attaque-a-tunis-19-morts-et-44-blesses-dont-un-belge%3Fid%3D8934984" Écoutez Dans un enregistrement audio, le groupe terroriste État islamique a revendiqué jeudi l'attaque de la veille au musée Bardo à Tunis. Parmi les touristes tués, il y a une victime belge. HYPERLINK \l "newsImagesPane" Des policiers gardent l'entrée du musée Bardo de Tunis - FETHI BELAID - BELGAIMAGE Mots clés
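The `HYPERLINK "…"` and `HYPERLINK \l "…"` fragments in the snippet above look like Word field codes that ended up in the extracted text. As a rough workaround, assuming the text has already been extracted into a plain Java `String`, they can be stripped with a regular expression. This is only a sketch in plain Java, not an Aspose API: the class name `FieldCodeCleaner` and the method `clean` are hypothetical, and the pattern only handles switches such as `\l` that appear before the quoted target.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose: post-processes already extracted text.
public class FieldCodeCleaner {

    // HYPERLINK, optional single-letter switches such as \l, then a quoted target.
    private static final Pattern HYPERLINK_FIELD =
            Pattern.compile("HYPERLINK\\s+(?:\\\\[a-z]\\s+)*\"[^\"]*\"");

    // Any run of whitespace (spaces, tabs, line breaks).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /** Removes HYPERLINK field codes and collapses whitespace runs to one space. */
    public static String clean(String extracted) {
        String withoutFields = HYPERLINK_FIELD.matcher(extracted).replaceAll(" ");
        return WHITESPACE_RUN.matcher(withoutFields).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample in the same shape as the snippet above.
        String sample = "HYPERLINK \"http://example.com/listen?id=1\" Listen   to the article "
                + "HYPERLINK \\l \"newsImagesPane\" Police guard the entrance";
        System.out.println(clean(sample));
    }
}
```

This only cleans up the symptoms after extraction; it does not change how the HTML is loaded, so other page elements (menus, keywords, captions) will still appear in the text.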

Hi Fabien,

We are investigating the issue and will update you soon.

Best Regards,

Hi Fabien,

The page you mentioned in your post is complex, and Aspose.Words for Java does not support every HTML feature when converting HTML to other formats. Please check `http://www.aspose.com/docs/display/wordsjava/Load+in+the+HTML+%28.HTML,+.XHTML,+.MHTML%29+Format` and `http://www.aspose.com/docs/display/wordsjava/Save+in+the+HTML+%28.HTML,+.XHTML,+.MHTML%29+Format` for more details on the features supported by Aspose.Words for Java.

Could you also share a screenshot and the code you used, so that we can reproduce the issue?

Best Regards,