Free Support Forum - aspose.com

Java - extract text from html files

Hi, I am trying to extract text from an HTML file using the Aspose.HTML library for Java. I couldn’t find anything in the API to extract text. Could someone point me to example code or documentation?

com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(new ByteArrayInputStream(inputFile), "about:blank");

@nitnamby

Do you want to extract the HTML source or only the rendered text? Can you please share a sample HTML file in .zip format along with your expected output? We will test the scenario in our environment and address it accordingly.

I am looking to extract the rendered text, not the source. You can use any HTML file. The only option I can see in the API is writing the text to a file, as in https://blog.aspose.com/html/extract-text-html-java/. Why can’t I get the extracted text back directly, similar to PDF text extraction?

@nitnamby

The option is to convert the HTML document to .txt format. You can then read the saved .txt file, which is similar to extracting text from the HTML. It involves extra steps, i.e. reading the text file from disk, but it would serve the purpose. Please let us know whether this suits you. We will log an investigation ticket to look into your requirements further and share the ID with you.
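As a rough illustration of the save-then-read workaround described above, the sketch below wraps the round trip in a temporary file that is deleted before the method returns, so nothing is left on disk. It uses only the Java standard library. The `TextSaver` interface and the `extractViaTempFile` method are hypothetical names introduced for this example and are not part of Aspose.HTML; the actual HTML-to-text conversion call would go inside the callback. Reading the file as UTF-8 is also an assumption and should match whatever encoding the converter writes.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempTextExtraction {

    /**
     * Hypothetical hook (not an Aspose type): whatever code converts the
     * HTML input and saves the result as plain text at the given path.
     */
    @FunctionalInterface
    public interface TextSaver {
        void saveTo(Path target) throws Exception;
    }

    /**
     * Runs the saver against a temporary .txt file, reads the file back
     * into a String, and always deletes the file afterwards.
     * Assumes the saver writes UTF-8 text.
     */
    public static String extractViaTempFile(TextSaver saver) throws Exception {
        Path tmp = Files.createTempFile("html-text-", ".txt");
        try {
            saver.saveTo(tmp);
            return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        } finally {
            // Clean up even if the conversion or the read fails.
            Files.deleteIfExists(tmp);
        }
    }
}
```

A caller would pass a callback that performs the HTML-to-.txt conversion to `target.toString()` and then return the resulting string in the API response. The file still touches the disk briefly, so this only hides the extra step rather than removing it.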

I don’t want to write it to disk, as my application doesn’t store the extracted text; it only needs to return the extracted text as part of the API response. Also, the Aspose PDF and Word libraries provide APIs to extract text directly, so I am not sure why the HTML library doesn’t offer that functionality.

@nitnamby

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): HTMLJAVA-1442

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.