Java - extract text from html files

nitnamby · February 27, 2023, 2:46pm

Hi, I am trying to extract text from an HTML file using the Aspose.HTML library for Java. I couldn’t find anything in the API to extract text. Could someone point me to example code or documentation?

com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(new ByteArrayInputStream(inputFile), "about:blank");

asad.ali · February 27, 2023, 7:47pm

@nitnamby

Do you want to extract the HTML source or only the rendered text? Can you please share a sample HTML file in .zip format along with your expected output? We will test the scenario in our environment and address it accordingly.

nitnamby · February 27, 2023, 11:20pm

I am looking to extract the rendered text, not the source. You can use any HTML file. The only option I can see in the API is to write the HTML text to a file, as described in "Extract Text from HTML Programmatically in Java | Retrive Text from HTML Webpage". Why can’t I get the extracted text back directly, similar to PDF text extraction?

asad.ali · February 28, 2023, 3:20am

@nitnamby

The available option is to convert the HTML document to .txt format. You can then read the saved .txt file, which gives the same result as extracting text from the HTML. It involves an extra step, i.e. reading the text file from disk, but it would serve the purpose. Please let us know whether this suits you. We will log an investigation ticket to look further into your requirements and share the ID with you.
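For readers who want to try the save-then-read workaround without keeping the file around, here is a minimal sketch using only the JDK. `TempFileTextReader` and `TextConverter` are hypothetical names introduced for illustration; `TextConverter` stands in for whatever library call saves the HTML document as .txt, and reading the file as UTF-8 is an assumption about the converter's output encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of any Aspose API.
class TempFileTextReader {

    // Stand-in for the library call that saves the HTML document as .txt.
    interface TextConverter {
        void convertTo(Path target) throws IOException;
    }

    // Lets the converter write to a temp file, reads it back, then deletes it.
    static String extractViaTempFile(TextConverter converter) throws IOException {
        Path tmp = Files.createTempFile("html-text-", ".txt");
        try {
            converter.convertTo(tmp);
            // Assumption: the converter writes UTF-8 text.
            return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        } finally {
            // Remove the temp file even if conversion or reading fails.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo converter that simply writes fixed text to the target path.
        String text = extractViaTempFile(
                target -> Files.write(target, "Hello, world".getBytes(StandardCharsets.UTF_8)));
        System.out.println(text);
    }
}
```

The text still touches the disk briefly, but nothing is left behind once the method returns.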

nitnamby · February 28, 2023, 9:18am

I don’t want to write it to disk, as my application doesn’t store the extracted text; it only needs to return the extracted text as part of the API response. Also, the Aspose PDF and Words libraries provide APIs to extract text directly, so I am not sure why the HTML library doesn’t offer the same functionality.
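To illustrate what fully in-memory extraction looks like, here is a rough sketch that uses the HTML parser bundled with the JDK (`javax.swing.text.html`) rather than Aspose.HTML. It is a different technique from the one discussed in this thread: it collects the text nodes of the markup and does not lay out or render the page, so the result may differ from true rendered text. The class name `InMemoryHtmlText` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; uses the JDK's Swing HTML parser, not Aspose.HTML.
class InMemoryHtmlText {

    // Parses the HTML string and returns its text chunks joined by single spaces.
    static String extractText(String html) throws IOException {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text found between tags.
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        // 'true' tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Because everything stays in a `StringBuilder`, the returned string can go straight into an API response without any file I/O.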

asad.ali · February 28, 2023, 5:15pm

@nitnamby

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): HTMLJAVA-1442

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

nitnamby · May 3, 2023, 10:20am

Can I get an update on this?

asad.ali · May 3, 2023, 2:33pm

@nitnamby

We are afraid that the earlier logged ticket has not yet been resolved. However, we will inform you as soon as we have an update on it. Please be patient and allow us some time.

We are sorry for the inconvenience.