Convert html to text

marchuber · April 9, 2018, 8:53am

How can I convert HTML to text with Aspose.HTML rather than Aspose.Words? Could you please send us some sample code? Thanks.

Farhan.Raza · April 9, 2018, 6:16pm

Thank you for contacting support.

Aspose.HTML does not support this feature at the moment. However, a feature request with ID HTMLNET-1131 has been logged in our issue management system and linked to this thread, so you will be notified as soon as the issue is resolved.

If you prefer not to use the Aspose.Words API, you can try an alternative approach: convert the HTML file to a PDF document using either Aspose.HTML or Aspose.PDF, and then convert the resulting PDF file to a TXT file using the Aspose.PDF API. Please refer to the help topics below for reference.
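As a rough illustration of the two-step route described above, here is a minimal sketch that uses only Aspose.PDF: the HTML is loaded through HtmlLoadOptions and saved as a PDF, then the text is collected with TextAbsorber. The class name, method name, and file paths are hypothetical, and the sketch assumes the Aspose.PDF for .NET package is referenced. Depending on your version and license, details may differ.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper class; not part of any Aspose API.
public static class HtmlToTextViaPdf
{
    public static void Convert(string htmlPath, string pdfPath, string txtPath)
    {
        // Step 1: load the HTML file with Aspose.PDF and save it as a PDF document.
        using (var htmlAsPdf = new Document(htmlPath, new HtmlLoadOptions()))
        {
            htmlAsPdf.Save(pdfPath);
        }

        // Step 2: reopen the PDF and collect the text of all pages.
        using (var pdf = new Document(pdfPath))
        {
            var absorber = new TextAbsorber();
            pdf.Pages.Accept(absorber);
            File.WriteAllText(txtPath, absorber.Text);
        }
    }
}
```

A call such as HtmlToTextViaPdf.Convert("input.html", "intermediate.pdf", "output.txt") would then produce the text file. Because the text passes through a page layout, line breaks and spacing in the output follow the rendered PDF rather than the HTML source.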

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

Farhan.Raza · April 16, 2018, 5:16am

@marchuber

Thank you for being patient.

We have investigated ticket HTMLNET-1131. Aspose.HTML does not have a dedicated object such as a TextExtractor. However, Aspose.HTML is based on the Document Object Model (DOM), which is common to both HTML and XML, so it is easy to iterate through the DOM nodes and extract their textual content. The following code demonstrates three ways to do this:

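using System;
using System.IO;
using System.Text;
using Aspose.Html;
using Aspose.Html.Dom;
using Aspose.Html.Dom.Traversal;
using Aspose.Html.Dom.Traversal.Filters;

class Program
{
    static void Main(string[] args)
    {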
        // Prepare HTML file
        var htmlFileContent = @"
<html>
<head>
<style> * { color: red; } </style>
</head>
<body>
<p>Text.</p>
<span>Another text block.</span>
<div>One more text block.</div>
</body>
</html>";
        File.WriteAllText("simple.html", htmlFileContent);

        // Create an instance of HTML document
        using (var document = new HTMLDocument("simple.html"))
        {
            // First way: collect text nodes with a node iterator.
            // INodeIterator (https://reference.aspose.com/net/html/aspose.html.dom.traversal/inodeiterator) walks the HTML DOM;
            // SHOW_TEXT limits it to text nodes, and StyleFilter rejects the content of 'style' elements.
            INodeIterator iterator = document.CreateNodeIterator(document, NodeFilter.SHOW_TEXT, new StyleFilter());
            StringBuilder sb = new StringBuilder();
            Node node;
            while ((node = iterator.NextNode()) != null)
                sb.Append(node.NodeValue);
            Console.WriteLine(sb.ToString());

            // Second way: collect text recursively with the custom GetContent method defined below
            Console.WriteLine("----------------");
            Console.WriteLine(GetContent(document.Body));

            // Third way: read the TextContent property of the body element
            Console.WriteLine("----------------");
            Console.WriteLine(document.Body.TextContent);
        }
    }

    static string GetContent(Node node)
    {
        StringBuilder sb = new StringBuilder();
        foreach (var n in node.ChildNodes)
        {
            if (n.NodeType == Node.ELEMENT_NODE)
                sb.Append(GetContent(n));
            else if (n.NodeType == Node.TEXT_NODE)
                sb.Append(n.NodeValue);
        }
        return sb.ToString();
    }
}

/// <summary>
/// A custom node filter that rejects text nodes inside the 'style' element.
/// </summary>
class StyleFilter : NodeFilter
{
    public override short AcceptNode(Node n)
    {
        return n.ParentElement.TagName == "STYLE" ? FILTER_REJECT : FILTER_ACCEPT;
    }
}

If this does not satisfy your requirements, please describe them in more detail so that we can help you accordingly.

marchuber · April 16, 2018, 6:40am

Thanks a lot, it works so far.

How can I remove <script> blocks and // comment blocks, as shown in the attached image?

Code:

using (var document = new HTMLDocument(stream, ""))
{
    // The first way of gathering text elements from document
    // Initialize the instance of node iterator (https://reference.aspose.com/net/html/aspose.html.dom.traversal/inodeiterator) that allows to navigate across HTML DOM
    INodeIterator iterator = document.CreateNodeIterator(document, NodeFilter.SHOW_TEXT, new StyleFilter());
    StringBuilder sb = new StringBuilder();
    Html.Dom.Node node;
    while ((node = iterator.NextNode()) != null)
        sb.Append(" " + node.NodeValue);

    outputPath = Path.Combine(outputRootPath, Path.GetFileNameWithoutExtension(path) + " " + DateTime.Now.Ticks.ToString() + ".txt");
    File.WriteAllText(outputPath, sb.ToString());
}

image.png (50.2 KB)
TestHtml.zip (1.6 KB)

Farhan.Raza · April 16, 2018, 4:43pm

@marchuber

Please change the return statement in the overridden AcceptNode method as shown below so that the content of <script> tags is ignored as well.

return (n.ParentElement.TagName == "STYLE" || n.ParentElement.TagName == "SCRIPT" ? FILTER_REJECT : FILTER_ACCEPT);
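For reference, the same idea can be written as a complete filter class. This is only a sketch: the class name IgnoredContentFilter, the IsIgnored helper, and the tag list are illustrative and not part of Aspose.HTML, while NodeFilter, AcceptNode, ParentElement, TagName, FILTER_REJECT, and FILTER_ACCEPT are the members already used in this thread. The null check is a defensive assumption for nodes that have no parent element.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Html.Dom;
using Aspose.Html.Dom.Traversal.Filters;

/// <summary>
/// Hypothetical replacement for StyleFilter: rejects text nodes whose parent
/// element is one of the tags listed in IgnoredTags.
/// </summary>
public class IgnoredContentFilter : NodeFilter
{
    // Tags whose text should not appear in the output (illustrative list).
    private static readonly HashSet<string> IgnoredTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "STYLE", "SCRIPT" };

    // Pure helper so the rule can be checked without loading a document.
    public static bool IsIgnored(string tagName)
    {
        return tagName != null && IgnoredTags.Contains(tagName);
    }

    public override short AcceptNode(Node n)
    {
        var parent = n.ParentElement;
        // Nodes without a parent element are accepted instead of throwing.
        return (parent != null && IsIgnored(parent.TagName)) ? FILTER_REJECT : FILTER_ACCEPT;
    }
}
```

It would be passed to CreateNodeIterator in place of new StyleFilter(). If the // lines in the screenshot are JavaScript comments inside a <script> element, they should disappear together with the script content; HTML comments are separate comment nodes, which an iterator created with NodeFilter.SHOW_TEXT does not visit.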

We hope this will be helpful. Please feel free to contact us if you need any further assistance.