Extract PlainText From HTML

Hello

We want to extract the text from emails in HTML format.
It seems that when the document has a lot of images, the conversion is very slow.
Is there a way to skip loading the images, like the SkipInlineImages option in Aspose.Email.SaveOptions.DefaultMhtml?

Thank you

Our conversion method:
public static int ExtractText(string inputFile, string outputFile, ref string errmsg)
{
    try
    {
        // The specified file can be opened in MS Word when opening it with Aspose.Words here.
        // Open the document.
        FileStream docStream = new FileStream(inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        try
        {
            Document doc = new Document(docStream);
            doc.Save(outputFile, SaveFormat.Text);
            doc.Cleanup();
            return (0);
        }
        finally
        {
            // Close stream
            docStream.Close();
            docStream.Dispose();
            GC.Collect();
        }
    }
    catch (Exception ex)
    {
        errmsg = ex.Message + " | " + ex.StackTrace;
        GC.Collect();
        return (1);
    }
}

@tparassin

Could you please ZIP and attach your input MHTML/HTML and expected output? We will then provide you more information about your query along with code.

documents.zip (3.7 KB)
Here are two examples. The output produced by Aspose is good, but the conversion takes time, and I think that is because of the pictures in the mail.
As we just want the text, I would like to know whether there is a way to ignore these images to speed up the conversion.

If there is no image the conversion is very quick.

Thanks.

@tparassin

Please use the following code example to export the HTML to TXT and ignore the images.

Aspose.Words.LoadOptions loadOptions = new Aspose.Words.LoadOptions();
loadOptions.ResourceLoadingCallback = new ImageHandler();
Document doc = new Document(MyDir + "1220917.html", loadOptions);
doc.Save(MyDir + "20.7.txt");

private class ImageHandler : IResourceLoadingCallback
{
    public ResourceLoadingAction ResourceLoading(ResourceLoadingArgs args)
    {
        Console.WriteLine(args.OriginalUri);
        return ResourceLoadingAction.Skip;
    }
}
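
For readers who want to apply the callback to the stream-based ExtractText method from the question, the two snippets might be combined roughly as follows. This is a minimal sketch, not code from the thread: the class names HtmlTextExtractor and SkipImagesCallback are made up for illustration, and the callback is assumed to skip only resources reported as ResourceType.Image while leaving other resources (such as style sheets) to default handling. Depending on the Aspose.Words version, LoadOptions lives in either the Aspose.Words or the Aspose.Words.Loading namespace, so both are imported here.

```csharp
using System;
using System.IO;
using Aspose.Words;
using Aspose.Words.Loading;

// Hypothetical wrapper class; the name is illustrative only.
public static class HtmlTextExtractor
{
    // Skips image resources only; everything else is loaded as usual.
    private class SkipImagesCallback : IResourceLoadingCallback
    {
        public ResourceLoadingAction ResourceLoading(ResourceLoadingArgs args)
        {
            return args.ResourceType == ResourceType.Image
                ? ResourceLoadingAction.Skip
                : ResourceLoadingAction.Default;
        }
    }

    // Same signature and return codes as the method in the question:
    // 0 on success, 1 on failure with details in errmsg.
    public static int ExtractText(string inputFile, string outputFile, ref string errmsg)
    {
        try
        {
            LoadOptions loadOptions = new LoadOptions();
            loadOptions.ResourceLoadingCallback = new SkipImagesCallback();

            // The using block closes and disposes the stream, even on error.
            using (FileStream docStream = new FileStream(
                inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                Document doc = new Document(docStream, loadOptions);
                doc.Save(outputFile, SaveFormat.Text);
            }
            return 0;
        }
        catch (Exception ex)
        {
            errmsg = ex.Message + " | " + ex.StackTrace;
            return 1;
        }
    }
}
```

The using block stands in for the explicit Close/Dispose calls of the original method, and the forced GC.Collect calls are left out of this sketch; whether they are needed in your application is a separate question from image loading.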