Extract PlainText From HTML

tparassin · July 7, 2020, 7:35am

Hello

We want to extract the text from emails in HTML format.
It seems that when the document has a lot of images, the conversion is very slow.
Is there a way to skip loading the images, similar to SkipInlineImages in Aspose.Email.SaveOptions.DefaultMhtml?

Thank you

Our conversion method:
public static int ExtractText(string inputFile, string outputFile, ref string errmsg)
{
    try
    {
        // The specified file can be opened in MS Word when opening it with Aspose.Words here.
        // Open the document.
        FileStream docStream = new FileStream(inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        try
        {
            Document doc = new Document(docStream);
            doc.Save(outputFile, SaveFormat.Text);
            doc.Cleanup();
            return (0);
        }
        finally
        {
            // Close stream
            docStream.Close();
            docStream.Dispose();
            GC.Collect();
        }
    }
    catch (Exception ex)
    {
        errmsg = ex.Message + " | " + ex.StackTrace;
        GC.Collect();
        return (1);
    }
}

tahir.manzoor · July 7, 2020, 4:56pm

@tparassin

Could you please ZIP and attach your input MHTML/HTML and expected output? We will then provide more information about your query, along with code.

tparassin · July 8, 2020, 6:31am

documents.zip (3.7 KB)
Here are two examples. The output produced by Aspose is good, but the conversion takes time, and I think that is because of the pictures in the mail.
As we just want the text, I would like to know if there is a way to ignore these images to speed up the conversion.

If there are no images, the conversion is very quick.

Thanks.

tahir.manzoor · July 8, 2020, 12:31pm

@tparassin

Please use the following code example to export the HTML to TXT and ignore the images.

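// Register a callback that is invoked for every external resource the HTML references.
Aspose.Words.LoadOptions loadOptions = new Aspose.Words.LoadOptions();
loadOptions.ResourceLoadingCallback = new ImageHandler();

// MyDir is the folder that holds the input file.
Document doc = new Document(MyDir + "1220917.html", loadOptions);
// The save format is inferred from the .txt extension.
doc.Save(MyDir + "20.7.txt");

// Declare this class inside the class that contains the code above.
private class ImageHandler : IResourceLoadingCallback
{
    public ResourceLoadingAction ResourceLoading(ResourceLoadingArgs args)
    {
        // Log the URI of the resource, then tell Aspose.Words not to load it.
        Console.WriteLine(args.OriginalUri);
        return ResourceLoadingAction.Skip;
    }
}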
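
For reference, here is a rough sketch of how the callback above might be combined with the ExtractText method from the first post. It is untested against the attached documents and rests on a few assumptions: the class names HtmlTextExtractor and SkipImagesCallback are made up for illustration, the check on args.ResourceType (so that only images are skipped and other resources load as usual) is an optional refinement that is not in the answer above, and the namespace of LoadOptions depends on the Aspose.Words version, which is why both using directives are listed.

```csharp
using System;
using System.IO;
using Aspose.Words;
using Aspose.Words.Loading;

// Hypothetical wrapper class; the name is illustrative only.
public static class HtmlTextExtractor
{
    // Skips image resources; every other resource type is loaded as usual.
    // (Assumption: skipping only images is enough to remove the slowdown.)
    private class SkipImagesCallback : IResourceLoadingCallback
    {
        public ResourceLoadingAction ResourceLoading(ResourceLoadingArgs args)
        {
            return args.ResourceType == ResourceType.Image
                ? ResourceLoadingAction.Skip
                : ResourceLoadingAction.Default;
        }
    }

    // Same signature and return codes as the method in the first post:
    // 0 on success, 1 on failure with the details written to errmsg.
    public static int ExtractText(string inputFile, string outputFile, ref string errmsg)
    {
        try
        {
            LoadOptions loadOptions = new LoadOptions();
            loadOptions.ResourceLoadingCallback = new SkipImagesCallback();

            // The using block closes and disposes the stream, even on error.
            using (FileStream docStream = new FileStream(
                inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                Document doc = new Document(docStream, loadOptions);
                doc.Save(outputFile, SaveFormat.Text);
            }
            return 0;
        }
        catch (Exception ex)
        {
            errmsg = ex.Message + " | " + ex.StackTrace;
            return 1;
        }
    }
}
```

The explicit GC.Collect() calls from the original method are left out here on the assumption that disposing the stream is sufficient; they can be put back if memory pressure turns out to be a problem in practice.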