Convert different file types to plain text for indexing in the Lucene search engine

What is the easiest and fastest way (code) to get the plain text (.txt) from these file types:

  • Word (.doc, .dot, .docx, .dotx, .docm, .dotm, .rtf, .wpd)
  • Excel (.xls, .xlsx, .xlsm, .xlsb, .xlt, .xltx, .xltm, .csv)
  • PowerPoint (.ppt, .pptx, .pptm, .pps, .ppsx, .ppsm, .pot, .potx, .potm)
  • Visio (.vsd, .vsdx, .vsdm, .svg)
  • Publisher (.pub)
  • Outlook (.msg, .vcf, .ics)
  • Project (.mpp)
  • OpenOffice (.odt, .odp, .ods)
  • HTML
  • XML

We need the plain text so that we can index it in Lucene for our search engine.

@marchuber,

Thank you for your inquiry. You can convert almost all of the formats you mentioned to text using Aspose APIs. Some documents can be converted directly with a single Aspose component; others need a combination of components. Details follow.

You can convert Excel files to TXT or CSV with Aspose.Cells to obtain the plain text. See the documentation article "Saving Workbook to Text or CSV Format" for details.

You can convert Word documents to TXT using Aspose.Words. A sample is given below:

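// Requires: using Aspose.Words; using Aspose.Words.Saving;
Document doc = new Document("in.docx");
TxtSaveOptions opts = new TxtSaveOptions();
// MyDir is assumed to be a string holding the output folder path.
doc.Save(MyDir + @"out.txt", opts);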

Similarly, you can convert HTML to TXT using Aspose.Words.

You can use the following code to extract plain text from a Visio file with Aspose.Diagram:

// 'shape' is an Aspose.Diagram.Shape taken from a page of the loaded diagram.
FormatTxtCollection collection = shape.Text.Value;
StringBuilder plainText = new StringBuilder();
foreach (object text in collection)
{
    if (text is Txt)
    {
        plainText.Append((text as Txt).Value);
    }
}
Console.WriteLine(plainText.ToString());

You can convert an MSG file to TXT using a combination of Aspose.Email and Aspose.Words. The following code snippet shows how:

Aspose.Email.MailMessage message = Aspose.Email.MailMessage.Load("inputfile");
message.TimeZoneOffset = TimeZone.CurrentTimeZone.GetUtcOffset(message.Date);

Aspose.Email.MhtSaveOptions mhtSaveOptions = new Aspose.Email.MhtSaveOptions
{
    MhtFormatOptions = Aspose.Email.MhtFormatOptions.WriteHeader | Aspose.Email.MhtFormatOptions.WriteCompleteEmailAddress
};
mhtSaveOptions.SkipInlineImages = false;

// Export to MHT format on disk
message.Save(@"_msg.mht", Aspose.Email.SaveOptions.DefaultMhtml);

// Export to MHT format in memory so Aspose.Words can load it
System.IO.MemoryStream msgStream = new System.IO.MemoryStream();
message.Save(msgStream, mhtSaveOptions);
msgStream.Position = 0;

var options = new Aspose.Words.LoadOptions()
{
    LoadFormat = Aspose.Words.LoadFormat.Mhtml
};

// Load the MHT stream and save it as plain text
var document = new Aspose.Words.Document(msgStream, options);
document.Save(@"_msg.txt");

The attached code does not work in .NET. Could you please help me? I need the complete plain text from all Visio pages.

Image.png (11.5 KB)

1.) How can we exclude the master pages in the code below?
2.) Is there a faster way with the .NET components (GetPresentationText)? Do you have an example?

// Extract text from pptx
Slides.Presentation pptxPresentation = new Slides.Presentation("C:/SVN_Incite/Prototypes/AsposePdfConversion/Aspose/Files/TestEinfach.pptx");
ITextFrame[] textFramesPPTX = Aspose.Slides.Util.SlideUtil.GetAllTextFrames(pptxPresentation, true);
for (int i = 0; i < textFramesPPTX.Length; i++)
{
    foreach (IParagraph para in textFramesPPTX[i].Paragraphs)
    {
        foreach (IPortion port in para.Portions)
        {
            Console.WriteLine(port.Text);
        }
    }
}

@marchuber,

I have reviewed your sample code, and the approach you are already using is in fact the fastest one. The alternative is to traverse every slide and each of its shapes to find the text frames, which takes more time than the approach you have adopted.

@marchuber,

You can modify your code as follows to make it work for Visio.

CODE:

Diagram.Diagram objDiagram = new Diagram.Diagram("infilepath");
// Declared outside the loop so the text of all shapes accumulates and is still in scope for the final WriteLine.
StringBuilder plainText = new StringBuilder();
foreach (Aspose.Diagram.Shape shape in diagram.Pages[0].Shapes)
{
    FormatTxtCollection collection = shape.Text.Value;

    foreach (object text in collection)
    {
        if (text is Txt)
        {
            plainText.Append((text as Txt).Value);
        }
    }
}
Console.WriteLine(plainText.ToString());

Thanks for your reply. The following question is still open:

1.) How can we exclude the master pages from PowerPoint?

Thanks a lot, it works. I only had to change diagram.Pages… to objDiagram.Pages…

How can we get the text from all Visio pages?
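The thread does not answer this question directly. As a minimal sketch, and assuming that `Diagram.Pages` can be enumerated as `Page` objects in the same way that `Pages[0].Shapes` is enumerated above, the single-page loop could be wrapped in an outer loop over all pages. The class name `VisioTextExtractor` is hypothetical, and "infilepath" is the same placeholder used in the earlier sample.

```csharp
using System;
using System.Text;
using Aspose.Diagram;

class VisioTextExtractor
{
    static void Main()
    {
        // "infilepath" is a placeholder for the path of the Visio file.
        Diagram diagram = new Diagram("infilepath");
        StringBuilder plainText = new StringBuilder();

        // Visit every page instead of only Pages[0].
        foreach (Page page in diagram.Pages)
        {
            foreach (Shape shape in page.Shapes)
            {
                // Keep only the plain-text runs, as in the single-page sample.
                foreach (object text in shape.Text.Value)
                {
                    if (text is Txt txt)
                    {
                        plainText.Append(txt.Value).Append(' ');
                    }
                }
            }
        }

        Console.WriteLine(plainText.ToString());
    }
}
```

This sketch only visits the shapes listed directly on each page; shapes nested inside groups may need an additional pass over each shape's own child collection.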

@marchuber,

Could you please elaborate on your requirement for excluding the master pages from PowerPoint? Please also share the source presentation and the desired output, along with the requirement details, so that we can help you. If you want to remove master slides, Aspose.Slides can remove a master slide, or all unused master slides (masters not used by any slide), from a presentation. The Aspose.Slides API reference describes both operations.

We would like to have only the text from the slides and not the text from the master pages. In the attached sample we simply want to extract the text “Page1 ö ä ü” and not “Mastertitelformat bearbeiten…”.

Code:
// .txt
dstStream = new MemoryStream();
ITextFrame[] textFramesPPTX = Slides.Util.SlideUtil.GetAllTextFrames(presentation, true);
string text = "";

outputPath = Path.Combine(outputRootPath, Path.GetFileNameWithoutExtension(path) + " " + DateTime.Now.Ticks.ToString() + ".txt");
for (int i = 0; i < textFramesPPTX.Length; i++)
{
    foreach (IParagraph para in textFramesPPTX[i].Paragraphs)
    {
        foreach (IPortion port in para.Portions)
        {
            // Skip empty portions and separate the rest with a space
            if (!string.IsNullOrEmpty(port.Text))
            {
                text += port.Text + " ";
            }
        }
    }
}
File.WriteAllText(outputPath, text);

TextPptx.zip (24.4 KB)
Image.png (171.6 KB)

@marchuber,

I have reviewed your requirements. Please set the second parameter of your SlideUtil.GetAllTextFrames call to false instead of true.

The documentation for this method explains that passing false as the second parameter means no text is extracted from master slides. I hope this helps.
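For illustration, the suggested change could look like the sketch below. It reuses the `SlideUtil.GetAllTextFrames` call from the question with `false` as the second argument; the class name `SlideTextExtractor` and the relative file name are assumptions made for this example.

```csharp
using System;
using System.Text;
using Aspose.Slides;
using Aspose.Slides.Util;

class SlideTextExtractor
{
    static void Main()
    {
        // "TestEinfach.pptx" stands in for the path of the input presentation.
        using (Presentation presentation = new Presentation("TestEinfach.pptx"))
        {
            // false = do not collect text frames from master slides.
            ITextFrame[] textFrames = SlideUtil.GetAllTextFrames(presentation, false);

            StringBuilder text = new StringBuilder();
            foreach (ITextFrame frame in textFrames)
            {
                foreach (IParagraph para in frame.Paragraphs)
                {
                    foreach (IPortion port in para.Portions)
                    {
                        // Skip empty portions and separate the rest with a space.
                        if (!string.IsNullOrEmpty(port.Text))
                        {
                            text.Append(port.Text).Append(' ');
                        }
                    }
                }
            }

            Console.WriteLine(text.ToString());
        }
    }
}
```

A `StringBuilder` is used here instead of repeated string concatenation, which tends to matter for presentations with many text portions.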

Works perfectly :o). Thanks a lot.

@marchuber,

It is good to know that things are working on your end. Thank you for the update.