Aspose.Words: extracting text with table, header and footer information

uzair · May 30, 2016, 8:48am

I want to extract text from Microsoft Office documents, and I need to get the text of headers, footers and tables separately. For this I wrote a custom class that extends the DocumentVisitor class and overrides the methods that capture text from headers, footers and tables. When I compare the text from my custom class with the plain text from the Document.Save method, the two are not the same. If the file has a bulleted list, the bullets are missing from the text produced by my custom class.

For example, look at the attached files. If I convert t1doc.doc to text using the Document.Save method, I get the t1doc.txt file. The header, footer and table text is present in this file, but nothing in the file indicates which part is header text, footer text or table text.

To solve this I wrote a custom class extending the DocumentVisitor class, but it has the problems I mentioned earlier.

Does anyone know a better way to get the text together with the header, footer and table information?

tahir.manzoor · May 31, 2016, 7:13am

Hi Uzair,

Thanks for your inquiry. You can extract the text of the header, footer and tables separately using the following code example. Hope this helps you. Please let us know if you have any more queries.

If you still face a problem, please share your expected output text file here for our reference. We will then provide you with more information about your query.

Document doc = new Document(MyDir + "t1doc.doc");

// Plain-text export options; PreserveTableLayout keeps the table layout in the text output.
TxtSaveOptions options = new TxtSaveOptions();
options.PreserveTableLayout = true;

string headerText = "", footerText = "", tableText = "";
foreach (Section section in doc.Sections)
{
    // Only the primary header and footer are collected here;
    // first-page and even-page variants are skipped.
    foreach (HeaderFooter headerFooter in section.HeadersFooters)
    {
        if (headerFooter.HeaderFooterType == HeaderFooterType.HeaderPrimary)
        {
            headerText += headerFooter.ToString(options);
        }
        else if (headerFooter.HeaderFooterType == HeaderFooterType.FooterPrimary)
        {
            footerText += headerFooter.ToString(options);
        }
    }

    // Tables that are direct children of the section body.
    foreach (Table table in section.Body.Tables)
    {
        tableText += table.ToString(options);
    }
}

uzair · June 1, 2016, 6:30am

Hi,

Thanks for the reply. The code you shared works fine, but my requirement goes beyond that.

For example, consider the Document.Save method. If I convert a document to plain text using this method, I get a simple plain text file. If the document has tables, headers, footers and normal paragraphs, that plain text file includes the complete text. But it doesn’t indicate which text belongs to the header, which to the footer and which to the tables (or rows or cells).

So what I am trying to do is get the complete text content of the document together with extra information such as header, footer, table and so on. The order of the text should be the same as its order in the document. How can we achieve this with Aspose.Words?

Waiting for the reply.

uzair · June 1, 2016, 6:51am

Actually, to achieve this I used the example code ExtractContentUsingDocumentVisitor. I modified the class and overrode some more methods such as VisitHeaderFooterStart, VisitTableStart and VisitRowStart. I am writing the output as an XML file that has tags for header, footer, table and so on.

I have attached some files with this post.
Document.docx is the input document.
Document.docx_XML.xml is the output XML I get from my class, which extends DocumentVisitor.
Document.docx_TextSave.txt is the output of the Document.Save method.

If you look closely, there are a few problems:

The first 3 lines have numbering, which is extracted in the text file, but the XML file doesn't have the numbering. It only has the text from those lines.
The email address is extracted normally in the text file, but in the XML file it has some extra text that is not present in the document.

I need to solve these problems and get the XML output to match the text output exactly. Any help?

tahir.manzoor · June 2, 2016, 3:24am

Hi Uzair,

Thanks for your inquiry. Please create a standalone console application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing. We will investigate the issue on our side and provide you more information.

uzair · June 2, 2016, 3:58am

Hi,

I have attached the console application. Just add a reference to the Aspose.Words DLL and you can run it. I used VS 2013. Look at the folders in bin\Debug:
TestCase: contains the test file Document.docx
xmls: the generated XML is stored in this folder with the same file name as the input file
texts: the text file generated by the doc.Save method is saved in this folder
textsFromXML: the text content of the XML is written to a text file in this folder

Finally, compare the 2 text files in the ‘texts’ and ‘textsFromXML’ folders. You will see how they differ. I want the text in the XML to be exactly the same as the text in the file produced by the doc.Save method.
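
A comparison like the one described above can also be scripted rather than done by eye. The following is only a minimal sketch and is not part of the attached project: the class name TextDiff and its methods are hypothetical, and the two paths passed in are placeholders for the files in the 'texts' and 'textsFromXML' folders.

```csharp
using System;
using System.IO;

public static class TextDiff
{
    // Returns the 1-based number of the first line that differs,
    // or 0 when both inputs contain exactly the same lines.
    public static int FirstDifferingLine(string[] expected, string[] actual)
    {
        int common = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < common; i++)
        {
            if (!string.Equals(expected[i], actual[i], StringComparison.Ordinal))
            {
                return i + 1;
            }
        }
        // All shared lines match; a length mismatch means one file has extra lines.
        return expected.Length == actual.Length ? 0 : common + 1;
    }

    // Convenience wrapper that reads both files from disk (paths are placeholders).
    public static int FirstDifferingLineInFiles(string expectedPath, string actualPath)
    {
        return FirstDifferingLine(File.ReadAllLines(expectedPath), File.ReadAllLines(actualPath));
    }
}
```

Calling TextDiff.FirstDifferingLineInFiles with the doc.Save output as the first argument and the text rebuilt from the XML as the second would point at the first line where the two diverge, for example a line whose list numbering is missing.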

tahir.manzoor · June 3, 2016, 4:32am

Hi Uzair,

Thanks for sharing the details. Please make the following changes to your code to get the desired output.

1. Call the Document.UpdateListLabels method after loading the document into the Aspose.Words DOM.
Document doc = new Document(file);
doc.UpdateListLabels();
2. Remove VisitRun from the code.
3. Use the following modified code in VisitParagraphStart.

We have attached ExtractContentUsingDocumentVisitor.cs with this post. Hope this helps you.

public override VisitorAction VisitParagraphStart(Paragraph paragraph)
{
    mBuilder.Append("*** Para Started ***\r\n");
    writer.WriteStartElement("Para");
    paraText = new StringBuilder();

    // Export the whole paragraph as plain text once and reuse the result,
    // instead of collecting the text run by run in VisitRun.
    string text = paragraph.ToString(SaveFormat.Text);
    Console.WriteLine(text);
    AppendText(text);
    paraText.Append(RemoveInvalidXMLChars(text));
    return VisitorAction.Continue;
}
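
The RemoveInvalidXMLChars helper called above lives in the attached file and is not shown in this thread. For readers without the attachment, the following is a minimal sketch of what such a helper might look like, not the actual attached implementation. The class name XmlTextCleaner and the method body are assumptions; the only library call it relies on is XmlConvert.IsXmlChar from System.Xml.

```csharp
using System.Text;
using System.Xml;

public static class XmlTextCleaner
{
    // Hypothetical stand-in for RemoveInvalidXMLChars: drops every character
    // that is not allowed in an XML 1.0 document (for example form feed or bell).
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        var cleaned = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes a character above U+FFFF; keep both halves.
                cleaned.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                // Tab, line feed, carriage return and ordinary characters pass through.
                cleaned.Append(c);
            }
            // Anything else (other control characters, lone surrogates) is skipped.
        }
        return cleaned.ToString();
    }
}
```

Filtering the text this way before it reaches the XML writer avoids exceptions on control characters, at the cost of silently dropping them, so the text rebuilt from the XML may differ from the doc.Save output wherever such characters occur.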