Extracting text from the main story excluding headers and footers

romank · December 8, 2004, 5:26pm

I have downloaded your product and am evaluating it for use with a customer's application. They do medical transcription and need the character count, including spaces, of each document. Currently they open each document in Word and read the "Characters (with spaces)" value under Tools -> Word Count on the menu. We are looking to automate this process in a new application.

The problem I am having is that when I open a document and call the Text property of the Range object, the page header for each page is inserted into the text. These page headers are not counted by Word's character count, but they show up when I read the document with your product. Is there any way to get just the body, without any text from the page headers?

romank · December 8, 2004, 5:28pm

You can create a simple class that implements the IDocumentVisitor interface.

Provide implementations of the StoryStart and StoryEnd methods. In these methods, detect the start and end of the main story and set a boolean flag that controls whether you are counting text or not.

Provide an implementation of RunOfText and accumulate the text, or just its length, each time it is called with a piece of text found in the document.

romank · December 26, 2004, 2:11am

Sorry, I have been away from this project for a few weeks, helping out a different client. I don't understand your solution. I really cannot find much documentation on implementing the interfaces you mention, but more importantly, I don't understand how to identify the text to exclude. The headers of the files are different letterheads for different doctors, so I don't know in advance what text is in any given document. Word can somehow tell the difference, because it grays out the headers and does not count them. Can you provide a bit more information?

Brad

romank · December 26, 2004, 2:16am

Hi Brad,

You are right, there is not much documentation about IDocumentVisitor yet, as it is a recent addition to the API and is still being worked on. But it is quite easy to use; see the example below.

/// <summary>
/// Sample class that shows how to implement IDocumentVisitor
/// to extract text from the document body excluding headers and footers.
/// </summary>
public class MainStoryExtractingVisitor : IDocumentVisitor
{
    void IDocumentVisitor.DocumentStart(Document doc)
    {
        extractedText = new System.Text.StringBuilder();
    }

    void IDocumentVisitor.DocumentEnd()
    {
        //Do nothing.
    }

    void IDocumentVisitor.SectionStart(PageSetup pageSetup)
    {
        //Do nothing.
    }

    void IDocumentVisitor.SectionEnd()
    {
        //Do nothing.
    }

    void IDocumentVisitor.StoryStart(StoryType storyType)
    {
        isExtracting = (storyType == StoryType.MainTextStory);
    }

    void IDocumentVisitor.StoryEnd()
    {
        isExtracting = false;
    }

    void IDocumentVisitor.ParagraphStart(ParagraphFormat paragraphFormat)
    {
        //Do nothing.
    }

    void IDocumentVisitor.ParagraphEnd()
    {
        //Do nothing.
    }

    void IDocumentVisitor.RunOfText(Font font, string text)
    {
        if (isExtracting)
            extractedText.Append(text);
    }

    public string GetExtractedText()
    {
        return extractedText.ToString();
    }

    private bool isExtracting;
    private System.Text.StringBuilder extractedText;
}

public void ExtractMainStory()
{
    Document doc = TestUtil.Open(@"C:\MyDoc.doc");
    MainStoryExtractingVisitor visitor = new MainStoryExtractingVisitor();
    doc.Accept(visitor);
    Console.WriteLine(visitor.GetExtractedText());
}
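
If only the character count is needed rather than the text itself, the same flag can drive a counter instead of a StringBuilder. The following is a minimal sketch of that idea, not the library's API: StoryKind and IStoryTextVisitor are hypothetical stand-ins declared here only so the sample compiles on its own, and Main simulates the callbacks that a real document traversal would make. Whether the total matches Word's "Characters (with spaces)" figure exactly depends on how paragraph marks and special characters are reported in the runs, which this sketch does not address.

```csharp
using System;

// Hypothetical stand-ins for the library's story types and visitor callbacks.
// They exist only so this sketch compiles without the library.
public enum StoryKind { MainText, Header, Footer }

public interface IStoryTextVisitor
{
    void StoryStart(StoryKind kind);
    void StoryEnd();
    void RunOfText(string text);
}

// Counts characters (spaces included) that belong to the main story only.
public class MainStoryCharCounter : IStoryTextVisitor
{
    // True only while the traversal is inside the main text story.
    private bool isCounting;

    public int CharCount { get; private set; }

    public void StoryStart(StoryKind kind)
    {
        isCounting = (kind == StoryKind.MainText);
    }

    public void StoryEnd()
    {
        isCounting = false;
    }

    public void RunOfText(string text)
    {
        if (isCounting)
            CharCount += text.Length;
    }
}

public static class Program
{
    public static void Main()
    {
        MainStoryCharCounter counter = new MainStoryCharCounter();

        // Simulated header story: ignored by the counter.
        counter.StoryStart(StoryKind.Header);
        counter.RunOfText("Dr. Smith Clinic");
        counter.StoryEnd();

        // Simulated main story: every character, including spaces, is counted.
        counter.StoryStart(StoryKind.MainText);
        counter.RunOfText("Patient seen ");
        counter.RunOfText("today.");
        counter.StoryEnd();

        Console.WriteLine(counter.CharCount);
    }
}
```

In the real visitor, the same three methods would be the IDocumentVisitor.StoryStart, StoryEnd and RunOfText implementations shown above, with the counter replacing extractedText.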