Text from Word document


#1

We are in the IT placement business.

We are trying to automate the processing of incoming resumes sent in by people applying for jobs.

We need to do two things for now...

1: Try to programmatically get a valid first and last name of the applicant by scraping it out of their resume. (I have a fairly successful method worked out.)

2: Pull out the resume text and convert it to an RTF document to store in a database. (At some point in the future we will stop doing this and just store the Word document.)

When I use the GetText() method on the Document object or on individual Paragraph objects, I get a lot of non-text items. I end up with header, footer, and embedded graphic items.

Is there a way to just get the TEXT?

Kyle!


#2

After reviewing more of the documentation, it looks like I'll have to implement the DocumentVisitor class.

Kyle!


#3

Hi,

Thank you for considering Aspose.

DocumentVisitor actually just enumerates all child nodes of the node that accepts it and then calls the corresponding methods. So you can either use it or work directly with the document nodes. To be honest, I haven't fully understood why GetText doesn't suit you. Please provide more details of the problem.
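
To illustrate the idea of a visitor that is shown every child node in turn, here is a minimal sketch in Python. It does not use the Aspose API: the Node class, the visit_<kind> naming scheme, and the RunCollector visitor are all hypothetical names made up for this example.

```python
class Node:
    """Hypothetical document node: a kind, optional text, and child nodes."""

    def __init__(self, kind, text="", children=()):
        self.kind = kind
        self.text = text
        self.children = list(children)

    def accept(self, visitor):
        # Call visit_<kind> if the visitor defines it, then recurse into children.
        handler = getattr(visitor, "visit_" + self.kind, None)
        if handler is not None:
            handler(self)
        for child in self.children:
            child.accept(visitor)


class RunCollector:
    """Hypothetical visitor that records the text of every run it is shown."""

    def __init__(self):
        self.parts = []

    def visit_run(self, node):
        self.parts.append(node.text)


# A toy document: one section with a header and a body (assumed structure).
doc = Node("document", children=[
    Node("section", children=[
        Node("header", children=[Node("paragraph", children=[Node("run", "Page 2")])]),
        Node("body", children=[Node("paragraph", children=[Node("run", "SUMMARY:")])]),
    ]),
])

collector = RunCollector()
doc.accept(collector)
print(collector.parts)  # ['Page 2', 'SUMMARY:']
```

Because the traversal reaches every node, a visitor that collects all runs also picks up runs that sit inside headers and footers unless it filters them out itself.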

Regarding export to RTF, it is unfortunately not supported at the moment.


#4

Here is a sample from one of the resumes....

BABAK AALEMANSOUR \rPage PAGE 2\r\r\rEvaluation only, garbage text in the document is part of the evaluation watermark. Created with Aspose.Word. Copyright 2003-2005 Aspose Pty Ltd.\r?^?xf EMBED CPaint5 \r\rPRINCETON INFORMATION\r\rBABAK AALEMANSOUR\r\rSUMMARY:\tComputer Professional with five years experience in Desktop Support, installation, configuration and maintenance in Windows NT and Novell environments.\r\rSKILLS:\tNetworks:\tEthernet, Token Ring, Dialup Network \r

You will notice that the first element is a page header, then I get your Eval message...which I know will go away when I license the product, then there is some garbage, then I get info about an embedded image, then finally I get the start of the actual text of the document.

I only want the text from the body of the document...I don't want all of the other stuff.

I guess we only want the text from Document/Section/Body/Paragraph/Runs and need to ignore headers, footers, etc.
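
A rough sketch of that filtering, written in Python against a made-up tree rather than the Aspose object model: each node is a hypothetical (kind, payload) tuple, the kind names and the SKIPPED_KINDS set are assumptions for illustration, and adding a newline after each paragraph is just one possible choice of separator.

```python
# Subtrees whose text should not appear in the output (assumed kind names).
SKIPPED_KINDS = {"header", "footer", "shape"}


def body_text(node):
    """Return the text of all runs under `node`, ignoring skipped subtrees."""
    kind, payload = node
    if kind in SKIPPED_KINDS:
        return ""  # drop the whole subtree, including any runs inside it
    if kind == "run":
        return payload  # for a run, the payload is its text
    # For every other kind, the payload is a list of child nodes.
    text = "".join(body_text(child) for child in payload)
    if kind == "paragraph":
        text += "\n"  # end each paragraph with a line break
    return text


doc = ("document", [
    ("section", [
        ("header", [("paragraph", [("run", "Page 2")])]),
        ("body", [
            ("paragraph", [("run", "BABAK AALEMANSOUR")]),
            ("paragraph", [("shape", []), ("run", "SUMMARY:")]),
        ]),
    ]),
])

print(repr(body_text(doc)))  # 'BABAK AALEMANSOUR\nSUMMARY:\n'
```

The header paragraph and the embedded shape contribute nothing, so only the runs under the body survive.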

Kyle!


#5

I have implemented DocumentVisitor and I'm visiting runs using run.GetText() to store the text of all of the runs in a StringBuilder object.

I get slightly different results when processing the same file multiple times.

I sometimes get some garbage characters embedded in the text.

Kyle!


#6

You can also try using Document.Save(fileName, SaveFormat.FormatText); there are overloads to save into a file or a stream. This method does just that: it uses a DocumentVisitor internally to strip out everything apart from text characters.

I've just added an article for this question, see http://www.aspose.com/Wiki/default.aspx/Aspose.Word/Home_Page.html, Extracting Text Only.

Garbage text is part of the evaluation watermark, see Evaluation Version and Licensing in the same link above.


#7

We are now licensed, so the random characters are gone.

Document.Save(stream, SaveFormat.Text) is OK, but I would like to not get page headers or footers.

Seems like I'll have to rework my original DocumentVisitor object.

Kyle!