Free Support Forum - aspose.com

Reading Microsoft Word

Can you please tell me how I can read data from a Microsoft Word document?

Hi

Thank you for your interest in Aspose.Words. To extract text from a Word document, I think the following article could be useful for you:

http://www.aspose.com/documentation/file-format-components/aspose.words-for-.net-and-java/retrieving-plain-text.html

You can also find many examples in the Aspose.Words programmer's guide:

http://www.aspose.com/documentation/file-format-components/aspose.words-for-.net-and-java/programmers-guide.html

Hope this helps. Please let me know in case of any issue.


Your examples did not give me what I am looking for. All I need is:

Employee ID: 00052 // this is in the Word document

I need to search for "Employee ID:" and return 00052 as a string.

How can I accomplish this? Please provide a code example. Thank you.

Hi,


Thanks for your inquiry. Please note that when you load a Word document into Aspose.Words, a Document Object Model (DOM) is generated in memory. By using the classes of the Aspose.Words DOM, you can obtain detailed programmatic access to document elements and formatting. The DOM also allows you to programmatically read, manipulate and modify the content and formatting of a Word document. For more information on this topic, please visit the following link:
http://www.aspose.com/documentation/.net-components/aspose.words-for-.net/object-model-overview.html

Moreover, could you please attach your input Word document here for testing? I will check the structure of your document and provide you more information.

Best Regards,

Thank you for your help. I am attaching the file as requested.

This is a file created in MS Word, which I first load into Aspose.Words. What I need to be able to do is read Name and ENum. Please provide sample code if possible.

Thank You.

Hi,


Thanks for your inquiry. First of all, please note that DocumentExplorer is a very useful tool which easily enables us to see the entire document structure. You can find DocumentExplorer in the folder where you installed Aspose.Words e.g. C:\Program Files (x86)\Aspose\Aspose.Words for .NET\Demos\CSharp\DocumentExplorer\bin\DocumentExplorer.exe. Below is the DOM structure of your document as viewed with DocumentExplorer:



Moreover, you can read Name and Enum values by using the following code snippet:

Document doc = new Document(@"c:\test\2011+Interview+Report.doc");

NodeCollection paragraphs = doc.FirstSection.Body.GetChildNodes(NodeType.Paragraph, true);

Paragraph namePara = paragraphs[3] as Paragraph;
Paragraph enumPara = paragraphs[4] as Paragraph;

Console.WriteLine(namePara.Runs[1].Text);
Console.WriteLine(enumPara.Runs[1].Text);

I hope this will help.

Best Regards,

Thank you for your reply. It is very helpful, but unfortunately this approach does not do what I need, because I can never know where exactly the Name or ENum will be in the document. What I need to do is search for Name or ENum (they will always be static text in the document) anywhere in the document and return their value. Please let me know how I can accomplish that.

I found the code below in Java. Can you tell me how I can make it work with C#? This is exactly what I need.

// Open document.
Document doc = new Document(@"Test001\in.doc");

// Get product ID. In the document we have text like this: Product ID: 00000
// It is easy to find value of the Product ID using regular expressions.
Regex regex = new Regex(@"Product ID: (?<id>\d+)", RegexOptions.IgnoreCase);
doc.Range.Replace(regex, new ReplaceEvaluator(GetID), true);

=====================================================================

private static ReplaceAction GetID(object sender, ReplaceEvaluatorArgs e)
{
    // Get ID from the named group "id" of the match.
    string id = e.Match.Groups["id"].Value;
    Console.WriteLine(id);

    // Skip the replacement so the document text is left unchanged.
    return ReplaceAction.Skip;
}

Hi,


Thanks for your inquiry.
You can use a user-defined replace evaluator in C# as shown below:


// Open document.
Document doc = new Document("D:/temp/temp.docx");

// Get product ID. In the document we have text like this: Product ID: 00000
// It is easy to find value of the Product ID using regular expressions.
Regex regex = new Regex(@"Product ID: (?<id>\d+)", RegexOptions.IgnoreCase);

//doc.Range.Replace(regex, new ReplaceEvaluator(GetID), true);
doc.Range.Replace(regex, new InsertDocumentAtReplaceHandler(), false);


public class InsertDocumentAtReplaceHandler : IReplacingCallback
{
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // Get ID from the named group "id" of the match.
        string id = e.Match.Groups["id"].Value;
        Console.WriteLine(id);

        // Skip the replacement so the document text is left unchanged.
        return ReplaceAction.Skip;
    }
}


In case of any ambiguity, please let me know.

This looks very promising, although I tested it with my attached file and it did not give me the message. Can you try it and let me know if it works for you? This is what I have:

public class InsertDocumentAtReplaceHandler : IReplacingCallback
{
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // Get ID.
        string id = e.Match.Groups["id"].Value;
        MessageBox.Show(id);
        return ReplaceAction.Skip;
    }
}

Aspose.Words.Document doc = new Aspose.Words.Document(myAttached file);

Regex regex = new Regex(@"ENum: (?<id>\d+)", RegexOptions.IgnoreCase);

doc.Range.Replace(regex, new InsertDocumentAtReplaceHandler(), false);

If you look at my attached file, I should be seeing: 03623. Thanks again for your time.

Hi,


Thanks for your inquiry. I had tested the code with a sample document containing text like "Product ID: 00000". In your attached document, I see a tab character (\t) between "ENum:" and the ID, so a pattern with a space after the colon does not match.

Please use the following line:
Regex regex = new Regex(@"ENum:\t(?<id>\d+)", RegexOptions.IgnoreCase);

Hope this will help you.
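The effect of the tab can be checked without Aspose.Words at all. The following is only a minimal sketch using the standard .NET Regex class on a plain string; the class name LabelSearchDemo, the helper FindNumber and the sample text are hypothetical and not part of Aspose.Words. It assumes the label and its value sit on the same line, separated by spaces, tabs, or nothing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelSearchDemo
{
    // Returns the digits that follow "<label>:" or null when the label is not found.
    // [ \t]* accepts any mix of spaces and tabs (or nothing) after the colon.
    public static string FindNumber(string text, string label)
    {
        var regex = new Regex(
            Regex.Escape(label) + @":[ \t]*(?<id>\d+)",
            RegexOptions.IgnoreCase);

        Match match = regex.Match(text);
        return match.Success ? match.Groups["id"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical sample text: label and value are separated by a tab.
        string text = "ENum:\t03623";

        // A pattern with a literal space after the colon does not match a tab.
        Console.WriteLine(Regex.IsMatch(text, @"ENum: (?<id>\d+)")); // False

        // The tolerant pattern finds the value either way.
        Console.WriteLine(FindNumber(text, "ENum"));                 // 03623
        Console.WriteLine(FindNumber("ENum: 03623", "ENum"));        // 03623
    }
}
```

The same pattern string could be passed to doc.Range.Replace in place of the "\t" version if the separator in your documents is not always a tab.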


I modified my code as suggested and I am still not able to read my text. I put a breakpoint in the class and it never gets hit. Is there any reason why? Please advise. Thanks.

Hi,


Thanks for your inquiry. Please test this code in a new, simple application. I have attached the code snippet along with a screenshot of the output. I am using the latest Aspose.Words 10.8 (.NET 2.0).

Input Word document (your provided document): 2011 Interview Report.doc

Hope this will help you. If anything is unclear, please share further details such as your Aspose.Words version, .NET Framework version, etc.

This worked like a charm, although I have another request. How can I search for multiple strings? Let's say I want to search for ENum and Name.

How can I modify my Regex pattern? This is my current pattern to search for ENum:

Regex regex = new Regex(@"ENum:\t(\d+)",RegexOptions.IgnoreCase);

How can I search for multiple fields, maybe Name, Group, etc.?

Please Advise.

Thank You.

Hi,


Thanks for your inquiry. You can extract additional information such as Name, Office, Group, etc. by using the following code snippet:

// Open document.
Document doc = new Document("D:/temp/2011InterviewReport.doc");

// "title" captures the label before the colon, "id" captures the value after the tab.
Regex regex = new Regex(@"(?<title>[a-z]{2,3} +[a-z]*|[\w'-]*):\t(?<id>[a-z]{2,3} +[a-z]*|[\w'-, ]*)", RegexOptions.IgnoreCase);

doc.Range.Replace(regex, new InsertDocumentAtReplaceHandler(), false);


public class InsertDocumentAtReplaceHandler : IReplacingCallback
{
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // Get the value and its label from the named groups.
        string id = e.Match.Groups["id"].Value;
        string title = e.Match.Groups["title"].Value.Trim();

        Console.WriteLine(title + ":\t\t" + id);

        // Skip the replacement so the document text is left unchanged.
        return ReplaceAction.Skip;
    }
}


Hope this will help.
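If the values need to be looked up by label afterwards rather than printed, one option is to collect every "Label:<tab>value" pair into a dictionary. The following is only a sketch on a plain string using standard .NET classes, for example on text obtained with the plain-text retrieval approach linked earlier in this thread. The class FieldExtractor, the simplified pattern and the sample values "Jane Doe" and "Sales" are made up for illustration, and the sketch assumes paragraphs in the text are separated by "\r" or "\n".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldExtractor
{
    // One "Label:<tab>value" pair per paragraph. The label stops at the colon,
    // the value stops at the next tab or paragraph break.
    private static readonly Regex FieldPattern =
        new Regex(@"(?<title>[^:\t\r\n]+):\t(?<id>[^\t\r\n]*)");

    public static Dictionary<string, string> Extract(string text)
    {
        // Case-insensitive keys, so "ENum" and "enum" refer to the same field.
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (Match match in FieldPattern.Matches(text))
        {
            string title = match.Groups["title"].Value.Trim();
            string value = match.Groups["id"].Value.Trim();
            fields[title] = value; // a repeated label keeps its last value
        }

        return fields;
    }

    public static void Main()
    {
        // Hypothetical sample text standing in for the document's plain text.
        string text = "Name:\tJane Doe\rENum:\t03623\rGroup:\tSales\r";

        Dictionary<string, string> fields = Extract(text);
        Console.WriteLine(fields["ENum"]); // 03623
        Console.WriteLine(fields["name"]); // Jane Doe
    }
}
```

The pattern here is deliberately looser than the one above: it accepts any label text before the colon, so it may also pick up lines that merely happen to contain a colon followed by a tab.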