Free Support Forum - aspose.com

Data Mining - Extraction from Word document

I am trying to extract data from a standard MS Word document. I’ve attached a document for you to see. In the Personal Statement area I need to pull the text information down to the list items and then pull the list items individually. I’ve searched many of your posts but can’t seem to find a link that is current. Is this possible with Aspose.Words?

biosketch-sample.zip (33.7 KB)

@morgenweck,

Thanks for your inquiry. With Aspose.Words, you can modify the Word document. You can use the Node.Clone method to clone a paragraph and then insert the clone after the desired paragraph (e.g. a list item) using the CompositeNode.InsertAfter method, which inserts the specified node immediately after the specified reference node. Paragraph.IsListItem returns true when the paragraph is an item in a bulleted or numbered list in the original revision.

If you still face problems, please ZIP and attach your expected output Word document here for our reference. We will then provide more information about your query, along with code.

Thank you for your reply. I’m not looking to create or modify a document but rather to extract all of its pieces in some fashion so I can put them into a database. What I would love to get from the sample that was uploaded is something like:

Dim Name as string = “Hunt, Morgan Casey” based upon finding what is after “NAME:”
Dim USERNAME as String = “huntmx” based upon what is after “COMMONS USER NAME”
Dim POSITION as string = “Associate Professor of Psychology” based upon what is after "POSITION TITLE: "

Read table and loop the rows to get the data

Dim t0r0c0 as string = TableCell0 = “University of California, Berkeley”
Dim t0r0c1 as string = TableCell0 = “BS”
Dim t0r0c2 as string = TableCell0 = “05/1990”
Dim t0r0c3 as string = TableCell0 = “Psychology”

Same for other rows

Then

Dim PS as string = All of the paragraphs in the A. Personal Statement section up to the bulleted list

Then read the list items:

List item 0 =“Merryle, R.J. & Hunt, M.C. (2004). Independent living, physical disability and substance abuse among the elderly. Psychology and Aging, 23(4), 10-22.”

List item 2 = “Hunt, M.C., Jensen, J.L. & Crenshaw, W. (2007). Substance abuse and mental health among community-dwelling elderly. International Journal of Geriatric Psychiatry, 24(9), 1124-1135.”

List item 3 = “Hunt, M.C., Wiechelt, S.A. & Merryle, R. (2008). Predicting the substance-abuse treatment needs of an aging population. American Journal of Public Health, 45(2), 236-245. PMCID: PMC9162292 Hunt, M.C., Newlin, D.B. & Fishbein, D. (2009). Brain imaging in methamphetamine abusers across the life-span. Gerontology, 46(3), 122-145.”

I need to extract all of the sections. Does this make sense? Is this possible?

Thanks
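
The "value after a label" part of this request can be sketched independently of any Word library. The following is a minimal illustration in Python, assuming the paragraph texts have already been pulled out of the document as plain strings; the function name `extract_fields` and the exact wording of the sample lines are hypothetical, and no Aspose.Words calls are used.

```python
def extract_fields(paragraphs, labels):
    """Map each label to the text that follows it in the first paragraph containing it.

    `paragraphs` is a list of plain strings (assumed to be extracted already).
    """
    fields = {}
    # Try longer labels first so "COMMONS USER NAME" is not mistaken for "NAME".
    ordered = sorted(labels, key=len, reverse=True)
    for text in paragraphs:
        for label in ordered:
            idx = text.find(label)
            if idx != -1:
                # Keep what follows the label, dropping the separator colon and spaces.
                value = text[idx + len(label):].lstrip(" :\t").strip()
                fields.setdefault(label, value)
                break
    return fields


# Hypothetical sample lines modelled on the values quoted in the post above.
sample = [
    "NAME: Hunt, Morgan Casey",
    "eRA COMMONS USER NAME: huntmx",
    "POSITION TITLE: Associate Professor of Psychology",
]
result = extract_fields(sample, ["NAME", "COMMONS USER NAME", "POSITION TITLE"])
print(result["COMMONS USER NAME"])
```

Matching the longest label first is a design choice for this sketch: a short label such as "NAME" also occurs inside "COMMONS USER NAME", so checking it last avoids assigning the wrong paragraph to it.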

I got the table cell value with


That was a great help. Still working on the other items

@Bill,

Thanks for your inquiry. Yes, you can extract content from the document and save it according to your requirements. We suggest reading about the Aspose.Words document object model here:
Aspose.Words Document Object Model
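
For the Personal Statement part of the question, the logic of walking the paragraphs in order can be sketched as follows. This is only an illustration in Python under stated assumptions: each paragraph is represented as a `(text, is_list_item)` pair, where the flag stands in for the Paragraph.IsListItem property mentioned earlier in the thread; the function name `split_section` and the sample texts are hypothetical placeholders, and no Aspose.Words calls are used.

```python
def split_section(paragraphs, heading):
    """Return (statement_text, list_items) for the section introduced by `heading`.

    `paragraphs` is an ordered list of (text, is_list_item) pairs.
    The section ends at the first non-list paragraph after its list.
    """
    statement, items = [], []
    in_section = False
    for text, is_list_item in paragraphs:
        if not in_section:
            # Skip everything up to and including the heading paragraph.
            in_section = heading in text
            continue
        if is_list_item:
            items.append(text.strip())
        elif items:
            # A normal paragraph after the list means the section is over.
            break
        else:
            statement.append(text.strip())
    return "\n".join(statement), items


# Placeholder paragraphs, not taken from the attached document.
doc = [
    ("A. Personal Statement", False),
    ("First statement paragraph.", False),
    ("Second statement paragraph.", False),
    ("First publication.", True),
    ("Second publication.", True),
    ("B. Positions and Honors", False),
]
statement, items = split_section(doc, "Personal Statement")
print(statement)
print(items)
```

The statement text and each list item come back as separate values, which matches the one-variable-per-piece layout asked for above and can be written to database columns directly.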