Best way to parse a word document?

jpollar · October 31, 2012, 2:12pm

I am trying to parse word documents that contain job requisitions into XML.
The problem I’m having is that the fields are not very standardized.
For example, I’ll get one job in the following format:
----------------------------------------------------
JobID: 123
Title: .Net Developer
Description: blah blah blah
Mandatory Skills: blah blah blah
----------------------------------------------------
Desired XML Result:
123
.Net Developer
blah blah blah
blah blah blah
----------------------------------------------------
And then the next one might look like this:
----------------------------------------------------
JobID: 222
Title: SharePoint Developer
Rate: $80/hr
Description: blah blah blah
blah blah blah
----------------------------------------------------
Desired XML Result:
222
SharePoint Developer
$80/hr
blah blah blah
blah blah blah
----------------------------------------------------
As you can see, each field starts with a descriptor followed by a colon, and the value may continue onto the following lines.
I need a way to create an XML tag for each descriptor and put its value inside.
Any ideas???

awais.hafeez · November 1, 2012, 12:15pm

Hi James,

Thanks for your inquiry. Could you please attach your sample Word document here for testing? I will investigate the structure of your document and provide you with code to achieve what you’re looking for.

Best Regards,

jpollar · November 1, 2012, 12:38pm

Attached.
Thanks for your help.

awais.hafeez · November 2, 2012, 11:33am

Hi James,

Thanks for your inquiry. The easiest way I can suggest to parse this document is to first convert it to TXT format, then read the TXT file line by line, search for a colon character within each line, and split the line into left and right parts. That is, if a colon is found, keep the left part (from the beginning of the line to the colon) as an XML node and store the right part as its data until you find another colon in some other line. I hope this helps.
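As a rough illustration of the splitting step only, assuming the document has already been saved as plain text, a helper along these lines could separate a descriptor from its value. LineSplitter and TrySplit are made-up names for this sketch, not Aspose APIs, and it splits at the first colon so that values which themselves contain a colon stay intact:

```csharp
using System;

public static class LineSplitter
{
    // Splits a line such as "Title: .Net Developer" at its first colon.
    // Returns false (and leaves key/value null) when the line has no colon
    // or nothing but whitespace before it.
    public static bool TrySplit(string line, out string key, out string value)
    {
        key = null;
        value = null;

        int colon = line.IndexOf(':');
        if (colon < 0)
            return false;

        string left = line.Substring(0, colon).Trim();
        if (left.Length == 0)
            return false;

        key = left;
        value = line.Substring(colon + 1).Trim();
        return true;
    }
}
```

Lines for which TrySplit returns false would be treated as a continuation of the previous descriptor's value.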

Best Regards,

jpollar · November 2, 2012, 11:54am

I’m already able to load this into an object using Aspose and parse it using search functions for each node using keywords.
The problem is that it’s not the most efficient way.
I don’t know how to do what you describe below and get the desired results. That’s the entire problem. Thanks for trying though.
“That is, if a colon is found, keep the left part (from the beginning of the line to the colon) as an XML node and store the right part as its data until you find another colon in some other line.”

awais.hafeez · November 5, 2012, 7:15am

Hi James,

Thanks for your inquiry. Here is how you can roughly tokenize your template and build a Hashtable. You can then serialize this Hashtable into an XmlDocument. Please see the following code snippet:

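// Requires: using System.Collections; using System.Text; using Aspose.Words;
Document doc = new Document(@"C:\Temp\JobID-2222.doc");
// Note: Hashtable does not preserve the order in which keys are added.
Hashtable pairs = new Hashtable();
StringBuilder sb = new StringBuilder();
string currentKey = null;
NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);
for (int i = 0; i < paras.Count; i++)
{
    Paragraph p = (Paragraph) paras[i];
    // GetText() includes the paragraph break character, so trim it off.
    string text = p.GetText().Trim();
    int colon = text.IndexOf(':');
    if (colon > 0)
    {
        // A new descriptor starts here: store the value collected for the previous one.
        if (currentKey != null)
            pairs[currentKey] = sb.ToString();
        // Split at the first colon only, so values that contain a colon stay intact.
        currentKey = text.Substring(0, colon).Trim();
        sb.Clear();
        sb.Append(text.Substring(colon + 1).Trim());
    }
    else if (currentKey != null && text.Length > 0)
    {
        // No colon: this paragraph continues the value of the current descriptor.
        sb.Append('\n').Append(text);
    }
}
// Store the value of the last descriptor, which is not followed by another colon line.
if (currentKey != null)
    pairs[currentKey] = sb.ToString();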
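
For the serialization step, here is a minimal sketch using LINQ to XML rather than XmlDocument. It rests on a few assumptions: JobXml, ToElementName and the root element name "Job" are made up for illustration, and the pairs are passed as an ordered list because a Hashtable does not keep the order in which keys were added. Descriptors such as "Mandatory Skills" contain spaces, which are not allowed in XML element names, so the sketch strips everything except ASCII letters, digits and underscores:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class JobXml
{
    // Turns a descriptor such as "Mandatory Skills" into "MandatorySkills".
    // A leading underscore is added when the result would not start with a letter.
    public static string ToElementName(string descriptor)
    {
        string name = new string(descriptor
            .Where(c => (c < 128 && char.IsLetterOrDigit(c)) || c == '_')
            .ToArray());
        if (name.Length == 0 || !char.IsLetter(name[0]))
            name = "_" + name;
        return name;
    }

    // Builds one element per descriptor/value pair, in the order given,
    // under a root element named "Job" (an assumed name).
    public static XElement ToXml(IEnumerable<KeyValuePair<string, string>> pairs)
    {
        XElement job = new XElement("Job");
        foreach (KeyValuePair<string, string> pair in pairs)
            job.Add(new XElement(ToElementName(pair.Key), pair.Value));
        return job;
    }
}
```

If the output order matters, collecting the pairs into a List of KeyValuePair instead of a Hashtable while parsing would let the XML follow the order of the fields in the document.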

I hope this helps.

Best Regards,

jpollar · November 5, 2012, 9:29am

Thanks.
I will give that a shot.

awais.hafeez · November 6, 2012, 9:12am

Hi James,

Sure, please let us know any time you have any further queries. We’re always glad to help you.

Best Regards,