Find text strings and save to new table in Word

Hello, I have a well-formatted Word document that describes a database/XML schema.
Each field in the document has a friendly name (Heading 3 style), followed by XML Element:, Definition:, and Reason: (bold normal text) and the text associated with each of these headings. I want to find all occurrences of the text, e.g. Element:, and then extract the text to a new table in a new document (with a view to eventually copying it into an Excel spreadsheet). Below is an example of what the document text looks like:
Family name
XML Element:
episodeDetails/Element:familyName

Definition:
The last or family name or surname …

How can I achieve this using Aspose.Words? I have a PDF version of the document as well. Is it easier to do with Aspose.PDF?
Thanks.

@pmick Could you please attach your sample input MS Word document and the expected output here for our reference? This will allow us to better understand your requirements. We will check your documents and provide more information.

output_example.docx (12.7 KB)
Here are example files.

Schema_example.docx (13.7 KB)

Thanks,
Paul

@pmick The internal structure of your document looks like this:

As you can see, the data in your document is in a table, so you can use code like this to parse such a document:

using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Tables;

Dictionary<string, string> keyValueCollection = new Dictionary<string, string>();

// Open the source document.
Document doc = new Document(@"C:\Temp\in.docx");

// The data in the source document is inside a table. Rows with data have an embedded table with the data we are interested in.
Table mainTable = doc.FirstSection.Body.Tables[0];
// The first row contains variable.
keyValueCollection.Add("Variable", mainTable.Rows[0].ToString(SaveFormat.Text).Trim());
// The second row contains the variable name.
string row2Text = mainTable.Rows[1].ToString(SaveFormat.Text);
keyValueCollection.Add("Variable name", row2Text.Substring(row2Text.LastIndexOf(":") + 1).Trim());
// The third row is empty and the fourth contains definition.
string row4Text = mainTable.Rows[3].ToString(SaveFormat.Text);
keyValueCollection.Add("Definition", row4Text.Substring(row4Text.IndexOf(":") + 1).Trim());
// The 5th row is empty and the 6th row contains the reason.
string row6Text = mainTable.Rows[5].ToString(SaveFormat.Text);
keyValueCollection.Add("Reason", row6Text.Substring(row6Text.IndexOf(":") + 1).Trim());
// And the table in the 7th row contains data type definition.
// The data we are interested in is in the last row.
Table dataTypeTable = (Table)mainTable.Rows[6].GetChild(NodeType.Table, 0, true);
keyValueCollection.Add("Data type", dataTypeTable.LastRow.ToString(SaveFormat.Text).Trim());

// Now build a table with the extracted data.
Document result = new Document();
DocumentBuilder builder = new DocumentBuilder(result);
builder.StartTable();
foreach (string key in keyValueCollection.Keys)
{
    builder.InsertCell();
    builder.ParagraphFormat.Alignment = ParagraphAlignment.Center;
    builder.Write(key);
}
builder.EndRow();
foreach (string val in keyValueCollection.Values)
{
    builder.InsertCell();
    builder.ParagraphFormat.Alignment = ParagraphAlignment.Left;
    builder.Write(val);
}
// Close the second row and the table before saving.
builder.EndRow();
builder.EndTable();

// Save the result.
result.Save(@"C:\Temp\out.docx");

Here is the output: out.docx (7.3 KB)

Many thanks, will give it a try. Incidentally, how do you view the internal structure of the Word doc?

@pmick That is the Aspose.Words DocumentExplorer demo application. It shows how a document is represented in the Aspose.Words DOM.

Hello Alexey,

I have got the code working fine on a single table. However, the document contains 160 field definitions all in the same table format. How would I loop through all the tables in the document to extract all definitions?

Many thanks,
Paul

@pmick If all the tables are in a single section in the document, you can use the following code to loop through the tables:

Document doc = new Document(@"C:\Temp\in.docx");
// Loop through all tables in the main body of the first section.
foreach (Table table in doc.FirstSection.Body.Tables)
{
    // Do something with table.
}
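
If the tables are spread over more than one section, the same loop can be wrapped in a loop over the sections. The following is only a sketch: the class and method names (TableWalker, CountTopLevelTables) are made up for illustration, and it assumes the tables of interest are direct children of each section body, as in the sample document.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Tables;

public static class TableWalker
{
    // Counts the tables that sit directly in the body of any section.
    // Hypothetical helper, not part of Aspose.Words.
    public static int CountTopLevelTables(Document doc)
    {
        int count = 0;
        // Visit every section, not only the first one.
        foreach (Section section in doc.Sections)
        {
            // Body.Tables contains only tables that are immediate children of the body,
            // so a table embedded in a cell of another table is not visited here.
            foreach (Table table in section.Body.Tables)
            {
                count++;
                // Do something with table.
            }
        }
        return count;
    }
}
```

Calling TableWalker.CountTopLevelTables(new Document(@"C:\Temp\in.docx")) is also a quick way to check whether the number of tables the loop sees matches what you expect.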

Thanks for this. What I need to do is loop through each table, extract the values and insert them into a new table. I tried the code below, but it doesn’t work: when it processes the second table I get a message saying “Variable already exists”, so I don’t think key/value pairs will work. Does it need an array of some type? I need to create the table column headings first and then loop through the tables, extract the values and insert them into table rows:

foreach (Table myTable in doc.FirstSection.Body.Tables)
{
    // Do something with table.
    // The first row contains variable.
    row1Text = myTable.Rows[0].ToString(SaveFormat.Text).Trim();
    keyValueCollection.Add("Variable", row1Text);
    // The second row contains the variable name.
    row2Text = myTable.Rows[1].ToString(SaveFormat.Text);
    keyValueCollection.Add("Variable name", row2Text.Substring(row2Text.LastIndexOf(":") + 1).Trim());
}

I’ve started building an array which seems to work okay:
// Declare the array.
string[,] MyArray = new string[3, 3];
string Myrow1Text;
string Myrow2Text;
string Myrow3Text;
int i = 0;
// Loop through all tables in the main body of the first section.
foreach (Table myTable in doc.FirstSection.Body.Tables)
{
    // The first row contains the variable.
    Myrow1Text = myTable.Rows[0].ToString(SaveFormat.Text).Trim();
    MyArray[i, 0] = Myrow1Text;
    // The second row contains the variable name.
    Myrow2Text = myTable.Rows[1].ToString(SaveFormat.Text);
    Myrow2Text = Myrow2Text.Substring(Myrow2Text.LastIndexOf(":") + 1).Trim();
    MyArray[i, 1] = Myrow2Text;
    // The third row is empty and the fourth contains the definition.
    Myrow3Text = myTable.Rows[3].ToString(SaveFormat.Text);
    Myrow3Text = Myrow3Text.Substring(Myrow3Text.IndexOf(":") + 1).Trim();
    MyArray[i, 2] = Myrow3Text;
    i = i + 1;
}
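
The error reported above is consistent with how Dictionary<TKey, TValue>.Add behaves: it throws an ArgumentException when the key is already present, and the key "Variable" is added once per table. One possible way around it, sketched below, is to keep one dictionary per table in a list. The class name RecordDemo, the method BuildRecords and the second sample record ("Given name") are invented for illustration; only "Family name"/"familyName" comes from the example in this thread.

```csharp
using System;
using System.Collections.Generic;

public static class RecordDemo
{
    // Builds one dictionary per input row instead of sharing a single dictionary.
    // Each row is expected to hold { variable, variable name }.
    public static List<Dictionary<string, string>> BuildRecords(string[][] rows)
    {
        var records = new List<Dictionary<string, string>>();
        foreach (string[] row in rows)
        {
            // A fresh dictionary per record, so "Variable" is added only once to each.
            var record = new Dictionary<string, string>();
            record.Add("Variable", row[0]);
            record.Add("Variable name", row[1]);
            records.Add(record);
        }
        return records;
    }
}
```

A fixed-size array such as string[3, 3] also works for a small sample, but it has to be sized in advance, whereas the list grows with the number of tables.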

@pmick If the keys in your document are known, you can use a DataTable with defined columns. Something like this:

// Define data table
DataTable dt = new DataTable();
dt.Columns.Add("First");
dt.Columns.Add("Second");
dt.Columns.Add("Third");

// Add rows.
DataRow dr = dt.NewRow();
dt.Rows.Add(dr);
dr["First"] = "Some value";
dr["Second"] = "Some value";
dr["Third"] = "Some value";

// And so on in the loop for the other rows...
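
Putting the DataTable together with the table loop from earlier in this thread could look roughly like the sketch below. It assumes the row layout described above (variable in the first row, variable name in the second, definition in the fourth); SchemaExtractor, ValueAfterColon and Extract are illustrative names, not Aspose.Words API.

```csharp
using System;
using System.Data;
using Aspose.Words;
using Aspose.Words.Tables;

public static class SchemaExtractor
{
    // Returns the trimmed text after the first (or last) colon.
    // If the text contains no colon, the whole trimmed text is returned.
    public static string ValueAfterColon(string text, bool useLastColon)
    {
        int index = useLastColon ? text.LastIndexOf(':') : text.IndexOf(':');
        return text.Substring(index + 1).Trim();
    }

    // Adds one DataRow per table in the body of the first section.
    public static DataTable Extract(Document doc)
    {
        DataTable dt = new DataTable();
        dt.Columns.Add("Variable");
        dt.Columns.Add("Variable name");
        dt.Columns.Add("Definition");

        foreach (Table table in doc.FirstSection.Body.Tables)
        {
            // Assumption: a field definition table has at least four rows; skip anything else.
            if (table.Rows.Count < 4)
                continue;

            DataRow dr = dt.NewRow();
            dr["Variable"] = table.Rows[0].ToString(SaveFormat.Text).Trim();
            dr["Variable name"] = ValueAfterColon(table.Rows[1].ToString(SaveFormat.Text), true);
            dr["Definition"] = ValueAfterColon(table.Rows[3].ToString(SaveFormat.Text), false);
            dt.Rows.Add(dr);
        }
        return dt;
    }
}
```

Because each table produces its own DataRow, no key is ever added twice, and the column names are defined once before the loop.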

Thanks, will give that a go.


I’ve got my code working okay on my sample document but when I try running it on the full document the processing stops after 7 iterations. I thought there might be more than one section in the document but there only appears to be one. The loop stops after the “Ineligible for NHS” element in the document. Am I missing something obvious? Here is the code and I will attach the doc. Thanks for your help.

// Define data table
DataTable dt = new DataTable();
dt.Columns.Add("Variable");
dt.Columns.Add("Variable name");
dt.Columns.Add("Definition");
dt.Columns.Add("Reason");
try
{
    // Open the source document.
    Document doc = new Document(@"C:\Users\isspgm\OneDrive - University of Leeds\Documents\Lee Excel Dataset Schema manuals\PICANet Admission Schema Manual v1_7.docx");
    int sectionCount = doc.Sections.Count;
    string Myrow1Text; string Myrow2Text; string Myrow3Text; string Myrow4Text; int elementCount = 0;
    // loop through all tables in the main body of the first section.
    foreach (Table myTable in doc.FirstSection.Body.Tables)
    {
        // Do something with table.
        // The first row contains variable. 
        // Add new row
        elementCount += 1;
        DataRow dr = dt.NewRow();
        Myrow1Text = myTable.Rows[0].ToString(SaveFormat.Text).Trim();
        string[] Myresult = Myrow1Text.Split(new string[] { "\r\r" }, StringSplitOptions.RemoveEmptyEntries);
        dr["Variable"] = Myresult[0];
        //dr["Variable"] = Myrow1Text;
        // The second row contains the variable name.
        Myrow2Text = Myresult[1];
        //Myrow2Text = myTable.Rows[1].ToString(SaveFormat.Text);
        Myrow2Text = Myrow2Text.Substring(Myrow2Text.LastIndexOf(":") + 1).Trim();
        dr["Variable name"] = Myrow2Text;
        // The third row is empty and the fourth contains definition.
        //Myrow3Text = myTable.Rows[3].ToString(SaveFormat.Text);
        Myrow3Text = Myresult[2];
        Myrow3Text = Myrow3Text.Substring(Myrow3Text.IndexOf(":") + 1).Trim();
        dr["Definition"] = Myrow3Text;
        //Myrow4Text = Myresult[3];
        //Myrow4Text = Myrow4Text.Substring(Myrow4Text.IndexOf(":") + 1).Trim();
        //dr["Reason"] = Myrow4Text;
        dt.Rows.Add(dr);
    }
    lblCount.Text = elementCount.ToString();
}
catch (ArgumentException e)
{
    lblError.Text = e.Message;
    throw;
}



// Now build a table with the extracted data.
Document result = new Document();
DocumentBuilder builder = new DocumentBuilder(result);
builder.StartTable();

foreach (DataColumn column in dt.Columns)
{
    builder.InsertCell();
    builder.ParagraphFormat.Alignment = ParagraphAlignment.Left;
    builder.Write(column.ColumnName.ToString());
}
builder.EndRow();
foreach (DataRow row in dt.Rows)
{
    foreach (var item in row.ItemArray)
    {
        builder.InsertCell();
        builder.ParagraphFormat.Alignment = ParagraphAlignment.Left;
        builder.Write(item.ToString());
    }
    builder.EndRow();
}

Admission Schema Manual PGM.docx (151.1 KB)

@pmick Most likely, the problem occurs because Aspose.Words is being used in evaluation mode, i.e. without a license. In evaluation mode Aspose.Words limits the maximum size of the processed document to several hundred paragraphs, so your document is truncated.
If you would like to test the new version of Aspose.Words without the evaluation version limitations, you can request a free 30-day temporary license.
Please see our documentation to learn more about licensing:
https://docs.aspose.com/words/net/licensing/
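
Once you have a license file, it is applied in code before the first document is loaded. A minimal sketch, where "Aspose.Words.lic" is a placeholder file name and the document path is the one used earlier in this thread:

```csharp
using Aspose.Words;

// Apply the license once, before creating any Document instance.
License license = new License();
// Placeholder file name: replace it with the name or path of your own license file.
license.SetLicense("Aspose.Words.lic");

// Documents loaded after this point are processed without the evaluation limits.
Document doc = new Document(@"C:\Temp\in.docx");
```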

Ah I wondered if that was the issue! I created my account with personal email rather than work email. Is there a way of updating my email address or do I need to create a second account?

Thanks,
Paul

@pmick You can create another account if you would like to purchase the license with your work email. If you would simply like to get a temporary license for testing, you can use your personal account.

I get an error message when applying for temp licence:

Your request for a temporary license has not been successful because your account was created using a free email.

You need to log in using a business/company account.

@pmick I see. In that case you will need to create a new account with a business email address.