Free Support Forum - aspose.com

Read a word doc into a database

I want to read a Word document, identify every paragraph, every style, and every element in it, and store the matching patterns in a database.
I’m using C#.NET. How can I do this using Aspose.Words?

Hi

Thanks for your request. First of all, I think you should see the following link to learn more about the Aspose.Words Document Object Model.

You can loop through all child nodes of your document and get the formatting of these nodes.

Alternatively, you can just save the document in the DB. See the following code to learn how to write a document to the DB and read it back.

/// <summary>
/// This example shows how to write document into the database.
/// </summary>
public void Example002()
{
    // Create the connection.
    string connectionString = "server=srv;database=TestDB;uid=sa;pwd=psw;";
    System.Data.SqlClient.SqlConnection conn = new System.Data.SqlClient.SqlConnection(connectionString);
    // Open the DOC file using Aspose.Words.
    Document doc = new Document(@"C:\Temp\in.doc");
    // ...You can merge data/manipulate document content here.
    MemoryStream stream = new MemoryStream();
    // Save the document to the memory stream.
    doc.Save(stream, SaveFormat.Doc);
    // Create the SQL command.
    string commandString = "INSERT INTO [Documents] VALUES(@Doc)";
    System.Data.SqlClient.SqlCommand command = new System.Data.SqlClient.SqlCommand(commandString, conn);
    // Add the @Doc parameter. ToArray() returns only the bytes written;
    // GetBuffer() would also include unused bytes of the internal buffer.
    command.Parameters.AddWithValue("@Doc", stream.ToArray());
    // Open the connection.
    conn.Open();
    // Write the document to the DB.
    command.ExecuteNonQuery();
    // Close the DB connection.
    conn.Close();
}
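
A note on the byte array passed to the command above: MemoryStream.GetBuffer() returns the stream's whole internal buffer, which is usually larger than the data written to it, while MemoryStream.ToArray() returns only the written bytes. The small check below illustrates the difference. The class and method names are made up for this illustration and are not part of Aspose.Words.

```csharp
using System;
using System.IO;

public static class BufferDemo
{
    // Hypothetical helper: writes `count` zero bytes to a MemoryStream and
    // returns { ToArray().Length, GetBuffer().Length } for comparison.
    public static int[] Lengths(int count)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            stream.Write(new byte[count], 0, count);
            // ToArray() copies exactly the bytes written so far;
            // GetBuffer() exposes the underlying buffer, including unused capacity.
            return new[] { stream.ToArray().Length, stream.GetBuffer().Length };
        }
    }
}
```

Storing the result of GetBuffer() can therefore append trailing zero bytes to the saved document, so ToArray() is the safer choice when the whole stream content is needed.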


/// <summary>
/// This example shows how to read a document from the database.
/// </summary>
public void Example003()
{
    //Create connection
    string connectionString = "server=srv;database=TestDB;uid=sa;pwd=psw;";
    System.Data.SqlClient.SqlConnection conn = new System.Data.SqlClient.SqlConnection(connectionString);
    //Create DataSet
    DataSet ds = new DataSet();
    //Create sql command
    string commandString = "SELECT FileContent FROM Documents WHERE ID=1";
    System.Data.SqlClient.SqlCommand command = new System.Data.SqlClient.SqlCommand(commandString, conn);
    //Create data adapter
    System.Data.SqlClient.SqlDataAdapter adapter = new System.Data.SqlClient.SqlDataAdapter(command);
    //Open connection
    conn.Open();
    //Read dataset
    adapter.Fill(ds);
    conn.Close();
    //Save document to hard disk
    if (ds.Tables.Count > 0)
    {
        if (ds.Tables[0].Rows.Count > 0)
        {
            byte[] buffer = (byte[])ds.Tables[0].Rows[0][0];
            MemoryStream stream = new MemoryStream(buffer);
            Document doc = new Document(stream);
            doc.Save(@"C:\Temp\out.doc");
        }
    }
}

I hope this information is useful to you. Please let me know if you run into any issues.

Best regards.

Thank you, Alexey,

I think you misunderstood my question. I have many Word files that I need to read, extracting every title, section, paragraph, style, image, table, cell, etc., and storing their text in the corresponding tables (sections, paragraphs, styles, etc.) or creating a structured XML file containing all the data in sequence.

I found something interesting on your site about DocumentVisitor and the example that comes with it, but I don’t know if it’s the best way to proceed. If it is, can you please send me more examples covering all the elements that can occur in Word documents using DocumentVisitor? If not, what is the best way?

Best Regards

Hi

Thanks for your request. Yes, I think using DocumentVisitor is the best way to achieve what you need.

You can find a nice example here:
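
The linked example is not reproduced in this thread. As a rough illustration of the approach, and not the linked example itself, a visitor might look like the sketch below. It assumes a recent Aspose.Words for .NET release in which Table and Cell live in the Aspose.Words.Tables namespace; the class name StructureDumpVisitor and its output format are invented for this illustration.

```csharp
using System.Text;
using Aspose.Words;
using Aspose.Words.Tables;

// Hypothetical visitor that records the document structure as indented text.
public class StructureDumpVisitor : DocumentVisitor
{
    private readonly StringBuilder output = new StringBuilder();

    // The collected structure dump.
    public string Result
    {
        get { return output.ToString(); }
    }

    public override VisitorAction VisitSectionStart(Section section)
    {
        output.AppendLine("Section");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitParagraphStart(Paragraph paragraph)
    {
        // StyleName is the name of the paragraph style applied to this paragraph.
        output.AppendLine("  Paragraph [style: " + paragraph.ParagraphFormat.StyleName + "]");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitRun(Run run)
    {
        output.AppendLine("    Run: " + run.Text);
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitTableStart(Table table)
    {
        output.AppendLine("  Table");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitCellStart(Cell cell)
    {
        output.AppendLine("    Cell");
        return VisitorAction.Continue;
    }
}

// Usage (the path is a placeholder):
//   Document doc = new Document(@"C:\Temp\in.doc");
//   StructureDumpVisitor visitor = new StructureDumpVisitor();
//   doc.Accept(visitor);
//   System.Console.WriteLine(visitor.Result);
```

Instead of appending to a StringBuilder, each Visit method could write a row to the corresponding database table or emit an XML element. Other node types can be covered by overriding further Visit methods in the same way.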

Hope this helps.

Best regards.

Thanks Alexey,

I’m at the stage of evaluating whether Aspose.Words can help us achieve our goals. I had already seen the example you sent, and that is why I contacted you: it is not sufficient. We need more examples so that we can go further and buy the product.

I searched the forum and did not find anything useful.

So could you please help me?

Regards

I am using the example shown in the DocumentVisitor help to write out the text of a Word document. It works okay, but to determine whether a Run object belongs to an actual Word section heading (i.e. its paragraph has the style Heading 1, Heading 2, etc. in the Word document), you recommend the following solution:

private bool CheckIsHeader(Run run)
{
    Paragraph para = (Paragraph)run.GetAncestor(typeof(Aspose.Words.Paragraph));
    return para != null && (para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1 || para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading2);
}

This only works for the predefined heading styles (Heading 1, Heading 2, etc.); it doesn’t work for user-defined styles based on a predefined heading style.

So what can I do to make the system recognise a user’s heading styles?

thanks

Hello!

You can retrieve this information via ParagraphFormat.Style.BaseStyle, which returns the base style name, not a Style object:

Style names can be mapped to Style objects using the style collection available through the Document and Style classes:

Recurse or iterate until you get Heading1, Heading2, or an empty string. If one of the expected headings is found, the corresponding paragraph is a section heading. Note that we have seen some (incorrect) documents with cyclic style dependencies, so it’s better to check for repeated names.
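
The iteration with a repetition check can be sketched without any Aspose.Words types. In the sketch below, the dictionary stands in for the lookup from a style name to its base style name; the class name, method name, and the use of the English built-in names "Heading 1" and "Heading 2" are assumptions made for this illustration. In real code, the lookup would go through the document's style collection as described above.

```csharp
using System;
using System.Collections.Generic;

public static class HeadingResolver
{
    // Hypothetical helper: baseStyleOf maps a style name to its base style name
    // ("" when the style has no base). Returns the heading style name found in
    // the chain, or null if the chain ends or loops without reaching a heading.
    public static string FindHeadingAncestor(IDictionary<string, string> baseStyleOf, string styleName)
    {
        HashSet<string> visited = new HashSet<string>();
        string current = styleName;

        // visited.Add returns false for a name already seen, which stops the
        // loop on cyclic style dependencies.
        while (!string.IsNullOrEmpty(current) && visited.Add(current))
        {
            if (current == "Heading 1" || current == "Heading 2")
                return current;

            string next;
            current = baseStyleOf.TryGetValue(current, out next) ? next : string.Empty;
        }

        return null;
    }
}
```

For example, with a map in which "MyTitle" is based on "Heading 1", FindHeadingAncestor returns "Heading 1" for "MyTitle", and null for two styles that are based on each other.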

The article about DocumentVisitor gives a general idea of how to use this class for document iteration. The application logic should depend on your particular goals. You can try creating your own visitor and ask more specific questions, showing us the code.

Regards,

Hi Victor,

Thanks for the information.

Could you please show in code how to do it, in the form of a function like the one below?

private bool CheckIsHeader(Run run)
{
    Paragraph para = (Paragraph)run.GetAncestor(typeof(Aspose.Words.Paragraph));
    return para != null && (para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1 || para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading2);
}

Thanks

Do you think it will work with my code?

private StyleIdentifier GetStyleIdentifier(Run run, Style st)
{
    Style style_temp = st;
    StyleIdentifier styleIdent = StyleIdentifier.User;
    while (style_temp.BaseStyle != string.Empty)
    {
        if (style_temp.StyleIdentifier == StyleIdentifier.Heading1 || style_temp.StyleIdentifier == StyleIdentifier.Heading2 ||
        style_temp.StyleIdentifier == StyleIdentifier.Heading3 || style_temp.StyleIdentifier == StyleIdentifier.Heading4 ||
        style_temp.StyleIdentifier == StyleIdentifier.Heading5 || style_temp.StyleIdentifier == StyleIdentifier.Heading6 ||
        style_temp.StyleIdentifier == StyleIdentifier.Heading7 || style_temp.StyleIdentifier == StyleIdentifier.Heading8 ||
        style_temp.StyleIdentifier == StyleIdentifier.Heading9)
        {
            styleIdent = style_temp.StyleIdentifier;
            break;
        }
        else if (style_temp.StyleIdentifier == StyleIdentifier.User)
        {
            style_temp = run.Document.Styles[style_temp.BaseStyle];
        }
        else
        {
            styleIdent = style_temp.StyleIdentifier;
            break;
        }
    }
    return styleIdent;
}

Hi

Thanks for your request. If you need to get the StyleIdentifier of a Run, then I think the following method could do that for you.

private StyleIdentifier GetStyleIdentifier(Run run)
{
    //Get parent paragraph of Run
    Paragraph parent = run.ParentParagraph;

    //Return StyleIdentifier of Paragraph
    return parent.ParagraphFormat.StyleIdentifier;
}

Hope this helps.

You are free to try your code on your side ;)

Best regards.