Extract tables from Word Documents

Hello,

How can I extract all tables from a Word document using Aspose.Words for Python via .NET? I would like to extract them into pandas DataFrames and also as plain Markdown text.

@ln22

To extract all tables from Word documents using Aspose.Words for Python via .NET and convert them into pandas DataFrames as well as plain markdown text, you can follow these steps:

  1. Load the Document: First, you need to load the Word document using Aspose.Words.

  2. Extract Tables: Iterate through the tables in the document and extract the data.

  3. Convert to DataFrame: Use the extracted data to create pandas DataFrames.

  4. Convert to Markdown: Format the extracted data as markdown text.

Here’s a sample code snippet demonstrating these steps:

import pandas as pd
from aspose.words import Document

# Load the Word document
doc = Document("path/to/your/document.docx")

# Initialize a list to hold DataFrames
dataframes = []

# Iterate through each table in the document
for table in doc.get_child_nodes(NodeType.TABLE, True):
    # Extract data from the table
    table_data = []
    for row in table.get_rows():
        row_data = []
        for cell in row.get_cells():
            row_data.append(cell.to_string(SaveFormat.TEXT).strip())
        table_data.append(row_data)
    
    # Create a DataFrame from the table data
    df = pd.DataFrame(table_data)
    dataframes.append(df)

    # Convert the table data to markdown format
    markdown_table = df.to_markdown(index=False)
    print(markdown_table)

# Now you have all tables as DataFrames in the dataframes list

Explanation:

  • The code starts by loading the Word document.
  • It then iterates through each table found in the document.
  • For each table, it extracts the text from each cell and constructs a list of rows.
  • A pandas DataFrame is created from the extracted data.
  • Finally, the DataFrame is converted to markdown format using the to_markdown method.

This approach allows you to work with the tables in both DataFrame format for data manipulation and markdown format for documentation or display purposes.

Make sure to install the required libraries if you haven’t done so already:

pip install pandas aspose-words

This should help you extract tables from Word documents effectively!

Is there a way for Aspose itself, rather than pandas, to create the Markdown text?

Also, the line for row in table.get_rows(): raises this error:

  • 'aspose.words.Node' object has no attribute 'get_rows'

@ln22 The code provided by AI is not quite correct. Please try using the following code:

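import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

# get_child_nodes returns generic Node objects, so cast each node
# to its concrete type before accessing rows and cells
for table in doc.get_child_nodes(aw.NodeType.TABLE, True):
    table = table.as_table()
    for row in table.rows:
        row = row.as_row()
        for cell in row.cells:
            cell = cell.as_cell()
            print(cell.to_string(aw.SaveFormat.TEXT).strip())
        print("----------------")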
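
To get from this loop to the pandas DataFrames asked for in the original question, the cell text can be collected into a list of rows first. A rough sketch, using only the calls shown above; table_to_rows and tables_to_dataframes are illustrative names, not part of Aspose.Words, and the first table row is not treated as a header here:

```python
def table_to_rows(table_node, text_format):
    """Return the cell text of one table as a list of rows (lists of str).

    `table_node` is a node from doc.get_child_nodes(aw.NodeType.TABLE, True);
    `text_format` is expected to be aw.SaveFormat.TEXT.
    """
    table = table_node.as_table()
    rows = []
    for row_node in table.rows:
        row = row_node.as_row()
        rows.append([c.as_cell().to_string(text_format).strip() for c in row.cells])
    return rows


def tables_to_dataframes(path):
    """Load a Word document and return one pandas DataFrame per table."""
    # Imported here so table_to_rows has no hard dependency on either library.
    import aspose.words as aw
    import pandas as pd

    doc = aw.Document(path)
    return [
        pd.DataFrame(table_to_rows(node, aw.SaveFormat.TEXT))
        for node in doc.get_child_nodes(aw.NodeType.TABLE, True)
    ]
```

Calling tables_to_dataframes("C:\\Temp\\in.docx") would then give one DataFrame per table in the document.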

How can I get the table as plain markdown text from Aspose?

@ln22 You can copy the table into a separate document and save the document as markdown:

import aspose.words as aw

# Load the Word document
doc = aw.Document("C:\\Temp\\in.docx")
# Get the first table and import it into a separate, empty document
tmp = aw.Document()
table = doc.get_child(aw.NodeType.TABLE, 0, True)
tmp.first_section.body.prepend_child(tmp.import_node(table, True))
# Save the temporary document as Markdown
tmp.save("C:\\Temp\\out.md", aw.SaveFormat.MARKDOWN)

Is it possible to get this text as a Python object without saving it to a file?

@ln22 You can save the output to a stream and then read the string from that stream.
https://reference.aspose.com/words/python-net/aspose.words/document/save/#bytesio_saveformat
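
A minimal sketch of that suggestion, assuming the save(BytesIO, SaveFormat) overload from the linked reference page. The helper names save_to_string and table_to_markdown are made up for illustration, and decoding with "utf-8-sig" is an assumption: it reads UTF-8 and simply drops a leading byte-order mark if the output happens to start with one.

```python
from io import BytesIO


def save_to_string(doc, save_format, encoding="utf-8-sig"):
    """Save a document to an in-memory stream and return its content as str.

    `doc` is expected to be an aw.Document (any object with a compatible
    save(stream, format) method works). "utf-8-sig" decodes UTF-8 and
    strips a leading byte-order mark if one is present.
    """
    stream = BytesIO()
    doc.save(stream, save_format)
    return stream.getvalue().decode(encoding)


def table_to_markdown(table_node):
    """Copy one table into an empty document and return it as Markdown text."""
    # Imported here so save_to_string can be reused without Aspose.Words.
    import aspose.words as aw

    tmp = aw.Document()
    tmp.first_section.body.prepend_child(tmp.import_node(table_node, True))
    return save_to_string(tmp, aw.SaveFormat.MARKDOWN)
```

For example, table_to_markdown(doc.get_child(aw.NodeType.TABLE, 0, True)) would return the first table of doc as a string, with nothing written to disk.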

Thanks that works great!

I found that when I use the code above, footnotes cited in the text of the table cells are not included in the output table. Is there a way to keep these footnotes in the output?

@ln22 As far as I can see, footnotes are exported to Markdown. Could you please attach your sample input and expected output documents here for our reference? This will help us better understand your requirements.

Sorry I was mistaken. Thanks for your help.

Hello,

I followed the code you showed and it was working great, but now I am getting a very strange error when I try to stream an Aspose table object to a string with output_aspose_format=aw.SaveFormat.HTML. Here is my code and the error:

Code:

import logging
from io import BytesIO

import aspose.words as aw


def aspose_table_to_string(aspose_w_table_node: aw.Node, output_aspose_format: aw.SaveFormat) -> str:
    """
    Takes an Aspose.Words table node and converts it into an output string
    :param aspose_w_table_node: Table node object from aspose
    :type aspose_w_table_node: aw.Node
    :param output_aspose_format: aspose save format
    :type output_aspose_format: aw.SaveFormat
    :return: table output string
    :rtype: str
    """
    try:
        # Convert the table data to specified format
        tmp = aw.Document()
        tmp.first_section.body.prepend_child(tmp.import_node(aspose_w_table_node, True))
        output_stream = BytesIO()
        tmp.save(output_stream, output_aspose_format)
        output_stream.seek(0)
        table_str = output_stream.read().decode('utf-8')
        return table_str
    except Exception as e:
        logging.info(f"Failed to convert table object to string with error {e}. "
                     "Empty string will be returned.")
        return ""

Error:

RuntimeError('Proxy error(InvalidOperationException): Image file cannot be written to disk. When saving to a stream or to a string either ImagesFolder should be specified, or custom streams should be provided via ImageSavingCallback, or ExportImagesAsBase64 should be set to true. Please see documentation for details.')

Any idea of how to fix this error?

Here is an example doc with table that throws this error:
Aspose_testing_table_with_image.docx (21.4 KB)

@ln22 There is an image in your table. You can set the HtmlSaveOptions.export_images_as_base64 property to True to export images as embedded Base64 in the HTML, as the exception message suggests.
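
A small sketch of how that option could be applied when saving to a stream. The function name save_html_with_embedded_images is hypothetical; it assumes html_options is an aw.saving.HtmlSaveOptions instance and that Document.save accepts a BytesIO together with a save options object:

```python
from io import BytesIO


def save_html_with_embedded_images(doc, html_options):
    """Save `doc` as HTML text with images inlined as Base64.

    `html_options` is expected to be an aw.saving.HtmlSaveOptions instance.
    """
    # Embed images in the HTML so nothing has to be written to disk.
    html_options.export_images_as_base64 = True
    stream = BytesIO()
    doc.save(stream, html_options)
    # "utf-8-sig" reads UTF-8 and drops a leading byte-order mark if present.
    return stream.getvalue().decode("utf-8-sig")


# Intended use (requires Aspose.Words; `tmp` is the document holding the table):
#   import aspose.words as aw
#   html = save_html_with_embedded_images(tmp, aw.saving.HtmlSaveOptions())
```

In aspose_table_to_string above, this would take the place of the tmp.save(output_stream, output_aspose_format) call when the requested format is HTML.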

That update worked great and led to even better outputs than before. Thanks so much!
