Hello,
How can I extract all tables from a Word document using Aspose.Words for Python via .NET? I would like to extract them into pandas DataFrames and also as plain Markdown text.
To extract all tables from Word documents using Aspose.Words for Python via .NET and convert them into pandas DataFrames as well as plain markdown text, you can follow these steps:
1. Load the Document: First, you need to load the Word document using Aspose.Words.
2. Extract Tables: Iterate through the tables in the document and extract the data.
3. Convert to DataFrame: Use the extracted data to create pandas DataFrames.
4. Convert to Markdown: Format the extracted data as markdown text.
Here’s a sample code snippet demonstrating these steps:
import pandas as pd
from aspose.words import Document
# Load the Word document
doc = Document("path/to/your/document.docx")
# Initialize a list to hold DataFrames
dataframes = []
# Iterate through each table in the document
for table in doc.get_child_nodes(NodeType.TABLE, True):
    # Extract data from the table
    table_data = []
    for row in table.get_rows():
        row_data = []
        for cell in row.get_cells():
            row_data.append(cell.to_string(SaveFormat.TEXT).strip())
        table_data.append(row_data)
    # Create a DataFrame from the table data
    df = pd.DataFrame(table_data)
    dataframes.append(df)
    # Convert the table data to markdown format
    markdown_table = df.to_markdown(index=False)
    print(markdown_table)
# Now you have all tables as DataFrames in the dataframes list
The markdown text is produced by the pandas to_markdown method. This approach allows you to work with the tables in both DataFrame format for data manipulation and markdown format for documentation or display purposes.
Make sure to install the required libraries if you haven’t done so already (the pandas to_markdown method also needs the tabulate package):
pip install pandas tabulate aspose-words
This should help you extract tables from Word documents effectively!
Is there a way for Aspose to create the Markdown text instead of pandas?
Also, the line for row in table.get_rows():
gives the error:
@ln22 The code provided by AI is not quite correct. Please try using the following code:
import aspose.words as aw
doc = aw.Document("C:\\Temp\\in.docx")
# Nodes come back as generic Node objects, so cast each one before use
for table in doc.get_child_nodes(aw.NodeType.TABLE, True):
    table = table.as_table()
    for row in table.rows:
        row = row.as_row()
        for cell in row.cells:
            cell = cell.as_cell()
            print(cell.to_string(aw.SaveFormat.TEXT).strip())
    print("----------------")
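To get back to the original goal of pandas DataFrames, the corrected loop above could be wrapped in a small helper that returns each table as a list of rows. This is only a sketch: the name table_to_rows and its text_format parameter are made up for illustration, and it assumes the node passed in comes from doc.get_child_nodes(aw.NodeType.TABLE, True).

```python
def table_to_rows(table_node, text_format):
    """Return the cell texts of one table as a list of rows (lists of str).

    table_node  -- a node from doc.get_child_nodes(aw.NodeType.TABLE, True)
    text_format -- expected to be aw.SaveFormat.TEXT (passed in so that this
                   hypothetical helper does not import aspose.words itself)
    """
    table = table_node.as_table()
    rows = []
    for row in table.rows:
        row = row.as_row()
        rows.append(
            [cell.as_cell().to_string(text_format).strip() for cell in row.cells]
        )
    return rows


# Hypothetical usage (requires aspose-words and pandas):
# import aspose.words as aw
# import pandas as pd
# doc = aw.Document("C:\\Temp\\in.docx")
# dataframes = [
#     pd.DataFrame(table_to_rows(node, aw.SaveFormat.TEXT))
#     for node in doc.get_child_nodes(aw.NodeType.TABLE, True)
# ]
```

Each DataFrame built this way has default integer column names; whether the first row should become the header depends on the document.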
How can I get the table as plain markdown text from Aspose?
@ln22 You can copy the table into a separate document and save the document as markdown:
# Load the Word document
doc = aw.Document("C:\\Temp\\in.docx")
# get table and put it into a separate document
tmp = aw.Document()
table = doc.get_child(aw.NodeType.TABLE, 0, True)
tmp.first_section.body.prepend_child(tmp.import_node(table, True))
# save temporary document as markdown
tmp.save("C:\\Temp\\out.md", aw.SaveFormat.MARKDOWN)
Is it possible to get this text as a Python object without saving it to a file?
@ln22 You can save the output to a stream and then read the string from the stream.
https://reference.aspose.com/words/python-net/aspose.words/document/save/#bytesio_saveformat
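A minimal sketch of that suggestion, using the save(BytesIO, SaveFormat) overload from the link above. The helper name document_to_string is hypothetical, and the sketch assumes the saved text is UTF-8 encoded.

```python
from io import BytesIO


def document_to_string(doc, save_format, encoding="utf-8-sig"):
    """Save an Aspose.Words document to memory and return the result as str."""
    stream = BytesIO()
    doc.save(stream, save_format)
    # getvalue() returns everything written, regardless of the stream position.
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present
    # (assumption: the output is UTF-8 encoded).
    return stream.getvalue().decode(encoding)


# Hypothetical usage, continuing from the temporary document `tmp` above:
# markdown_text = document_to_string(tmp, aw.SaveFormat.MARKDOWN)
```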
Thanks, that works great!
I found that when I use the code above, footnotes cited within the text of the table cells are not included in the output table. Is there a way to keep these footnotes in the output?
@ln22 As far as I can see, footnotes are exported to markdown. Could you please attach your sample input and expected output documents here for our reference? This will help us to better understand your requirements.
Sorry I was mistaken. Thanks for your help.
Hello,
I followed the code you showed and it was working great, but now I am getting a very strange error when I try to stream an Aspose table object to a string with output_aspose_format=aw.SaveFormat.HTML. Here is my code and the error:
Code:
import logging
from io import BytesIO
import aspose.words as aw

def aspose_table_to_string(aspose_w_table_node: aw.Node, output_aspose_format: aw.SaveFormat) -> str:
    """
    Takes in an Aspose.Words table object and turns it into a table output string
    :param aspose_w_table_node: Table node object from aspose
    :type aspose_w_table_node: aw.Node
    :param output_aspose_format: aspose save format
    :type output_aspose_format: aw.SaveFormat
    :return: table output string
    :rtype: str
    """
    try:
        # Convert the table data to specified format
        tmp = aw.Document()
        tmp.first_section.body.prepend_child(tmp.import_node(aspose_w_table_node, True))
        output_stream = BytesIO()
        tmp.save(output_stream, output_aspose_format)
        output_stream.seek(0)
        table_str = output_stream.read().decode('utf-8')
        return table_str
    except Exception as e:
        logging.info(f"Failed to convert table object to string with error {e}. "
                     f"Empty string will be returned.")
        return ""
Error:
RuntimeError('Proxy error(InvalidOperationException): Image file cannot be written to disk. When saving to a stream or to a string either ImagesFolder should be specified, or custom streams should be provided via ImageSavingCallback, or ExportImagesAsBase64 should be set to true. Please see documentation for details.')
Any idea of how to fix this error?
Here is an example doc with table that throws this error:
Aspose_testing_table_with_image.docx (21.4 KB)
@ln22 There is an image in your table. You can simply set the HtmlSaveOptions.export_images_as_base64 property to export images as embedded base64 in HTML, just as the exception message suggests.
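One way this could look is sketched below. The helper name document_to_html_string is hypothetical; the sketch assumes the options object is an aw.saving.HtmlSaveOptions instance passed to the save(BytesIO, SaveOptions) overload, and that the HTML output is UTF-8 encoded.

```python
from io import BytesIO


def document_to_html_string(doc, html_save_options):
    """Save a document as HTML in memory with images embedded as base64."""
    # Embed images in the HTML itself so nothing has to be written to disk.
    html_save_options.export_images_as_base64 = True
    stream = BytesIO()
    doc.save(stream, html_save_options)
    # Assumption: the HTML is UTF-8 encoded; "utf-8-sig" also drops a BOM.
    return stream.getvalue().decode("utf-8-sig")


# Hypothetical usage with a temporary document `tmp` holding the imported table:
# import aspose.words as aw
# html = document_to_html_string(tmp, aw.saving.HtmlSaveOptions())
```

Other formats such as Markdown can keep using the plain SaveFormat call, so a function like aspose_table_to_string would only need this branch when the requested format is HTML.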
That update worked great and led to even better outputs than before. Thanks so much!