Free Support Forum - aspose.com

Extract Text with styles from Word Documents

Hi,
Is there a way to extract text from a Word document while preserving styles such as font, margins, and bullets?
Also, can we extract content from tables in the proper format, i.e. with the content of multiple columns on the same line?

@Shishir_Khadka,

Thanks for your inquiry. Please ZIP and attach your input and expected output Word documents here for our reference. We will then provide you with more information about your query, along with code.

Hi @tahir.manzoor
I have attached a sample Word document and the expected output.
I am currently converting the Word document to HTML using Aspose and using a script to extract the text from the HTML.
I would like to check whether there is a way to get the text along with styles such as bold/italics, as we get when converting to HTML, together with the attached output.
Let me know.

word_text.zip (23.7 KB)

@Shishir_Khadka,

Thanks for sharing the details. Please use the following code example to get the desired output as shared in “output_text.txt”.

// Load the source Word document.
Document doc = new Document(MyDir + "sample_doc.doc");
TxtSaveOptions options = new TxtSaveOptions();
// Leave headers and footers out of the plain-text output.
options.ExportHeadersFooters = false;
doc.Save(MyDir + "output.txt", options);

Please note that HTML and TXT are different file formats. You cannot set bold/italic formatting of text in the TXT file format. Could you please elaborate on what exactly you want to achieve using Aspose.Words? We will then provide you with more information about your query.

Thank you @tahir.manzoor
We are getting the formatting by converting Word to HTML. I wanted to check if we could skip the HTML conversion and get the styles directly from the Word documents. It is okay if we can’t do that with text extraction.

But is it possible to get the text in tables on the same line instead of a new line for each column?
E.g. "a. My text" is extracted as:
a.
My text
I have attached a Word document and a sample of the expected output.

word_table_text.zip (22.5 KB)

@Shishir_Khadka,

Thanks for your inquiry. Please use TxtSaveOptions.PreserveTableLayout as shown below to get the desired output. Hope this helps you.

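// Load the source Word document.
Document doc = new Document(MyDir + "sample_doc.doc");
TxtSaveOptions options = new TxtSaveOptions();
// Try to preserve the table layout so that the cells of a row stay on one line.
options.PreserveTableLayout = true;
// Leave headers and footers out of the plain-text output.
options.ExportHeadersFooters = false;
doc.Save(MyDir + "18.5.txt", options);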
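
On the earlier question of reading styles without going through HTML: one possible approach is to walk the document's paragraphs and runs and read their formatting properties directly. The following is only a minimal sketch, not a complete extractor. The file name and the class name are placeholders, and it assumes that bold, italic, font name, and paragraph style name are the formatting of interest.

```csharp
using System;
using Aspose.Words;

class RunFormattingSketch
{
    static void Main()
    {
        // Placeholder path (assumption); substitute your own input document.
        Document doc = new Document("sample_doc.doc");

        // Visit every paragraph in the document, including those inside table cells.
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            Console.WriteLine("Paragraph style: " + para.ParagraphFormat.StyleName);

            // A run is a span of text that shares one set of character formatting.
            // para.Runs returns only the runs that are direct children of the paragraph.
            foreach (Run run in para.Runs)
            {
                Console.WriteLine(
                    "  \"" + run.Text + "\"" +
                    " bold=" + run.Font.Bold +
                    " italic=" + run.Font.Italic +
                    " font=" + run.Font.Name);
            }
        }
    }
}
```

This prints one line per run, so a paragraph with mixed formatting produces several lines. How the text and formatting are then combined into the final output depends on the format you need.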