Extract Text from Word Document using C#

AdawadkarVedant · October 26, 2021, 11:06am

Hello,
We are using Aspose.Words 18.11.0 in .NET Core.
We have a document whose pages are divided by section breaks, so the footers on certain pages are different.
We use the Document.GetText() method to get all the text from the document as a string, and then run a regex on that string to find certain tokens.

But we are facing an issue where the GetText method only returns the footer text once for each section whereas the body text from different pages is returned correctly.

For example, suppose the document has 3 sections (S1, S2, S3), where S1 has 3 pages, S2 has 2 pages, and S3 has 1 page. What I think happens is that GetText returns the footer text only from the first page of S1 but the body text from all pages in that section, and the same happens for S2 and S3. So essentially I am getting the footer only once for each of the 3 sections.
We want the text from all 6 pages, with each page's footer text included in that string.

Could you help me understand how we can achieve that?
I am attaching a sample document with this ticket.

TestFooterDocument.docx (23.1 KB)

Thank You.

tahir.manzoor · October 26, 2021, 2:35pm

@AdawadkarVedant

This is the expected behavior of Aspose.Words. It is hard to meaningfully output headers and footers to a text file or text string because plain text is not paginated, so there is no page boundary at which to repeat them.

Please use the latest version, Aspose.Words for .NET 21.10, and the following code example to extract the text of each page. Hope this helps.

Steps to Extract Text from Word Document using C#

1. Load the document into the Aspose.Words DOM.
2. Get the page count of the Word document from the Document.PageCount property.
3. Call the Document.ExtractPages method for each page of the Word document. This method returns a Document object that contains only that page.
4. Convert each extracted document to text using the Node.ToString method.

Code to Extract Text from Word Document using C#

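using System.Text;
using Aspose.Words;

// MyDir is a placeholder for the folder that contains the sample document.
Document doc = new Document(MyDir + "TestFooterDocument.docx");
StringBuilder txt = new StringBuilder();
int pageCount = doc.PageCount;
for (int i = 0; i < pageCount; i++)
{
    // ExtractPages takes a zero-based page index and the number of pages to extract.
    Document pageDoc = doc.ExtractPages(i, 1);
    // Convert the single-page document to plain text and append it.
    txt.Append(pageDoc.ToString(SaveFormat.Text));
}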
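
Since the original question mentions running a regex over the extracted text, here is a minimal sketch of doing that page by page. It assumes the text of each page has been kept as its own string in a list rather than concatenated, and that tokens look like {{Name}}. Both the token syntax and the PageTokenFinder helper are hypothetical and are not part of Aspose.Words, so the pattern should be replaced with the real one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageTokenFinder
{
    // Hypothetical token syntax {{Name}}; replace with the pattern actually in use.
    private static readonly Regex TokenPattern = new Regex(@"\{\{(\w+)\}\}");

    // pageTexts[i] is assumed to hold the plain text of page i + 1.
    // Returns every token found, paired with its one-based page number.
    public static List<(int Page, string Token)> FindTokens(IReadOnlyList<string> pageTexts)
    {
        var found = new List<(int Page, string Token)>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            foreach (Match m in TokenPattern.Matches(pageTexts[i]))
            {
                found.Add((i + 1, m.Groups[1].Value));
            }
        }
        return found;
    }
}
```

Keeping one string per page means a token found in a footer can be traced back to the page it came from, which a single concatenated string does not allow.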