Extract all Text from Word DOCX & PDF Documents using C# or Java | Convert Word to TXT | Save PDF as TXT

saurabh.arora · May 14, 2020, 10:26pm

Hi Team,

We need to extract the complete text from Word and PDF documents. Is there a direct API for this, or do I have to split the document into pages and then extract the contents?

Please help.

Thanks.

awais.hafeez · May 15, 2020, 5:42am

@saurabh.arora,

To get all the text in an MS Word document as a string, please use the following C# code with the Aspose.Words for .NET API:

using Aspose.Words;

Document doc = new Document("input.docx");
string text = doc.ToString(SaveFormat.Text);

Alternatively, you can convert the Word document to TXT format in memory and then decode the memory stream into a string:

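using System.IO;
using System.Text;
using Aspose.Words;

Document doc = new Document("input.docx");

// Save the document as plain text into an in-memory stream.
MemoryStream stream = new MemoryStream();
doc.Save(stream, SaveFormat.Text);
stream.Position = 0;

// Decode the saved bytes as UTF-8 text.
string text = Encoding.UTF8.GetString(stream.ToArray());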
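
One thing worth checking with the in-memory approach: if the saved TXT bytes begin with a UTF-8 byte order mark (this is an assumption about the output, so please verify it against your own files), Encoding.UTF8.GetString keeps that mark as an invisible first character. The sketch below uses only the .NET base class library and a hypothetical helper named ReadAllText; it simulates BOM-prefixed output rather than calling Aspose.Words.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class TextStreamDecoder
{
    // Hypothetical helper (not part of Aspose.Words): decodes a seekable stream
    // as UTF-8 and lets StreamReader drop a leading byte order mark if present.
    public static string ReadAllText(Stream stream)
    {
        stream.Position = 0;
        using (var reader = new StreamReader(stream, Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
        {
            return reader.ReadToEnd();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Simulated TXT output: a UTF-8 BOM followed by the text "Hello".
        byte[] bytes = new UTF8Encoding(true).GetPreamble()
            .Concat(Encoding.UTF8.GetBytes("Hello"))
            .ToArray();
        var stream = new MemoryStream(bytes);

        string raw = Encoding.UTF8.GetString(stream.ToArray());
        string clean = TextStreamDecoder.ReadAllText(stream);

        Console.WriteLine(raw.Length);   // 6: the BOM survives as U+FEFF
        Console.WriteLine(clean.Length); // 5: the BOM was skipped
    }
}
```

With a real document, the stream filled by doc.Save(stream, SaveFormat.Text) could be passed to the same helper in place of the simulated one.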

Regarding extracting complete text from PDF documents, please refer to the following article:

Extract Text From All the Pages of a PDF Document
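
For reference, here is a minimal sketch of what extracting text from all pages might look like with Aspose.PDF for .NET, using its TextAbsorber class. The wrapper class PdfTextExtractor, the method ExtractAllText and the file name are hypothetical names used for illustration; the article itself remains the authoritative source.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class PdfTextExtractor
{
    // Hypothetical wrapper: returns the text of every page in the PDF.
    public static string ExtractAllText(string pdfPath)
    {
        using (Document pdfDocument = new Document(pdfPath))
        {
            // The absorber collects text from each page it visits.
            TextAbsorber absorber = new TextAbsorber();
            pdfDocument.Pages.Accept(absorber);
            return absorber.Text;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // "input.pdf" is a placeholder path.
        string text = PdfTextExtractor.ExtractAllText("input.pdf");
        Console.WriteLine(text);
    }
}
```

As with the Word example, no page-by-page splitting is needed here, since the absorber is applied to the whole page collection at once.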

Hope this helps.