Free Support Forum - aspose.com

How can we get text of pages

We want to extract the text of a Word file page by page. How can we do it?

Hi Ross,


Thanks for your inquiry.

Please note that Microsoft Word documents are flow documents and are not natively laid out into lines and pages. In your case, I think you can split the document into pages by using the utility methods in the attached 'PageNumberFinder' class. For example, you can use code like the following to extract pages to an external document.

using System.Collections;
using Aspose.Words;

Document doc = new Document("Document.docx");

// Set up the document which pages will be copied to. Remove the empty section.
Document dstDoc = new Document();
dstDoc.RemoveAllChildren();

PageNumberFinder finder = new PageNumberFinder(doc);

// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);

// Copy all content including headers and footers from the specified pages into the destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPages(3, 5, NodeType.Section);

foreach (Section section in pageSections)
    dstDoc.AppendChild(dstDoc.ImportNode(section, true));

// dataDir is the output folder path and is assumed to be defined elsewhere.
dstDoc.Save(dataDir + "Document Out.docx");

In case you have further inquiries or need any help, please let us know.

Best regards,

Thank you so much for the quick reply. I tested your code with a very simple file, but it didn't work correctly.

Here is my source code:

//----------------------------------------------------------------------------------------------
public static List<string> ExtractPages(string orgFilename)
{
    var pages = new List<string>();
    var doc = new Aspose.Words.Document(orgFilename);
    PageNumberFinder finder = new PageNumberFinder(doc);
    finder.SplitNodesAcrossPages(true);
    Aspose.Words.Document dstDoc = new Aspose.Words.Document();
    for (int i = 1; i <= doc.PageCount; i++)
    {
        dstDoc.RemoveAllChildren();
        ArrayList pageSections = finder.RetrieveAllNodesOnPages(i, i + 1, NodeType.Section);
        foreach (Section section in pageSections)
            dstDoc.AppendChild(dstDoc.ImportNode(section, true));
        pages.Add(dstDoc.GetText());
    }
    return pages;
}
//----------------------------------------------------------------------------------------------
The sample file contains 4 pages, each with its own content, but the result is a list in which item 0 holds all the content of the file and the other items are empty! It seems this class merges all the content into one page.



Hi Ross,


Thanks for your inquiry. The code posted in my previous answer splits all pages into separate sections and then merges them into one output document. However, if you want to save the pages to separate Word files, please try running the following code:

Document doc = new Document(@"C:\Temp\multipage.docx");
Document dstDoc = new Document();

PageNumberFinder finder = new PageNumberFinder(doc);
finder.SplitNodesAcrossPages(true);

ArrayList pageSections = finder.RetrieveAllNodesOnPages(1, 4, NodeType.Section);

for (int i = 0; i < pageSections.Count; i++)
{
    dstDoc.RemoveAllChildren();
    dstDoc.AppendChild(dstDoc.ImportNode((Section)pageSections[i], true));
    dstDoc.Save(@"C:\Temp\out_" + i + ".docx");
}


Please let me know if I can be of any further assistance.

PS: In case you're using an older version of Aspose.Words, I would suggest upgrading to the latest version, i.e. 13.7.0. I hope this helps.

Best regards,

That code didn't work correctly either; it generated just one page for me!

I tried converting the Word file to PDF and then extracting the text from the PDF pages, and that seems to be working.

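using System.Collections.Generic;
using Aspose.Pdf.Text;

var pages = new List<string>();
var doc = new Aspose.Words.Document(Filename);
// Verbatim strings keep "\t" in the path from being read as a tab character.
doc.Save(@"c:\temp\temp.pdf");
var docpdf = new Aspose.Pdf.Document(@"c:\temp\temp.pdf");
// The Aspose.Pdf page collection is 1-based.
for (int i = 1; i <= docpdf.Pages.Count; i++)
{
    var p = docpdf.Pages[i];
    TextAbsorber textAbsorber = new TextAbsorber();
    p.Accept(textAbsorber);
    pages.Add(textAbsorber.Text);
}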

Hi Ross,


It's great that you were able to find what you were looking for using Aspose.Pdf. However, I have attached a sample console application here for your reference. Please try executing it with Aspose.Words 13.8.0 on your side and let me know how it goes. I hope this helps. Please let us know any time you have further queries.

Best regards,