How to split pages to new documents

mongooseBob · August 14, 2012, 5:51am

Hi there,

We are evaluating Aspose.Words to solve a problem where we need to:
a) Load a .doc file
b) Create a new file for each page in this .doc file
c) Save each new one-page document as a .docx

This will run as a Windows service on .NET 4.0.

Do you have any examples of doing this, if indeed it can be done with Aspose.Words?

Many thanks

alexey.noskov · August 14, 2012, 11:52am

Hi
Thanks for your request. Unfortunately, there is no direct way to find where a page starts or ends. An MS Word document is a flow document and does not contain any information about its layout into lines and pages, so Aspose.Words cannot determine page boundaries directly.
However, as a workaround, you can try the PageNumberFinder class suggested by Adam in this thread:
https://forum.aspose.com/t/58199
Best regards,

wuyanlin · January 10, 2013, 12:01am

Hi there,

I noticed this topic on splitting documents, which was posted a year ago, and wonder whether there is now a solution for this requirement.
In other words, is it now possible to split a document by page number, keep the formatting of each page the same as in the source, and save the pages as MS Word files?

I have found a similar solution in the forums here,
but I don't know how to adapt it to my requirement.

Would you please give me a hand?

Thank you for your time.

tahir.manzoor · January 10, 2013, 5:10am

Hi Yanlin,

Thanks for your inquiry. A Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically there is no “Page” concept in a Word document.

Aspose.Words uses its own rendering engine to lay out documents into pages, and we plan to expose this layout information. Your request has been linked to the appropriate feature. You will be notified as soon as this feature is supported.

In the meantime, you can try the PageNumberFinder code in the attachment. With this code you can extract only the nodes in the page range you want. Please see the code below.

Document doc = new Document(MyDir + "in.doc");

for (int page = 1; page <= doc.PageCount; page++)
{
    Document docCopy = doc.Clone();
    PageNumberFinder finder = new PageNumberFinder(docCopy);

    // Split all nodes in the document including sections so they appear on one page only.
    finder.SplitNodesAcrossPages(true);

    // Remove any nodes on pages that are outside our desired page range.
    ArrayList sectionsToRemove = finder.RetrieveAllNodesOnPages(0, page - 1, NodeType.Section);
    sectionsToRemove.AddRange(finder.RetrieveAllNodesOnPages(page + 1, doc.PageCount + 1, NodeType.Section));

    foreach (Section section in sectionsToRemove)
        section.Remove();

    // All that should remain is the content from the desired page range. Save this content to disk in the appropriate format.
    docCopy.Save(MyDir + "out_" + page + ".doc");
}

jihu31 · January 10, 2013, 8:27pm

Hi Tahir,

I have taken over this job now, as Yanlin has been assigned to another task.

I tested the code you provided, but the content of each split page does not match the corresponding page of the source document closely enough.

Here is my code:

public static void SplitDoc3()
{
    Document doc = new Document(strFilePath);
    Document result = new Document();
    DocumentBuilder dbuilderresult = new DocumentBuilder(result);
    for (int page = 1; page <= doc.PageCount; page++)
    {
        Document docCopy = doc.Clone();
        PageNumberFinder finder = new PageNumberFinder(docCopy);

        // Split all nodes in the document including sections so they appear on one page only.
        finder.SplitNodesAcrossPages(true);

        // Remove any nodes on pages that are outside our desired page range.
        ArrayList sectionsToRemove = finder.RetrieveAllNodesOnPages(0, page - 1, NodeType.Section);

        sectionsToRemove.AddRange(finder.RetrieveAllNodesOnPages(page + 1, doc.PageCount + 1, NodeType.Section));
        foreach (Section section in sectionsToRemove)
        {
            section.Remove();
        }

        // All that should remain is the content from the desired page range. Save this content to disk in the appropriate format.
        docCopy.Save(strDocDir + strDocName.Substring(0, strDocName.IndexOf('.')) + "_" + page.ToString() + ".docx", SaveFormat.Docx);
    }
}

tahir.manzoor · January 11, 2013, 6:51am

Hi Xiaohua,

Thanks for your inquiry. I have tested the scenario with your document and have found the same problem. The shared code is a workaround for this missing feature (WORDSNET-5643). We will update you via this forum thread once this feature is available.

However, I have noticed that there are some empty paragraphs at the end of the output document. You can remove these empty paragraphs with the following code snippet.

Document doc = new Document(MyDir + "Test1.doc");

// Remove the empty paragraphs if necessary.
while (doc.LastSection.Body.LastParagraph.ToString(SaveFormat.Text) != "")
{
    if (doc.LastSection.Body.LastParagraph.PreviousSibling != null &&
    doc.LastSection.Body.LastParagraph.PreviousSibling.NodeType != NodeType.Paragraph)
        break;

    doc.LastSection.Body.LastParagraph.Remove();

    // If the current section becomes empty, we should remove it.
    if (!doc.LastSection.Body.HasChildNodes)
        doc.LastSection.Remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.HasChildNodes)
        break;
}

// Save output.
doc.Save(MyDir + "out.doc");

jihu31 · January 11, 2013, 8:57am

Hi Tahir,

Thank you for the replies.

What is your schedule for this feature (WORDSNET-5643)? This case is urgent for our team, and we have to test all features related to MS Word document operations in a short time.

I tested your code with Demo.doc, which is attached here.

Something seems to be wrong with the while statement in your code. I get a message like this: “The name ‘SaveFormat’ does not exist in the current context”.

The contents of the saved output documents also do not satisfy my requirement.

Please double-check the statements used to remove the empty paragraphs.

Should a paragraph that starts with “\f” be removed? How many kinds of paragraphs like this should be removed?

What happens here?

My code here:

public static void SplitDoc()
{
    Document doc = new Document(strFilePath);

    // Pages are numbered from 1.
    for (int iPage = 1; iPage <= doc.PageCount; iPage++)
    {
        // Copy the document, then split the copy.
        Document docCopy = doc.Clone();
        PageNumberFinder finder = new PageNumberFinder(docCopy);

        // Split all nodes in the document including sections so they appear on one page only.
        finder.SplitNodesAcrossPages(true);

        // Remove any nodes on pages that are outside our desired page range:
        // first everything up to the previous page...
        ArrayList sectionsToRemove = finder.RetrieveAllNodesOnPages(0, iPage - 1, NodeType.Section);

        // ...then everything from the next page to the last page.
        sectionsToRemove.AddRange(finder.RetrieveAllNodesOnPages(iPage + 1, doc.PageCount + 1, NodeType.Section));
        foreach (Section section in sectionsToRemove)
        {
            section.Remove();
        }

        // Remove the empty paragraphs if necessary.
        while (docCopy.LastSection.Body.LastParagraph.ToString(SaveFormat.Text) != "")
        {
            if (docCopy.LastSection.Body.LastParagraph.PreviousSibling != null &&
                docCopy.LastSection.Body.LastParagraph.PreviousSibling.NodeType != NodeType.Paragraph)
            {
                break;
            }

            docCopy.LastSection.Body.LastParagraph.Remove();

            // If the current section becomes empty, we should remove it.
            if (!docCopy.LastSection.Body.HasChildNodes)
            {
                docCopy.LastSection.Remove();
            }

            // We should exit the loop if the document becomes empty.
            if (!docCopy.HasChildNodes)
            {
                break;
            }
        }

        // Save while docCopy and iPage are still in scope.
        docCopy.Save(strDocDir + "out_" + iPage + ".docx");
    }
}

tahir.manzoor · January 14, 2013, 2:51am

Hi Xiaohua,

Thanks for your inquiry. I am afraid that feature WORDSNET-5643 has been postponed to a later date because of other important issues and new features. We will inform you as soon as there are any further developments. We apologize for the inconvenience.

jihu31:
Please double-check the statements used to remove the empty paragraphs.
Should a paragraph that starts with “\f” be removed? How many kinds of paragraphs like this should be removed?

Regarding the workaround code, I will check it and update you soon. Thanks for your patience.
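
On the compiler message quoted earlier in this thread: “The name ‘SaveFormat’ does not exist in the current context” usually points to a missing using directive rather than a fault in the loop logic. SaveFormat and NodeType are enums in the Aspose.Words namespace, and ArrayList lives in System.Collections. A minimal sketch of the directives the snippets in this thread rely on, assuming PageNumberFinder is compiled into the same project from the attached helper code (its namespace is not shown here):

```csharp
using System.Collections;   // ArrayList
using Aspose.Words;         // Document, Section, NodeType, SaveFormat
// PageNumberFinder comes from the attached helper code. If it is declared
// inside a namespace, add a using directive for that namespace as well
// (assumption: the namespace is not shown in this thread).
```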

tahir.manzoor · January 16, 2013, 8:18am

Hi Xiaohua,

Thanks for your patience, and sorry for the inconvenience.

The condition in the while loop was wrong; I have corrected it (see below). Please use the following code snippet to remove the empty paragraphs at the end of a document.

Document doc = new Document(MyDir + "in.doc");

// Remove the empty paragraphs if necessary.
while (doc.LastSection.Body.LastParagraph.ToString(SaveFormat.Text).Trim() == "")
{
    if (doc.LastSection.Body.LastParagraph.PreviousSibling != null &&
    doc.LastSection.Body.LastParagraph.PreviousSibling.NodeType != NodeType.Paragraph)
        break;

    doc.LastSection.Body.LastParagraph.Remove();

    // If the current section becomes empty, we should remove it.
    if (!doc.LastSection.Body.HasChildNodes)
        doc.LastSection.Remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.HasChildNodes)
        break;
}

doc.Save(MyDir + "out.docx");

msankeshwari · January 29, 2013, 11:09pm

After adding the code to remove empty paragraphs at the end, the last image in the document is missing after conversion. This happens when an image is the last item in the document.

tahir.manzoor · January 30, 2013, 9:08am

Hi Xiaohua,

Thanks for your inquiry. Could you please attach your input Word document here for testing? I will investigate the issue on my side and provide you more information.
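
One plausible cause of the missing image, not confirmed in this thread: a paragraph that holds only an image exports as empty text, so a check based purely on ToString(SaveFormat.Text).Trim() treats it as empty and removes it. A minimal sketch of a stricter check follows. The class and method names are hypothetical, and it assumes images are stored as Shape or GroupShape nodes inside the paragraph.

```csharp
using Aspose.Words;

public static class TrailingParagraphTrimmer   // hypothetical helper name
{
    // True only when the paragraph has no visible text and no shapes
    // (inline or floating images, text boxes, grouped shapes).
    private static bool IsEmpty(Paragraph para)
    {
        bool hasText = para.ToString(SaveFormat.Text).Trim().Length > 0;
        bool hasShapes = para.GetChildNodes(NodeType.Shape, true).Count > 0
                      || para.GetChildNodes(NodeType.GroupShape, true).Count > 0;
        return !hasText && !hasShapes;
    }

    public static void RemoveTrailingEmptyParagraphs(Document doc)
    {
        while (doc.LastSection != null)
        {
            Body body = doc.LastSection.Body;
            Paragraph last = body.LastParagraph;

            // Stop at the first paragraph that carries text or a shape.
            if (last == null || !IsEmpty(last))
                break;

            // Keep the paragraph if it follows a non-paragraph node such as a table.
            if (last.PreviousSibling != null &&
                last.PreviousSibling.NodeType != NodeType.Paragraph)
                break;

            last.Remove();

            // Drop the section once its body has no content left.
            if (!body.HasChildNodes)
                doc.LastSection.Remove();
        }
    }
}
```

Calling TrailingParagraphTrimmer.RemoveTrailingEmptyParagraphs(docCopy) before docCopy.Save(...) would replace the while loop shown earlier in the thread.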

aspose.notifier · February 3, 2013, 11:54am

The issues you have found earlier (filed as WORDSNET-2978) have been fixed in this .NET update and this Java update.


jihu31 · July 8, 2013, 7:46am

Will you please give me some sample code?

Thank you!

tahir.manzoor · July 9, 2013, 11:16am

Hi Xiaohua,

Thanks for your inquiry. Please check the DocumentLayoutHelper and EnumerateLayoutElements sample projects in the offline samples pack. These samples demonstrate how to work with the layout elements of a document and access its pages, lines, spans, etc.

Hope this helps you. Please let us know if you have any more queries.
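
For readers without the samples pack, here is a minimal sketch of reading page information through the layout API. It assumes the LayoutCollector class in the Aspose.Words.Layout namespace is available in your version; the class name PageReport and the file name in.doc are placeholders, not taken from the samples.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Layout;

public static class PageReport   // hypothetical name
{
    public static void Main()
    {
        Document doc = new Document("in.doc");   // placeholder path

        // The collector records node-to-page mappings while the layout is built.
        LayoutCollector collector = new LayoutCollector(doc);
        doc.UpdatePageLayout();

        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            // Page indexes reported by the collector are 1-based.
            int startPage = collector.GetStartPageIndex(para);
            int endPage = collector.GetEndPageIndex(para);
            Console.WriteLine("Paragraph spans pages {0} to {1}", startPage, endPage);
        }
    }
}
```

A paragraph whose start and end page differ crosses a page boundary, which is the case the earlier PageNumberFinder workaround handled by splitting nodes.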

aspose.notifier · April 11, 2021, 7:09am

The issues you have found earlier (filed as WORDSNET-5643) have been fixed in this Aspose.Words for .NET 21.4 update and this Aspose.Words for Java 21.4 update.
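
Recent Aspose.Words for .NET releases provide Document.ExtractPages(index, count), which takes a zero-based page index. This thread does not state whether that method is what the WORDSNET-5643 fix delivered, so treat the following as a sketch of the original request (one .docx per page) under the assumption that ExtractPages is available in your version. The class, method, and parameter names are illustrative.

```csharp
using System.IO;
using Aspose.Words;

public static class PageSplitter   // hypothetical name
{
    public static void SplitIntoPages(string inputPath, string outputDir)
    {
        Document doc = new Document(inputPath);
        string baseName = Path.GetFileNameWithoutExtension(inputPath);
        int pageCount = doc.PageCount;

        for (int i = 0; i < pageCount; i++)
        {
            // Extract a single page; the index is zero-based.
            Document onePage = doc.ExtractPages(i, 1);

            // Name output files with 1-based page numbers, e.g. in_1.docx.
            string outPath = Path.Combine(outputDir, baseName + "_" + (i + 1) + ".docx");
            onePage.Save(outPath, SaveFormat.Docx);
        }
    }
}
```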