Splitting Word document into different files

RedBandit · May 8, 2008, 6:27am

Hello.

We need to split a Word document into separate files, one for each entry in the TOC. Can this be done with your framework?
We are using Java as our programming language.

Thanks a lot.

alexey.noskov · May 8, 2008, 8:23am

Hi
Thanks for your request. Yes, I think you can achieve this using Aspose.Words. For example, see the attached document and the following code, which starts a new output document at every Heading 1 paragraph:

//Open document
Document doc = new Document("in.doc");
//Get collection of Paragraphs
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
Paragraph par = null;
int docIndex = 0;
//Loop through all paragraphs in the document
for (int parIndex = 0; parIndex < paragraphs.getCount(); parIndex++)
{
    par = (Paragraph)paragraphs.get(parIndex);
    //If Paragraph style = HEADING_1 then copy content to the new document
    if (par.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_1)
    {
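        //Create new document and remove the empty paragraph it contains by default
        Document outDoc = new Document();
        outDoc.getFirstSection().getBody().removeAllChildren();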
        Node currentNode = par;
        while (currentNode != null)
        {
            //Import Node
            Node importedNode = outDoc.importNode(currentNode, true, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            //insert node into the new document
            outDoc.getFirstSection().getBody().appendChild(importedNode);
            //If there is no next sibling, move to the next section
            if (currentNode.getNextSibling() == null)
            {
                //Get next section
                Section nextSection = (Section)currentNode.getAncestor(NodeType.SECTION).getNextSibling();
                //If there is a next section, continue with its first child
                if (nextSection != null)
                    currentNode = nextSection.getBody().getFirstChild();
                else
                    break; //else exit from while
            }
            else
            {
                //Get next node
                currentNode = currentNode.getNextSibling();
            }
            //Check if current node is paragraph (it is null if the next section's body is empty)
            if (currentNode != null && currentNode.getNodeType() == NodeType.PARAGRAPH)
            {
                //Check if its style is HEADING_1
                if (((Paragraph)currentNode).getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_1)
                {
                    //If so then set par index and exit while
                    parIndex = paragraphs.indexOf(currentNode) - 1;
                    break;
                }
            }
        }
        //Save output document
        outDoc.save("Section_" + String.valueOf(docIndex) + ".doc");
        //increase docIndex
        docIndex++;
    }
}
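
To make the control flow of the loop above easier to follow, here is a minimal, library-free sketch of the same idea: walk a flat list of paragraphs and start a new chunk at every Heading 1. The `Para` and `HeadingSplitter` names are hypothetical stand-ins, not Aspose.Words types, and the model ignores sections and formatting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a paragraph: just a style name and its text.
final class Para {
    final String style;
    final String text;

    Para(String style, String text) {
        this.style = style;
        this.text = text;
    }
}

final class HeadingSplitter {
    // Groups paragraphs into chunks; each chunk starts at a "Heading 1" paragraph.
    // Paragraphs that precede the first heading are skipped, as in the loop above.
    static List<List<Para>> splitByHeading1(List<Para> paragraphs) {
        List<List<Para>> chunks = new ArrayList<List<Para>>();
        List<Para> current = null;
        for (Para p : paragraphs) {
            if ("Heading 1".equals(p.style)) {
                // A heading closes the previous chunk and opens a new one.
                current = new ArrayList<Para>();
                chunks.add(current);
            }
            if (current != null) {
                current.add(p);
            }
        }
        return chunks;
    }
}
```

Each inner list corresponds to one output file (Section_0.doc, Section_1.doc, and so on) in the real code.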

I hope this could help you.
Best regards.

RedBandit · May 9, 2008, 2:38am

Hi.

Thanks for the fast reply. I tried the above code and it works fine for small documents. We have a very large document (about 130 MB) that has to be split into several documents. When I load that document, it seems that only the first part of it is parsed. When I copy a part of the document into a new Word doc and parse that one, it works fine again.

Is this a restriction of the evaluation version?

Thanks.

Thomas

alexey.noskov · May 9, 2008, 4:32am

Hi
Thanks for your request. Aspose.Words in evaluation mode limits the maximum document size to several hundred paragraphs. Please see the following link to learn more about limitations.
https://docs.aspose.com/words/net/licensing/
If you want to test Aspose.Words without evaluation version limitations, you can also request a 30-Day Temporary License. See the following link.
https://purchase.aspose.com/temporary-license
Best regards.

RedBandit · May 9, 2008, 7:35am

Ok. With the temporary license it works fine.

Now there is one more open point. How can I copy the page layout (e.g. landscape), the page style and the background image to the split documents?

Thanks.

Thomas

alexey.noskov · May 9, 2008, 9:16am

Hi
Thanks for your request. Yes, of course you can achieve this. Please try using the following code:

//Open document
Document doc = new Document("in.doc");
//Get collection of Paragraphs
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
Paragraph par = null;
int docIndex = 0;
//Loop through all paragraphs in the document
for (int parIndex = 0; parIndex < paragraphs.getCount(); parIndex++)
{
    par = (Paragraph)paragraphs.get(parIndex);
    //If Paragraph style = HEADING_1 then copy content to the new document
    if (par.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_1)
    {
        //Create new document
        Document outDoc = new Document();
        //Remove the default section from the new document
        outDoc.removeAllChildren();
        Node currentNode = par;
        //Import the source section with its page setup and headers/footers, then clear its body
        Section srcSect = (Section)outDoc.importNode(currentNode.getAncestor(NodeType.SECTION), true, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        outDoc.appendChild(srcSect);
        srcSect.getBody().removeAllChildren();
        while (currentNode != null)
        {
            //Import Node
            Node importedNode = outDoc.importNode(currentNode, true, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            //insert node into the new document
            outDoc.getLastSection().getBody().appendChild(importedNode);
            //If there is no next sibling, move to the next section
            if (currentNode.getNextSibling() == null)
            {
                //Get next section
                Section nextSection = (Section)currentNode.getAncestor(NodeType.SECTION).getNextSibling();
                //If there is a next section, copy it (with an emptied body) and continue with its first child
                if (nextSection != null)
                {
                    Section newSect = (Section)outDoc.importNode(nextSection, true, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                    outDoc.appendChild(newSect);
                    newSect.getBody().removeAllChildren();
                    currentNode = nextSection.getBody().getFirstChild();
                }
                else
                {
                    break; //else exit from while
                }
            }
            else
            {
                //Get next node
                currentNode = currentNode.getNextSibling();
            }
            //Check if current node is paragraph (it is null if the next section's body is empty)
            if (currentNode != null && currentNode.getNodeType() == NodeType.PARAGRAPH)
            {
                //Check if its style is HEADING_1
                if (((Paragraph)currentNode).getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_1)
                {
                    //If so then set par index and exit while
                    parIndex = paragraphs.indexOf(currentNode) - 1;
                    break;
                }
            }
        }
        //Save output document
        outDoc.save("Section_" + String.valueOf(docIndex) + ".doc");
        //increase docIndex
        docIndex++;
    }
}
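
The difference from the first version is that every output document now receives a copy of the source section's settings before any content is appended. A library-free sketch of that idea follows; `Sec` and `SectionAwareSplitter` are hypothetical types, the `layout` string merely stands in for page setup, headers and footers, and a "# " prefix is an assumed marker for a Heading 1 paragraph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a section: a layout label plus its paragraph texts.
final class Sec {
    final String layout;
    final List<String> paragraphs;

    Sec(String layout, List<String> paragraphs) {
        this.layout = layout;
        this.paragraphs = paragraphs;
    }
}

final class SectionAwareSplitter {
    // Assumption for this sketch only: a "# " prefix marks a Heading 1 paragraph.
    static boolean isHeading1(String paragraph) {
        return paragraph.startsWith("# ");
    }

    // Returns one list of sections per output document.
    static List<List<Sec>> split(List<Sec> source) {
        List<List<Sec>> docs = new ArrayList<List<Sec>>();
        List<Sec> currentDoc = null;
        for (Sec src : source) {
            Sec copy = null;
            if (currentDoc != null) {
                // The open chunk crosses a section boundary: copy the layout first.
                copy = new Sec(src.layout, new ArrayList<String>());
                currentDoc.add(copy);
            }
            for (String p : src.paragraphs) {
                if (isHeading1(p)) {
                    // New output document, starting with a copy of this section's layout.
                    currentDoc = new ArrayList<Sec>();
                    docs.add(currentDoc);
                    copy = new Sec(src.layout, new ArrayList<String>());
                    currentDoc.add(copy);
                }
                if (copy != null) {
                    copy.paragraphs.add(p);
                }
            }
        }
        return docs;
    }
}
```

Like the loop above, this sketch leaves an empty trailing section in an output document when a heading happens to be the first paragraph of the next section.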

I hope this could help you.
Best regards.

RedBandit · May 13, 2008, 2:43am

Thanks! That did it.

But there is another problem. At the beginning of each split document there is a page break. How can I remove it when creating the splits? Should I just remove the first node/paragraph node?

//Thomas

alexey.noskov · May 13, 2008, 4:34am

Hi
Thanks for your inquiry. Maybe this occurs because there are page breaks between sections in your document. Could you attach your document or part of the document for testing? I will investigate this issue and try to help you.
Best regards.

RedBandit · May 14, 2008, 2:07am

Ok. I created a document that shows the problem.

When splitting the document using the above code, the first split has a carriage return at the beginning and the second one has a page break.

Can this be solved?

Thanks.

Thomas Ospelt

alexey.noskov · May 14, 2008, 4:55am

Hi
Thank you for the additional information. I think you can resolve this problem by using the following code in place of the save step at the end of the loop:

//Get first Paragraph
Paragraph firstPar = outDoc.getFirstSection().getBody().getFirstParagraph();
//Remove PageBreaks in the first paragraph
for (int runIndex = 0; runIndex < firstPar.getRuns().getCount(); runIndex++)
{
    firstPar.getRuns().get(runIndex).setText(firstPar.getRuns().get(runIndex).getText().replace("\f", ""));
}
//Save output document
outDoc.save("Section_" + String.valueOf(docIndex) + ".doc");
//increase docIndex
docIndex++;
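
The core of this fix is a plain string replacement on each run's text. A tiny sketch of just that step, without the library, assuming (as the code above does) that a manual page break appears as a form feed character in the run text; `PageBreakStripper` is a hypothetical helper name:

```java
import java.util.ArrayList;
import java.util.List;

final class PageBreakStripper {
    // Form feed, the character the code above removes from each run.
    static final String PAGE_BREAK = "\f";

    // Returns copies of the run texts with every page break character removed.
    static List<String> stripPageBreaks(List<String> runTexts) {
        List<String> cleaned = new ArrayList<String>();
        for (String text : runTexts) {
            cleaned.add(text.replace(PAGE_BREAK, ""));
        }
        return cleaned;
    }
}
```

This only covers page breaks stored as characters in the text; a break that comes from paragraph or section settings would not be affected by it.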

Hope this helps.
Best regards.