Wrong output when dividing a Word document into single-page documents

saurabh.arora · November 28, 2017, 8:05am

Hi,

I am trying to divide a document into single-page documents. I used the document splitter, but it gives incorrect output. I am attaching my main document (Main.docx) and the output, which is 5 pages but should be 4 (3.docx is incorrect).

test1.zip (45.3 KB)

I am also attaching my code files:

code.zip (6.0 KB)

Please help. We are stuck on this.

tahir.manzoor · November 28, 2017, 4:40pm

@saurabh.arora,

Thanks for your inquiry. We are investigating this issue and will get back to you soon.

saurabh.arora · November 29, 2017, 6:38am

Thanks Tahir.

Can you please give us a timeline for when this will be resolved? We have to go live in production in the next 2 days. It would be great if you could provide the solution as soon as possible. Thanks for all the help.

tahir.manzoor · November 29, 2017, 3:47pm

@saurabh.arora,

Thanks for your patience. We have managed to reproduce this issue on our end. The issue is in the PageSplitter utility. Please allow us some time to investigate it. We will fix it as soon as possible and provide you with the modified code.

We apologize for the inconvenience.

saurabh.arora · December 1, 2017, 8:34am

Hi Tahir,

Is there any update?

We are in dire need of the fix.

tahir.manzoor · December 1, 2017, 3:43pm

@saurabh.arora,

Thanks for your inquiry. This issue is currently in the analysis phase. Once we have analyzed it, we will provide you with an ETA. Thanks for your patience and understanding.

saurabh.arora · December 4, 2017, 8:11am

Hi Tahir,

Is there any update? If possible, can you provide a workaround for the problem?

tahir.manzoor · December 4, 2017, 4:33pm

@saurabh.arora,

Thanks for your inquiry. Please try the following code example. Hope this helps you.

Document doc = new Document(MyDir + "Main.docx");
int pagecount = doc.getPageCount();

NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
LayoutCollector collector = new LayoutCollector(doc);

for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs) {
    for (Run run : paragraph.getRuns())
    {
        if(collector.getStartPageIndex(run) != collector.getEndPageIndex(run))
        {
            doc.getRange().replace(" ", " ", new FindReplaceOptions());
        }
    }
}

doc.updatePageLayout();
collector = new LayoutCollector(doc);
DocumentPageSplitter splitter = new DocumentPageSplitter(collector);
for (int page = 1; page <= pagecount; page++) {
    Document newDoc = splitter.GetDocumentOfPage(page);
    newDoc.save(MyDir + "output" + page + ".docx");
}

saurabh.arora · December 4, 2017, 7:02pm

Hi Tahir,

Thanks for the reply. I tried the code, but the problem persists. Please see 3.docx (split into 2 pages). I am attaching the code and the document folder.

code.zip (6.1 KB)

test1.zip (42.4 KB)

Please suggest.

Thanks

tahir.manzoor · December 5, 2017, 7:22am

@saurabh.arora,

Thanks for your patience. Unfortunately, this issue is caused by the PageSplitter utility. We will inform you via this forum thread once it is resolved.

We apologize for the inconvenience.

saurabh.arora · December 11, 2017, 7:36am

Hi @tahir.manzoor,

Any update? We are really stuck on this.

tahir.manzoor · December 11, 2017, 2:57pm

@saurabh.arora,

Thanks for your inquiry. In your case, we suggest using Aspose.Words to convert each page of the Word document to PDF, and then using Aspose.Pdf to convert the PDF document back to a Word format (DOCX in the example below). Please check the following code examples. Hope this helps you.

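// Aspose.Words: save only the third page (zero-based index 2) as a PDF.
Document doc = new Document(MyDir + "Main.doc");
using (Stream stream = File.Create(MyDir + "main_page3.pdf"))
{
    PdfSaveOptions options = new PdfSaveOptions();
    options.PageIndex = 2;
    options.PageCount = 1;
    doc.Save(stream, options);
}

// Aspose.Pdf: load the single-page PDF from the same folder and save it as DOCX.
// Types are fully qualified because Document exists in both Aspose.Words and Aspose.Pdf.
var pdfDoc = new Aspose.Pdf.Document(MyDir + "main_page3.pdf");
var saveOptionsX = new Aspose.Pdf.DocSaveOptions
{
    Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow,
    Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX,
};
pdfDoc.Save(MyDir + "main_page3.docx", saveOptionsX);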

saurabh.arora · December 13, 2017, 6:27am

Hi Tahir,

Thanks for the reply. This is something of a hack and will cause performance issues. Is there a plan to fix the page splitting utility itself?

tahir.manzoor · December 13, 2017, 3:29pm

@saurabh.arora,

Thanks for your inquiry. Please note that an MS Word document is a flow document and does not contain any information about its layout into lines and pages. Technically, therefore, there is no concept of a “page” or a “line” in a Word document. Pages and lines are created by Microsoft Word on the fly, so it is sometimes hard to reproduce the exact page layout.

If you do not want to use the solution suggested in my previous post, you can work around this issue with the following code.

Document doc = new Document(MyDir + "Main.docx");

NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
LayoutCollector collector = new LayoutCollector(doc);

for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs) {
    for (Run run : paragraph.getRuns())
    {
        if(collector.getStartPageIndex(run) != collector.getEndPageIndex(run))
        {
            doc.getRange().replace(ControlChar.NON_BREAKING_SPACE, " ", new FindReplaceOptions());
            doc.getRange().replace(" ", " ", new FindReplaceOptions());
        }
    }
}

doc.updatePageLayout();
int pagecount = doc.getPageCount();

collector = new LayoutCollector(doc);
DocumentPageSplitter splitter = new DocumentPageSplitter(collector);

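for (int page = 1; page <= pagecount; page++) {
    Document newDoc = splitter.GetDocumentOfPage(page);
    // Remove a leading empty paragraph from the extracted page, if there is one.
    // getFirstParagraph() returns null when the body has no paragraph, so guard against that.
    Paragraph first = newDoc.getFirstSection().getBody().getFirstParagraph();
    if(first != null && first.toString(SaveFormat.TEXT).trim().length() == 0)
    {
        first.remove();
    }
    newDoc.save(MyDir + "output" + page + ".docx");
}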
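
The check `getStartPageIndex(run) != getEndPageIndex(run)` above only makes sense after layout has been computed, because page numbers are derived rather than stored. The following is a minimal, library-free sketch of that idea and not Aspose code: `ToyPaginator`, `Span`, and the fixed lines-per-page model are all hypothetical simplifications made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (an assumption for illustration): a paragraph is just a line count,
// and a page holds a fixed number of lines. Real layout engines are far more complex.
class ToyPaginator {

    /** 1-based start and end page of one paragraph after layout. */
    static class Span {
        final int startPage;
        final int endPage;

        Span(int startPage, int endPage) {
            this.startPage = startPage;
            this.endPage = endPage;
        }

        /** Mirrors the start-page != end-page test used in the thread. */
        boolean crossesPageBreak() {
            return startPage != endPage;
        }

        @Override
        public String toString() {
            return startPage + "-" + endPage;
        }
    }

    /** Computes page spans on the fly from line counts and the page capacity. */
    static List<Span> layout(int[] paragraphLineCounts, int linesPerPage) {
        if (linesPerPage < 1) {
            throw new IllegalArgumentException("linesPerPage must be at least 1");
        }
        List<Span> spans = new ArrayList<>();
        int linesUsed = 0; // total lines placed before the current paragraph
        for (int lines : paragraphLineCounts) {
            if (lines < 1) {
                throw new IllegalArgumentException("each paragraph needs at least 1 line");
            }
            int startPage = linesUsed / linesPerPage + 1;
            linesUsed += lines;
            int endPage = (linesUsed - 1) / linesPerPage + 1;
            spans.add(new Span(startPage, endPage));
        }
        return spans;
    }

    public static void main(String[] args) {
        int[] paragraphs = {4, 5, 3};
        // The same content yields different page boundaries when the page capacity changes.
        System.out.println(layout(paragraphs, 10)); // prints [1-1, 1-1, 1-2]
        System.out.println(layout(paragraphs, 8));  // prints [1-1, 1-2, 2-2]
    }
}
```

In this toy model, changing nothing but the page capacity moves the page break from the third paragraph to the second, which is why any edit to the document has to be followed by a fresh layout pass before page indices are read again.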

saurabh.arora · April 10, 2018, 7:10am

Hi Tahir,

I tried splitting the Word document via the PDF route, but it throws an exception for large documents. Please find my code:

public class WordSplitViaPdf {

    public static void main(String... args) throws Exception {

        com.aspose.pdf.License license = new com.aspose.pdf.License();
        com.aspose.words.License wordLicense = new com.aspose.words.License();
        license.setLicense(new java.io.FileInputStream("/home/sauravarora/Desktop/aspose-licence"));
        wordLicense.setLicense("/home/sauravarora/Desktop/aspose-licence");

        PdfSaveOptions options = new PdfSaveOptions();
        com.aspose.pdf.DocSaveOptions docSaveOptions = new com.aspose.pdf.DocSaveOptions();
        docSaveOptions.setMode(com.aspose.pdf.DocSaveOptions.RecognitionMode.Flow);
        docSaveOptions.setFormat(com.aspose.pdf.DocSaveOptions.DocFormat.DocX);

        Document document = new Document("/home/sauravarora/data/1005/bulkupload/contracttemplate/1404/1523299617685/docWithoutContent.docx");
        String folder = "/home/sauravarora/data/1005/bulkupload/contracttemplate/1404/1523299617685";

        for (int page = 1; page <= document.getPageCount(); page++) {
            options.setPageIndex(page - 1);
            options.setPageCount(1);
            FileOutputStream fileOutputStreamForPdf = new FileOutputStream(folder + "/" + page + ".pdf");
            document.save(fileOutputStreamForPdf, options);
            FileOutputStream fileOutputStreamForWord = new FileOutputStream(folder + "/" + page + ".docx");
            FileInputStream fileInputStreamForPdf = new FileInputStream(folder + "/" + page + ".pdf");
            com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(fileInputStreamForPdf);
            pdfDocument.save(fileOutputStreamForWord, docSaveOptions);
        }
    }
}

Here is the exception trace:

Exception in thread "main" java.lang.IllegalStateException: Infinite loop detected.
at com.aspose.words.zzYX9.zzAh(Unknown Source)
at com.aspose.words.zzYX9.zzRE(Unknown Source)
at com.aspose.words.zz9E.zzXv(Unknown Source)
at com.aspose.words.zz9F.zz5E(Unknown Source)
at com.aspose.words.zzZN8.zz5E(Unknown Source)
at com.aspose.words.zz1V.zzZWW(Unknown Source)
at com.aspose.words.zz1V.zzZ(Unknown Source)
at com.aspose.words.zz1V.zzZG(Unknown Source)
at com.aspose.words.Document.zzZ(Unknown Source)
at com.aspose.words.Document.zzZ(Unknown Source)
at com.aspose.words.Document.zzZ(Unknown Source)
at com.aspose.words.Document.save(Unknown Source)
at com.aspose.words.examples.cellsexamples.WordSplitViaPdf.main(WordSplitViaPdf.java:30)

I am attaching the document for your reference. Please help.

docWithoutContent.docx.zip (12.6 KB)
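
Separately from the exception, which is thrown inside `document.save`, the loop above opens three streams per page and never closes them, and it reads each PDF back while its output stream is still open. The following is a minimal, stdlib-only sketch of one way to structure that per-page write-then-read step. It is not a fix for the "Infinite loop detected" error. `PerPageFiles`, `StreamWriter`, and `StreamReader` are hypothetical names that stand in for the library's save and load calls.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PerPageFiles {

    /** Hypothetical stand-in for a call that saves something to an output stream. */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Hypothetical stand-in for a call that loads something from an input stream. */
    interface StreamReader<T> {
        T readFrom(InputStream in) throws IOException;
    }

    /** Converts a 1-based page number to the 0-based index used with setPageIndex above. */
    static int toPageIndex(int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("page numbers start at 1");
        }
        return pageNumber - 1;
    }

    /** Writes the file, closes it, and only then opens it again for reading. */
    static <T> T writeThenRead(Path file, StreamWriter writer, StreamReader<T> reader)
            throws IOException {
        try (OutputStream out = Files.newOutputStream(file)) {
            writer.writeTo(out);
        } // the output stream is flushed and closed here
        try (InputStream in = Files.newInputStream(file)) {
            return reader.readFrom(in);
        } // the input stream is closed here, even if reading throws
    }

    public static void main(String[] args) throws IOException {
        Path folder = Files.createTempDirectory("pages");
        for (int page = 1; page <= 3; page++) {
            Path pdf = folder.resolve(page + ".pdf");
            int index = toPageIndex(page);
            // Placeholder payload; in the real loop this is where the document would be saved.
            Integer size = writeThenRead(
                    pdf,
                    out -> out.write(("page index " + index).getBytes(StandardCharsets.UTF_8)),
                    in -> in.readAllBytes().length);
            System.out.println(pdf.getFileName() + ": " + size + " bytes");
        }
    }
}
```

With try-with-resources, each stream is closed at the end of its block regardless of exceptions, so a long document does not accumulate open file handles across loop iterations.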

tahir.manzoor · April 10, 2018, 11:33am

@saurabh.arora,

Thanks for your inquiry. We have answered your query here in this post. Please follow that thread for further proceedings.

saurabh.arora · April 10, 2018, 12:29pm

Hi Tahir,

Thanks for the reply. I was just curious whether the Aspose team has resolved the problem with the Word splitting code. The PDF approach is not optimal and leads to performance issues. I would be glad if we could solve the problem with the Word splitting code.

Thanks,
Saurabh

tahir.manzoor · April 10, 2018, 3:49pm

@saurabh.arora,

Thanks for your inquiry. We logged a feature request in our issue tracking system to provide a built-in method in Aspose.Words to split documents into pages. The ID of this issue is WORDSNET-16228. Your thread has been linked to this issue and you will be notified via this thread as soon as this issue is resolved. We apologize for the inconvenience.

aspose.notifier · October 25, 2020, 7:31am

The issues you have found earlier (filed as WORDSNET-16228) have been fixed in this Aspose.Words for .NET 20.10 update and this Aspose.Words for Java 20.10 update.