Issues with parts of paragraphs splitting up and moving to different pages

We are implementing functionality to take specific pages from Word documents and turn them into new Word documents, using the Aspose library, version 18.11.

We found the PageSplitter code in the Aspose GitHub repository and have adapted it for our own code. https://reference.aspose.com/words/java/com.aspose.words/Document#extractPages(int,int)

Unfortunately there are some documents which, when manipulated using Aspose, have parts of paragraphs shifted to subsequent pages.

As an example, please see the original file, a 2-page document called “example.doc”, attached (please note it is within the Archive.zip)

We run the code below against “example.doc”. To summarise, it is supposed to create a new document made up of the pages of the original document that are listed in the pages parameter (a List<Integer>). I can provide a compilable version of this code on request.

    String extractPages(ClippingRequestMessage requestMessage, File originalFile, List<Integer> pages) throws Exception {
      Document document = new Document(originalFile.getAbsolutePath());
      DocumentPageSplitter splitter = new DocumentPageSplitter(document);

      Document tempDocument = new Document();

      int counter = 0;
      sort(pages);
      for (int pageNum : pages) {
        counter++;
        if (counter == 1) {
          tempDocument = splitter.getDocumentOfPage(pageNum);
          cleanUpDocument(tempDocument);
        } else {
          Document tempNewDocument = splitter.getDocumentOfPage(pageNum);
          tempDocument.appendDocument(tempNewDocument, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        }
      }

      String tempClippedDocument = TempFileHelper.getTempClippedFile(requestMessage);
      tempDocument.save(tempClippedDocument);
      return tempClippedDocument;
    }

If we call the above function asking for pages 1 and 2 of the original document, the resulting document is “example-clipped.doc”, attached within Archive.zip. Comparing “example.doc” and “example-clipped.doc”, you can see that the last paragraph on page 1 of “example.doc” is split up in “example-clipped.doc”, with parts of the paragraph appearing on both pages 1 and 2.

This is a problem for us because we expect the new document to retain its original format and paragraph structure, even if it is only a subset of pages of the original document.

We would greatly appreciate some help on retaining the original structure of the word document and the paragraphs within, even when splitting the document.

Additionally, we have noticed that this splitting of paragraphs also occurs when converting a Word document to a PDF using Aspose.

Using the below code against “example.doc”:

    String originalFilePath = originalFile.getAbsolutePath();
    Document document = new Document(originalFilePath);
    String directory = Paths.get(originalFilePath).getParent().toString();
    PdfSaveOptions options = new PdfSaveOptions();
    options.setPreserveFormFields(true);
    options.setUseCoreFonts(true);
    options.setJpegQuality(10);
    options.setExportDocumentStructure(true);
    String tempPdfFilePath = directory + File.separator + TEMP_PDF;
    options.setCompliance(PdfCompliance.PDF_15);
    document.save(tempPdfFilePath, options);

The resulting file, “example-clipped.pdf” (also in Archive.zip), also has the split-paragraph problem, as can be seen at the end of page 1 of the document.

We would also appreciate assistance with this issue, as it may be linked to the original problem with splitting the document file.

Kind regards,

Sunny
Archive.zip (34.5 KB)

@sjunejo

Thanks for your inquiry. We are investigating this issue and will get back to you soon.

@sjunejo

We have tested the scenario using the latest version of Aspose.Words for Java, 19.2, and could not reproduce the reported issue. Please check the attached output documents and upgrade to Aspose.Words for Java 19.2.
Docs.zip (77.4 KB)

@tahir.manzoor

Hi Tahir,

Many thanks for the quick response.

Unfortunately, looking at the files in Docs.zip, the issue still persists.

The screenshot below “original_doc.png” is from the original “example.doc”.

“new_pdf.png” shows the resulting pdf, and “new_doc_page_2.png” is from the new document file.

Please note how page 2 of the new files has a bit of the paragraph from the original page 1.

Any other information required, please let me know.

original_doc.png (564.4 KB)
new_pdf.png (231.9 KB)
new_doc_page_2.png (205.9 KB)

Kind regards

Sunny

@sjunejo

Thanks for sharing the details. Please check the attached image from MS Word 2016. The output PDF and the extracted Word documents match the MS Word 2016 output. Could you please share the MS Word version that you are using?

@tahir.manzoor

Hi Tahir,

Thanks for the reply. We are using Microsoft Word version 16.22.

I can see the same issue in the screenshot you posted - part of the paragraph from the first page is at the top of the second page. Is there a way to stop this from happening when using Aspose? Does the original document have to be converted to a different format first?

@sjunejo

Thanks for sharing the details. The following text is at the top of the second page in the Word document. MS Word renders it on the second page. Please check the attached image.

Pellentesque ac ante volutpat, varius velit vel, sagittis neque. Vestibulum quis felis et tortor volutpat imperdiet.

@tahir.manzoor

Thanks for the reply - This is interesting. Can you please confirm that the above screenshot is from the original file “example.doc”?

If so, any idea why the document layout would be different on your version of Microsoft Word and mine? Could there be hidden characters, etc., being rendered?

@sjunejo

Thanks for your inquiry.

Yes, the screenshot is of document “example.doc”.

Please make sure that the Arial and Times New Roman fonts are installed on your system. Could you please check this document on another system and let us know if you still face the same issue?
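One rough way to check for these two fonts from Java is sketched below. It only reports the font families the JVM itself can see through java.awt.GraphicsEnvironment; it is an assumption that this approximates what MS Word or Aspose.Words finds on the same machine, so treat the result as a hint rather than proof. The class name FontCheck and the helper checkFonts are made up for this illustration.

```java
import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Aspose.Words.
final class FontCheck {

    /** For each wanted family name, reports whether it appears in the installed list (case-insensitive). */
    static Map<String, Boolean> checkFonts(List<String> wanted, String[] installed) {
        Set<String> known = Arrays.stream(installed)
                .map(name -> name.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String family : wanted) {
            result.put(family, known.contains(family.toLowerCase(Locale.ROOT)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Font families as seen by this JVM; Word or Aspose.Words may see a different set.
        String[] installed = GraphicsEnvironment.getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();
        checkFonts(Arrays.asList("Arial", "Times New Roman"), installed)
                .forEach((family, present) ->
                        System.out.println(family + ": " + (present ? "installed" : "MISSING")));
    }
}
```

If a font is reported as missing, the text is presumably rendered with a substitute font that has different metrics, which could move line and page breaks.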

If you still face the issue, please share the following details for further testing.

What environment are you running on?

  • Operating System detail.
  • Architecture (32 / 64 bit)
  • Provide information about your specific culture, such as the name of the culture, language and country/region.
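The details requested above can be collected from a JVM on the affected machine with standard system properties. This is a minimal sketch, on the assumption that “culture” corresponds to what Java calls the default locale; the class name EnvironmentReport is made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Aspose.Words.
final class EnvironmentReport {

    /** Collects operating system, architecture and locale details in a stable order. */
    static Map<String, String> collect() {
        Locale locale = Locale.getDefault();
        Map<String, String> info = new LinkedHashMap<>();
        info.put("os.name", System.getProperty("os.name"));
        info.put("os.version", System.getProperty("os.version"));
        // Note: os.arch describes the JVM build, which may differ from the OS architecture.
        info.put("os.arch", System.getProperty("os.arch"));
        info.put("locale", locale.toLanguageTag());
        info.put("language", locale.getDisplayLanguage(Locale.ENGLISH));
        info.put("country", locale.getDisplayCountry(Locale.ENGLISH));
        return info;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The printed lines can be pasted directly into a reply.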

@tahir.manzoor

Thanks for the reply and additional info.

I have just tried the same document on another machine (a Windows machine) and the format on that machine matches your screenshot. I suppose this means the same document’s format appears differently on different machines! Any idea why this might be the case? I am aware this might be outside of the scope of Aspose support - but if it affects the results of the PageSplitter code (https://reference.aspose.com/words/java/com.aspose.words/Document#extractPages(int,int)) I think it’s important to understand the root cause, if possible.

I first tried opening the document on a 64-bit MacBook running macOS High Sierra.

Can you please clarify what you mean by “culture”? The language I am working on the document with is English (US).

@sjunejo

Thanks for sharing the details. Please note that Aspose.Words mimics the behavior of MS Word 2016 when rendering a document to fixed-page file formats, e.g. PDF. The PageSplitter utility uses the same MS Word 2016 page layout when splitting a document.

Could you please share the version of Office for Mac that you are using? We will investigate this issue and provide more details in response to your query.

@tahir.manzoor

Hi Tahir,

Many thanks for the information. We will keep this in mind going forward with the Aspose Words API.

We are using Office 2019 for Mac, and specifically, Microsoft Word version 16.22.

@sjunejo

Thanks for sharing the details. Please install the ‘Arial’ and ‘Times New Roman’ fonts on your Mac and take a screenshot of the Word document. Please share that screenshot here for further testing. Thanks for your cooperation.