Extract Content Page wise in the document

ankitagupta17 · November 24, 2023, 6:37am

Hi, I have a docx file (please see the attachment) and want to extract content based on page number. For example, how can I extract the content from pages 1 and 3?
I am aware of the fact that “Word document is flow document and does not contain any information about its layout into lines and pages. Therefore, technically there is no “Page” concept in Word document. Aspose.Words uses our own Rendering Engine to layout documents into pages.”

So I have used DocumentPageSplitter to address this. But some of the data is still getting copied from the adjacent pages. Please find below the actual document and the extracted document with pages 1 and 3.
Note: Aspose.Words version used: name: ‘aspose-words’, version: ‘23.4’
Language used: Java
extracted Sample test doc.docx (15.7 KB)

Sample test doc.docx (54.2 KB)

Code piece:

String extractPages(ClippingRequestMessage requestMessage, File originalFile, List<Integer> pages) throws Exception {
    Document document = new Document(originalFile.getAbsolutePath());
    log.info("Document loaded: {}", originalFile.length());
    log.info("Total page numbers: {}", document.getPageCount());

    DocumentPageSplitter splitter = new DocumentPageSplitter(document);

    Document tempDocument = new Document();
    int counter = 0;
    sort(pages);
    for (int pageNum : pages) {
      counter++;
      if (counter == 1) {
        tempDocument = splitter.getDocumentOfPage(pageNum);
        cleanUpDocument(tempDocument);
      } else {
        Document tempNewDocument = splitter.getDocumentOfPage(pageNum);
        cleanUpDocument(tempNewDocument);
        tempDocument.appendDocument(tempNewDocument, ImportFormatMode.KEEP_DIFFERENT_STYLES);
      }
    }
    log.info("Document created");

    String tempClippedDocument = TempFileHelper.getTempClippedFile(requestMessage);
    tempDocument.save(tempClippedDocument);
    return tempClippedDocument;
  }

I tried upgrading to the latest version, 23.11, but it does not support “DocumentPageSplitter”. Is there an alternative that fixes the issue and also works with the latest version of Aspose.Words? Can you please help with this?

Thank you!

alexey.noskov · November 24, 2023, 8:12am

@ankitagupta17 DocumentPageSplitter was a custom class used to split document page by page. Now, you do not need DocumentPageSplitter to extract pages from the document. There is a built-in method Document.extractPages that does exactly what you need.

ankitagupta17 · November 24, 2023, 10:17am

Hi @alexey.noskov, Thanks for the quick response.

I tried implementing "Document.extractPages" as suggested, but the issue persists. Let me know if there are any changes that should be made to the code.
Please find the code snippet below:

String extractPages(ClippingRequestMessage requestMessage, File originalFile, List<Integer> pages) throws Exception {
    Document document = new Document(originalFile.getAbsolutePath());
    log.info("Document loaded: {}", originalFile.length());
    log.info("Total page numbers: {}", document.getPageCount());

    Document tempDocument = new Document();

    for (int pageNum : pages) {
      log.info("Extracted page number-->"+pageNum);
      if (pageNum > 0 && pageNum <= document.getPageCount()) {
        log.info("Extracted page from main doc --->"+pageNum+" "+document.getText());
        tempDocument = document.extractPages(pageNum - 1, 1);
        log.info("Extracted page text --->"+pageNum+" "+tempDocument.getText());
      }
    }
    log.info("Out of for loop");
    // Clean up the extracted document if needed
    cleanUpDocument(tempDocument);

    log.info("Document created");

    String tempClippedDocument = TempFileHelper.getTempClippedFile(requestMessage);
    tempDocument.save(tempClippedDocument);
    return tempClippedDocument;
  }

Also, tried with another sample document.
file-sample_500kB.docx (42.1 KB)

Thank you!

alexey.noskov · November 24, 2023, 11:29am

@ankitagupta17 Please use the following code to extract the required pages from the document:

private static Document getPages(Document src, List<Integer> pages) throws Exception
{
    // Create the target document: an empty clone that keeps the source document's styles and settings.
    Document result = (Document)src.deepClone(false);

    // Append the required pages to the target document.
    // Note: extractPages expects a zero-based page index.
    for(int pageIndex : pages)
        result.appendDocument(src.extractPages(pageIndex, 1), ImportFormatMode.USE_DESTINATION_STYLES);

    return result;
}
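
Since extractPages takes a zero-based index while page numbers in a request are usually 1-based, a small conversion step before the loop may help avoid off-by-one errors. The following is only a sketch: `PageIndexes` and `toZeroBased` are hypothetical names, not part of Aspose.Words, and silently skipping out-of-range page numbers is an assumption about the desired behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Aspose.Words.
class PageIndexes {
    /**
     * Converts 1-based page numbers into sorted, de-duplicated zero-based indexes.
     * Page numbers outside 1..pageCount are skipped (an assumption; throwing an
     * exception instead may be preferable in some applications).
     */
    static List<Integer> toZeroBased(List<Integer> pageNumbers, int pageCount) {
        TreeSet<Integer> indexes = new TreeSet<>();
        for (int pageNumber : pageNumbers) {
            if (pageNumber >= 1 && pageNumber <= pageCount) {
                indexes.add(pageNumber - 1);
            }
        }
        return new ArrayList<>(indexes);
    }
}
```

The resulting list could then be passed as the `pages` argument, with `pageCount` taken from `Document.getPageCount()`.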

ankitagupta17 · November 24, 2023, 12:50pm

Hi @alexey.noskov, I used the code snippet shared by you.

Still, the content is not extracted correctly. Let me know if I have missed anything.

String extractPages(ClippingRequestMessage requestMessage, File originalFile, List<Integer> pages) throws Exception {
    Document document = new Document(originalFile.getAbsolutePath());
    log.info("Document loaded: {}", originalFile.length());
    log.info("Total page numbers: {}", document.getPageCount());
    // Create the target document.
    Document tempDocument = (Document)document.deepClone(false);
    log.info("Document result: {}", tempDocument);
    // Append the required pages to the target document.
    for(int pageIndex : pages) {
      tempDocument.appendDocument(document.extractPages(pageIndex-1, 1), ImportFormatMode.USE_DESTINATION_STYLES);
      log.info("extracted data  "+pageIndex+" "+tempDocument.getText());
    }
    log.info("Document created");

    String tempClippedDocument = TempFileHelper.getTempClippedFile(requestMessage);
    tempDocument.save(tempClippedDocument);
    return tempClippedDocument;
  }

Screenshot for reference. The highlighted part is coming on the next page after downloading.

Thank you!

alexey.noskov · November 24, 2023, 12:57pm

@ankitagupta17 The problem might occur because the fonts used in your document are not available in the environment where the document is processed. The fonts are required to build the document layout. If Aspose.Words cannot find a font used in the document, the font is substituted. Because the substitute font has different metrics, this can lead to a different document layout and incorrect page detection. You can implement IWarningCallback to get notifications when font substitution is performed.
Please see our documentation to learn where Aspose.Words looks for fonts:
https://docs.aspose.com/words/net/specifying-truetype-fonts-location/

ankitagupta17 · November 28, 2023, 8:18am

Hi @alexey.noskov, can you give me some idea of how extractPages() works internally to split the pages in Aspose? I tried checking the font behavior, but that did not help much.

To fix the issue, I was thinking of converting the Word file to PDF, extracting the pages, and converting the result back to Word. Let me know if this could actually solve the issue and whether it would have any performance issues.
Or can you suggest some other approach to help me get this resolved?

Thank you!

alexey.noskov · November 28, 2023, 8:34am

@ankitagupta17 To split a document into pages, Aspose.Words builds the document layout internally and extracts the content of the required pages using the layout information. The same layout engine is used when saving a document to PDF or any other fixed-page document format.

I am afraid such an approach will not work well. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When a PDF document is converted to an MS Word document, the fixed-page document structure has to be converted into the flow document object model. Unfortunately, such a conversion does not guarantee 100% fidelity, so it is not always possible to retain the PDF document layout in the MS Word document. And such a conversion will definitely not retain the original Word document structure.

ankitagupta17 · November 28, 2023, 11:10am

@alexey.noskov Thanks for the quick response!

Could you please suggest an alternative approach to help me resolve this issue?

alexey.noskov · November 28, 2023, 1:10pm

@ankitagupta17 I am afraid the only way to split the document into pages is the Document.extractPages method. As far as I can see, the method works fine with your documents on my side, so it is not quite clear what the problem is. If possible, please describe the problem in more detail and provide the full code and the data required to reproduce it. We will check the issue once again and provide you with more information.

ankitagupta17 · December 4, 2023, 8:41am

Hi @alexey.noskov, I have tried to narrow down the problem a bit.
Basically, I load the same file in 2 different places, and getPageCount() returns a different value in each. I believe that once the page counts match in both places, the extractPages() issue will be fixed.

Snippet 1:

public String extractPages(RequestMessage requestMessage, File originalFile, List<Integer> pages) throws Exception {
    Document document = new Document(originalFile.getAbsolutePath());
    log.info("Total page numbers: {}", document.getPageCount());
}

Payload: RequestMessage(Id=d0149b, userId=[email], pageNumbers=[1, 2], documentId=ea29357, sourceObjectKey=ea2935a57/1/ea298657.docx)
I also tried changing the file type to docx instead of “.dat”, but that also didn’t work out.

The output is: Total page numbers: 15

Snippet 2:

 public String processFileMetaData(RequestQueueItem requestQueueItem) {

      File originalDocument = getOriginalDocument(requestQueueItem);

      Document document = new Document(originalDocument.getAbsolutePath());
      log.info("Total pages (Word): {}", document.getPageCount());
}

The output is: Total pages (Word): 13

Note: The path of the original file is exactly the same in both the cases which is AWS S3 bucket.
In Snippet1, I am using original file directly.
In Snippet2, the original file location is fetched from the AWS SQS.
But, I believe that should not have an impact, since the source bucket is same.

File used:
Sample test doc.docx (54.2 KB)

Can you please help me on this.

Thanks and regards!

alexey.noskov · December 4, 2023, 9:09am

@ankitagupta17 The difference might occur because different fonts are available in the environments where the code is executed. Could you please implement IWarningCallback to get notifications when font substitution is performed?

ankitagupta17 · December 4, 2023, 2:39pm

Hi @alexey.noskov, as you suggested, I have implemented IWarningCallback, and the method is getting called, which means that the fonts are not available in the environment where the code is executed.

I have made some modifications to the Dockerfile to include the fonts in the environment. This fixed the problem for documents that use “Times New Roman”, but it still fails for “Cambria”.

Below is the code snippet:

String extractPages(RequestMessage requestMessage, File originalFile, List<Integer> pages) throws Exception {
    for (FontSourceBase src : FontSettings.getDefaultInstance().getFontsSources())
    {
      for (PhysicalFontInfo fontInfo : src.getAvailableFonts())
      {
        log.info("Full Name --->"+fontInfo.getFullFontName());
        log.info("Font Name --->"+fontInfo.getFontFamilyName());
        log.info("File Path --->"+fontInfo.getFilePath());
      }
    }
    Document document = new Document(originalFile.getAbsolutePath());
    document.setWarningCallback(new FontSubstitutionWarningCollector());
}

public class FontSubstitutionWarningCollector implements IWarningCallback {

    public void warning(WarningInfo info) {
        if (info.getWarningType() == WarningType.FONT_SUBSTITUTION)
            System.out.println(info.getDescription());
    }
}

Logs:

2023-12-04 10:58:35.715 INFO 7 --- [ msgHandler - 2] Words : Total page numbers: 13
Font 'Cambria' has not been found. Using 'Noto Sans Mono' font instead. Reason: font info substitution.
2023-12-04 10:58:35.433 INFO 7 --- [ msgHandler - 2] Words : Document loaded: 55494

Our original documents used are in “Cambria”.
Can you please confirm whether Aspose.Words does not support the “Cambria” font yet, and whether there is any plan to include it in an upcoming release?

Thank you so much for your constant support and help @alexey.noskov.
Much appreciated!

alexey.noskov · December 4, 2023, 2:49pm

@ankitagupta17 Aspose.Words supports the Cambria font. But the font must be physically available in the environment where the document is processed.

ankitagupta17 · December 4, 2023, 7:36pm

Hi @alexey.noskov , I have used the below code to install all the fonts on the environment but it still substitutes font in case of “Cambria”.

RUN apt-get update && \
    apt-get install -y awscli wget zip python3-pip xfonts-utils ca-certificates cabextract xfonts-intl-chinese fonts-arphic-ukai fonts-arphic-uming fonts-ipafont-mincho fonts-ipafont-gothic fonts-unfonts-core && \
    wget http://ftp.de.debian.org/debian/pool/contrib/m/msttcorefonts/ttf-mscorefonts-installer_3.7_all.deb && \
    wget http://ftp.de.debian.org/debian/pool/contrib/m/msttcorefonts/msttcorefonts_3.8.tar.xz && \
    dpkg -i ttf-mscorefonts-installer_3.7_all.deb && \
    tar -xf msttcorefonts_3.8.tar.xz && \
    rm ttf-mscorefonts-installer_3.7_all.deb && \
    rm -rf msttcorefonts-3.8 msttcorefonts_3.8.tar.xz && \

Is there any other library which needs to be used?

Thanks!

alexey.noskov · December 5, 2023, 6:31am

@ankitagupta17 Is Cambria font available when you print the available fonts using the following code?

for (FontSourceBase src : FontSettings.getDefaultInstance().getFontsSources())
{
    for (PhysicalFontInfo fontInfo : src.getAvailableFonts())
    {
        log.info("Full Name --->" + fontInfo.getFullFontName());
        log.info("Font Name --->" + fontInfo.getFontFamilyName());
        log.info("File Path --->" + fontInfo.getFilePath());
    }
}

The Cambria font might not be included in the font packages you have installed. Please try copying the font from a Windows machine into your Docker image.

ankitagupta17 · December 5, 2023, 8:36am

Hi @alexey.noskov, no, it's not available. This is printed instead.

Font 'Cambria' has not been found. Using 'Noto Sans Mono' font instead. Reason: font info substitution.

2023-12-05 08:23:52.422 INFO 7 --- [ msgHandler - 2] [00] Words : Font Name --->Noto Sans Mono
2023-12-05 08:23:52.422 INFO 7 --- [ msgHandler - 2] [00] Words : Full Name --->Noto Sans Mono Regular
2023-12-05 08:23:52.422 INFO 7 --- [ msgHandler - 2] [00] Words : File Path --->/usr/share/fonts/truetype/noto/NotoSansMono-Bold.ttf

Can you please advise which particular font package should be installed?

Thanks!

alexey.noskov · December 5, 2023, 10:31am

@ankitagupta17 I am afraid there is no font package that includes the Cambria font, or at least I am not aware of such a font package for Linux.
An easy and quick way to get TrueType fonts on a Linux system is to copy the .TTF and .TTC files from the C:\Windows\Fonts directory on a Windows machine to some directory on your Linux machine. You do not need to install or register these fonts on Linux in any way; you just need to specify the location of the fonts using the FontSettings class in Aspose.Words.
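
Before pointing Aspose.Words at the copied folder, it may be worth checking that the Cambria files actually ended up in the image. The sketch below uses only the JDK. `FontFolderCheck` and `findFontFiles` are hypothetical names, and matching on the file-name prefix is an assumption (on Windows the Cambria files typically have names starting with "cambria"); it does not read the font family name from inside the file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Aspose.Words.
class FontFolderCheck {
    /**
     * Lists .ttf/.ttc files under fontDir (searched recursively) whose file name
     * starts with namePrefix. The comparison is case-insensitive.
     */
    static List<Path> findFontFiles(Path fontDir, String namePrefix) throws IOException {
        String prefix = namePrefix.toLowerCase(Locale.ROOT);
        try (Stream<Path> files = Files.walk(fontDir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                    return name.startsWith(prefix)
                        && (name.endsWith(".ttf") || name.endsWith(".ttc"));
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

If the list is not empty, the same folder can be registered with `FontSettings.getDefaultInstance().setFontsFolder(folder, true)` before the document is loaded, and the font-listing loop shown earlier in this thread should then report Cambria.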