Aspose splits DOCX file incorrectly in Docker, but works fine in local environment

I am using Aspose to split a DOCX file into separate DOCX files for each page, and I am creating a Python Azure Function to perform this task. The splitting works correctly in both an Ubuntu Docker image and my local environment. However, when running in the Azure Functions Docker image, the last sentence of each page is incorrectly moving to the next page.

Before split:

After split:
Page 11:

Here the sentence “Please complete the following table which splits out the main sources of alpha for this strategy:” is missing from page 11 and appears on page 12 instead.

Dockerfile I used for the Azure Function:

FROM mcr.microsoft.com/azure-functions/python:4-python3.11
# Set noninteractive frontend to suppress apt warnings
ENV DEBIAN_FRONTEND=noninteractive

# Enable contrib repository and install ttf-mscorefonts-installer
RUN apt-get update && \
    apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository contrib && \
    apt-get update && \
    echo "msttcorefonts msttcorefonts/accepted-mscorefonts-eula select true" | debconf-set-selections && \
    apt-get install -y --no-install-recommends ttf-mscorefonts-installer && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y \
    ttf-mscorefonts-installer \
    libfontconfig1 \
    libgdiplus
RUN apt update && apt upgrade -y
RUN apt install -y wget
RUN wget http://archive.ubuntu.com/ubuntu/pool/main/i/icu/libicu66_66.1-2ubuntu2_amd64.deb
RUN dpkg -i ./libicu66_66.1-2ubuntu2_amd64.deb
RUN wget http://security.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.0g-2ubuntu4_amd64.deb
RUN dpkg -i ./libssl1.1_1.1.0g-2ubuntu4_amd64.deb
RUN wget http://ftp.us.debian.org/debian/pool/main/libm/libmspack/libmspack0_0.11-1_amd64.deb
RUN dpkg -i ./libmspack0_0.11-1_amd64.deb
RUN wget http://ftp.us.debian.org/debian/pool/main/c/cabextract/cabextract_1.9-3_amd64.deb
RUN dpkg -i ./cabextract_1.9-3_amd64.deb
RUN wget http://ftp.us.debian.org/debian/pool/contrib/m/msttcorefonts/ttf-mscorefonts-installer_3.8.1_all.deb
RUN dpkg -i ./ttf-mscorefonts-installer_3.8.1_all.deb
RUN apt --fix-broken install -y
 
RUN rm -i libssl1.1_1.1.0g-2ubuntu4_amd64.deb
RUN wget -q https://packages.microsoft.com/config/debian/11/packages-microsoft-prod.deb
RUN dpkg -i packages-microsoft-prod.deb
RUN apt update
RUN apt install azure-functions-core-tools-4
RUN rm -i packages-microsoft-prod.deb
 
RUN apt-get update && apt-get install -y curl apt-transport-https lsb-release gnupg
 
# Install Azure CLI
RUN curl -sL https://aka.ms/InstallAzureCLIDeb | bash

######
ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true
 
COPY requirements.txt /
RUN pip install -r /requirements.txt
 
COPY . /home/site/wwwroot
WORKDIR /home/site/wwwroot

With the Dockerfile above I get the issue, but with the Ubuntu image I do not.

Dockerfile with the Ubuntu image:

FROM ubuntu:22.04
RUN apt update && apt install -y python3.11
RUN echo ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true | debconf-set-selections
RUN apt install -y ttf-mscorefonts-installer
RUN apt install -y python3-pip
RUN apt install -y wget
RUN python3.11 -m pip install pillow
RUN python3.11 -m pip install --upgrade pip
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
RUN update-alternatives --auto python3
RUN wget http://security.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.0g-2ubuntu4_amd64.deb
RUN dpkg -i ./libssl1.1_1.1.0g-2ubuntu4_amd64.deb
RUN rm -i libssl1.1_1.1.0g-2ubuntu4_amd64.deb
RUN wget -q https://packages.microsoft.com/config/ubuntu/22.04/packages-microsoft-prod.deb
RUN dpkg -i packages-microsoft-prod.deb
RUN apt update
RUN apt install azure-functions-core-tools-4
RUN rm -i packages-microsoft-prod.deb

RUN apt-get update && apt-get install -y curl apt-transport-https lsb-release gnupg

# Install Azure CLI
RUN curl -sL https://aka.ms/InstallAzureCLIDeb | bash

######
ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true

COPY requirements.txt /
RUN pip install -r /requirements.txt

COPY . /home/site/wwwroot
WORKDIR /home/site/wwwroot

@sachithaPDF Most likely the problem occurs because the fonts required for building the document layout are not available in your environment. Aspose.Words needs the fonts to build an accurate document layout. If it cannot find the fonts used in the document, it substitutes them, and the difference in font metrics can change the layout.
Please see our documentation to learn where Aspose.Words looks for fonts:
https://docs.aspose.com/words/python-net/specifying-truetype-fonts-location/
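
One way to check this from inside the container is to compare the fonts the document declares with the fonts Aspose.Words can actually see. The sketch below is not from the thread: `missing_fonts` and `report_missing_fonts` are hypothetical helper names, and it assumes that `font_infos`, `FontSettings.default_instance`, `get_fonts_sources()` and `get_available_fonts()` behave as described in the Aspose.Words for Python API reference, so please verify them against your installed version.

```python
def missing_fonts(required, available):
    """Return the required font names that have no case-insensitive match in `available`."""
    available_lower = {name.lower() for name in available}
    return sorted(name for name in set(required) if name.lower() not in available_lower)


def report_missing_fonts(docx_path):
    """List fonts declared in the document that the default font sources cannot find."""
    # Imported lazily so the pure helper above works without the library installed.
    import aspose.words as aw

    doc = aw.Document(docx_path)
    # Fonts declared in the document's font table (declared, not necessarily all used).
    required = [info.name for info in doc.font_infos]
    # Fonts visible through the default font sources in this environment (assumed API).
    available = []
    for source in aw.fonts.FontSettings.default_instance.get_fonts_sources():
        for font in source.get_available_fonts():
            available.append(font.font_family_name)
    return missing_fonts(required, available)
```

Running `report_missing_fonts("docx path")` in both the working and the failing image and comparing the results should show whether substitution is a plausible cause.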

The document contains only the basic font ‘Arial’ and I have installed the ttf-mscorefonts-installer, but the issue still occurs. After conducting some tests, I found that the same issue sometimes occurs in the Ubuntu image as well. It seems like the splitting functionality only works correctly in the Windows environment. Any idea how to fix this issue?

@sachithaPDF Most likely the fonts are not accessible. Please try putting the required fonts into a separate folder and using this folder as the font source, as described in our documentation:
https://docs.aspose.com/words/python-net/specifying-truetype-fonts-location/
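
A minimal sketch of this suggestion, assuming the fonts were copied into a folder inside the image. The helper names `list_font_files` and `load_with_fonts` are hypothetical, and the `FontSettings.set_fonts_folder(folder, recursive)` call and the `Document.font_settings` property are used as I understand them from the linked documentation page.

```python
import os

FONT_EXTENSIONS = (".ttf", ".ttc", ".otf")


def list_font_files(fonts_dir):
    """Recursively list font files so an empty or mis-copied folder is caught early."""
    found = []
    for root, _dirs, files in os.walk(fonts_dir):
        for name in files:
            if name.lower().endswith(FONT_EXTENSIONS):
                found.append(os.path.join(root, name))
    return sorted(found)


def load_with_fonts(docx_path, fonts_dir):
    """Load a document and point its layout at `fonts_dir` instead of system fonts."""
    if not list_font_files(fonts_dir):
        raise FileNotFoundError(f"No font files found under {fonts_dir}")
    # Imported lazily so the folder check above works without the library installed.
    import aspose.words as aw

    settings = aw.fonts.FontSettings()
    # True = also scan subfolders of fonts_dir (assumed signature, see lead-in).
    settings.set_fonts_folder(fonts_dir, True)
    doc = aw.Document(docx_path)
    doc.font_settings = settings
    return doc
```

The returned document can then be passed through the same `extract_pages` loop. Setting the font source before any page extraction matters, because the layout is built from whatever fonts are resolvable at that point.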

Seems like the issue was resolved after this, but we need to run more test cases to verify it. Now we are encountering another issue with the numbering. The following image shows the 15th page of the original document before splitting. After splitting, the question number is missing.

Before splitting:

After Splitting:

import aspose.words as aw

# Apply the license before loading the document
license = aw.License()
license_file_path = "lic path"
license.set_license(license_file_path)

doc = aw.Document("docx path")
# extract_pages takes a zero-based page index and a page count
for page in range(18):
    extracted_page = doc.extract_pages(page, 1)
    # Name the output files 1.docx, 2.docx, ...
    extracted_page.save(f"{page + 1}.docx")

When I merge them back, all the numbering changes. The number highlighted below should be 23 instead of 16. I have attached the merged document here so you can go through it and understand the issue.

import aspose.words as aw

# Apply the license before loading the document
license = aw.License()
license_file_path = "lic path"
license.set_license(license_file_path)

# Split the first four pages into 1.docx .. 4.docx
doc = aw.Document("test.docx")
for page in range(4):
    extracted_page = doc.extract_pages(page, 1)
    extracted_page.save(f"{page + 1}.docx")

# Merge the pages back into a single document
docx_file_paths = ["1.docx", "2.docx", "3.docx", "4.docx"]
merged_document = aw.Document()
merged_document.remove_all_children()
for file_name in docx_file_paths:
    source_doc = aw.Document(file_name)
    merged_document.append_document(source_doc, aw.ImportFormatMode.KEEP_SOURCE_FORMATTING)
merged_document.save("merged2.docx")

Original file:
test.docx (45.1 KB)

Merged file:
merged2.docx (22.6 KB)

@sachithaPDF
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27799

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

PS: The problem can be partially resolved by setting the ImportFormatOptions.keep_source_numbering property when merging the documents back.
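
A sketch of how that option could be passed when merging, assuming the three-argument `append_document(src_doc, import_format_mode, import_format_options)` overload. The helper names `page_sort_key` and `merge_pages` are hypothetical, and the numeric sort is my own addition so that `10.docx` is not merged before `2.docx` when the file list comes from a directory listing.

```python
import os
import re


def page_sort_key(file_name):
    """Order '2.docx' before '10.docx' by comparing the leading page number numerically."""
    base = os.path.basename(file_name)
    match = re.match(r"(\d+)", base)
    # Files without a leading number sort last, in name order.
    return (int(match.group(1)) if match else float("inf"), base)


def merge_pages(file_names, output_path, keep_source_numbering=True):
    """Append the page documents in page order and save the merged result."""
    # Imported lazily so the sort helper above works without the library installed.
    import aspose.words as aw

    options = aw.ImportFormatOptions()
    options.keep_source_numbering = keep_source_numbering

    merged = aw.Document()
    merged.remove_all_children()
    for name in sorted(file_names, key=page_sort_key):
        source = aw.Document(name)
        merged.append_document(source, aw.ImportFormatMode.KEEP_SOURCE_FORMATTING, options)
    merged.save(output_path)
```

Keeping `keep_source_numbering` as a parameter makes it easy to compare both settings on the same set of pages, for example `merge_pages(["1.docx", "2.docx", "3.docx", "4.docx"], "merged2.docx", keep_source_numbering=False)`.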

Thanks for the update. The ‘keep_source_numbering’ worked for the document mentioned above but did not work correctly with other documents I tested. I added all the fonts available in my Windows environment and configured the font directory as per the documentation. However, some lines are still being moved to the wrong pages during splitting.

@sachithaPDF As mentioned above, MS Word documents are flow documents by nature. Splitting a flow document into pages is quite a complex task and depends on many factors. If possible, please attach the problematic document here for testing. We will check it and provide you with more information.

Thank you for the quick response. I will take steps on my end to handle the line-shifting issue. However, I have another issue when merging the split documents back into a single document: the numbering format changes. For the document mentioned above, the numbers are retained perfectly except for the missing one. However, with the keep_source_numbering option enabled, another document failed to keep the question numbering correct during merging. Interestingly, when I removed that option, it worked fine. I’ll add a zip file containing the docx files.
doc 1.zip (523.6 KB)

@sachithaPDF
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27801

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@sachithaPDF We have completed the analysis of WORDSNET-27801 and concluded that this is not a bug.
When the KeepSourceNumbering option is set to true, Aspose.Words clones a list from the source document with a new list definition identifier if a list with the same identifier already exists in the destination document.

The changed list item has

<w:nsid w:val="1DBE271F" />

in the source 3.docx document, and there is a list definition with exactly the same identifier in the destination 1.docx document. So when the option is set to true, the source list and its definition are cloned with a new identifier, by design.

For the moment, you should not use the KeepSourceNumbering option in this case to achieve the desired result. Note that this option has no analogue in Word.
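
To see in advance whether two page documents share a list definition identifier, the `w:nsid` values can be read straight from the DOCX package. This is a sketch using only the Python standard library, not an Aspose.Words API. It assumes the identifiers live in `word/numbering.xml`, as in standard WordprocessingML, and `list_nsids` and `shared_nsids` are hypothetical helper names.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_nsids(docx_file):
    """Return the set of w:nsid values in word/numbering.xml (empty if the part is absent)."""
    with zipfile.ZipFile(docx_file) as archive:
        if "word/numbering.xml" not in archive.namelist():
            return set()
        root = ET.fromstring(archive.read("word/numbering.xml"))
    nsids = set()
    for node in root.iter(f"{{{W_NS}}}nsid"):
        value = node.get(f"{{{W_NS}}}val")
        if value:
            # Normalize case so hex identifiers compare reliably.
            nsids.add(value.upper())
    return nsids


def shared_nsids(destination, source):
    """Identifiers present in both documents, i.e. lists that would be cloned on merge."""
    return list_nsids(destination) & list_nsids(source)
```

For the files discussed above, `shared_nsids("1.docx", "3.docx")` would be expected to contain `1DBE271F`, which could be used to decide per document pair whether to enable the option.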