Compress DOCX file to reduce its size using Aspose.Words

I need to compress DOCX files of about 500KB down to below 100KB so they can be translated with AWS Translate. The Aspose Cloud API has document compression methods, but in Aspose.Words for Java I cannot find any useful compress methods for DOCX files. Please let me know the best way to achieve this. Thanks

@ikarthik25 You can use the Document.cleanup method and specify OoxmlSaveOptions.CompressionLevel to reduce the size of the output DOCX document.
In addition, you can reduce the quality of the images in your document. You can use an approach like the following:

Document doc = new Document("C:\\Temp\\in.docx");
// Get shapes with images from the document.
// Loop through the image shapes.
for (Shape s : (Iterable<Shape>)doc.getChildNodes(NodeType.SHAPE, true))
{
    if (!s.hasImage())
        continue;

    // Get the original image bytes.
    byte[] imageBytes = s.getImageData().getImageBytes();

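    // Process the image as required (for example, re-encode it at a lower quality).
    // .................
    byte[] processedImageBytes = imageBytes; // placeholder: replace with the processed bytes

    s.getImageData().setImageBytes(processedImageBytes);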
}
// Save the output.
doc.save("C:\\Temp\\out.docx");

Thanks @alexey.noskov for your response. I made the changes you suggested and also added the image processing part shown below. The DOCX is getting compressed and can be opened, but the AWS Translate API throws an error for this compressed DOCX file.

Compression Method:

public static byte[] compressImage(byte[] imageBytes, float quality) throws IOException {
    BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));

    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpg");

    if (!writers.hasNext()) {
        throw new IllegalStateException("No writers found for JPG format");
    }

    ImageWriter writer = writers.next();
    ImageOutputStream imageOutputStream = ImageIO.createImageOutputStream(byteArrayOutputStream);
    writer.setOutput(imageOutputStream);

    ImageWriteParam param = writer.getDefaultWriteParam();
    param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
    param.setCompressionQuality(quality); // Compression quality (0.0 to 1.0)

    writer.write(null, new javax.imageio.IIOImage(image, null, null), param);
    imageOutputStream.close();
    writer.dispose();

    return byteArrayOutputStream.toByteArray();
}
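
A more defensive variant of the method above can be sketched as follows. This is only an illustration, not a confirmed fix for the Translate error: it assumes that some images in a document may be in formats ImageIO cannot decode (in which case `ImageIO.read` returns null) or may carry an alpha channel, which JPEG cannot store. The class name `SafeJpegCompressor`, the method name `compressToJpeg`, and the choice of a white background are hypothetical.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

final class SafeJpegCompressor {

    /** Re-encodes image bytes as JPEG; returns the input unchanged if it cannot be decoded. */
    static byte[] compressToJpeg(byte[] imageBytes, float quality) {
        try {
            BufferedImage source = ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (source == null) {
                // No ImageIO reader understands these bytes: leave the image alone.
                return imageBytes;
            }

            // JPEG has no alpha channel, so paint the image onto an opaque white RGB canvas.
            // (White is an assumption; pick whatever background suits the document.)
            BufferedImage rgb = new BufferedImage(
                    source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
            Graphics2D g = rgb.createGraphics();
            try {
                g.setColor(Color.WHITE);
                g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
                g.drawImage(source, 0, 0, null);
            } finally {
                g.dispose();
            }

            Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpg");
            if (!writers.hasNext()) {
                throw new IllegalStateException("No writers found for JPG format");
            }
            ImageWriter writer = writers.next();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ImageOutputStream ios = ImageIO.createImageOutputStream(out)) {
                writer.setOutput(ios);
                ImageWriteParam param = writer.getDefaultWriteParam();
                param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                param.setCompressionQuality(quality); // 0.0 (smallest) to 1.0 (best)
                writer.write(null, new IIOImage(rgb, null, null), param);
            } finally {
                writer.dispose();
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The result could be passed to `s.getImageData().setImageBytes(...)` in the loop shown earlier in the thread.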

Error:

2025-03-13 09:58:26 ERROR [http-nio-5001-exec-1] c.l.p.t.s.ProcessDocxServiceImpl - Error reading or writing file: The Request Is Invalid (Service: Translate, Status Code: 400, Request ID: 0d452be9-7a85-4f2e-bb9c-52086db96d79)

It looks like AWS Translate has some issue with the compressed DOCX files.

@ikarthik25 Unfortunately, the provided error message does not give any useful information about what the problem with the document is. I do not think the problem is related to Aspose.Words. Have you tried contacting AWS Translate support to determine the problem with the provided document?

For testing purposes, please try removing all shapes from the document. Will AWS Translate handle the document without any images at all?

You're right, the error message sent by AWS Translate is not useful. But I am seeing this issue only with this document. I have tested a few other DOCX files like this one with embedded images; their sizes are reduced and Translate works fine with them. Let me try some more DOCX files with embedded images and a size of around 500KB, and then come back with the results.


@alexey.noskov When I compress the DOCX files with the Aspose Cloud API code below, Translate works fine without issues. But when I use the code I posted earlier, with the Aspose.Words JAR and the image compression, Translate throws the error. Does the standalone Aspose.Words JAR have a similar kind of CompressOptions class to the one exposed in the Cloud API? I am not able to find one.

Aspose Cloud API Method:

import com.aspose.words.cloud.ApiClient;
import com.aspose.words.cloud.api.WordsApi;
import com.aspose.words.cloud.model.CompressOptions;
import com.aspose.words.cloud.model.requests.CompressDocumentOnlineRequest;
import com.aspose.words.cloud.model.responses.CompressDocumentOnlineResponse;
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public static void compressDocumentApi(String inputDocPath) {
        ApiClient apiClient = new ApiClient(AsposeLicenseConfig.clientId, AsposeLicenseConfig.clientSecret, null);
        WordsApi wordsApi = new WordsApi(apiClient);
        try {
            // Compress the document
            String outputPath = inputDocPath + "_compressed.docx";
            byte[] requestDocument = Files.readAllBytes(Paths.get(inputDocPath).toAbsolutePath());
            CompressOptions requestCompressOptions = new CompressOptions();
            requestCompressOptions.setImagesQuality(25);
            requestCompressOptions.setImagesReduceSizeFactor(2);

            CompressDocumentOnlineRequest compressDocumentRequest = new CompressDocumentOnlineRequest(
                    requestDocument, requestCompressOptions, null, null, null, null,  outputPath );
            CompressDocumentOnlineResponse compressDocumentResponse = wordsApi.compressDocumentOnline(compressDocumentRequest);
            byte[] docxBytes = compressDocumentResponse.getDocument().get(compressDocumentResponse.getDocument().keySet().iterator().next());
            try (FileOutputStream fos = new FileOutputStream(inputDocPath)) {
                fos.write(docxBytes);
            }

        } catch (Exception e) {
            logger.error("Error compressing document: {}", e.getMessage());
        }
    }

@ikarthik25 There is no built-in method for document compression in Aspose.Words for Java. I consulted with the Aspose.Words for Cloud team, and their compress method actually compresses the images in the document.
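
Since image compression is the part that has to be done by hand on the Java side, the size-reduction half of the Cloud options can be approximated with plain JDK code. The sketch below assumes that a reduce factor of 2 means "halve the width and height"; that interpretation of `setImagesReduceSizeFactor` is a guess, not something confirmed in this thread. The class name `ImageDownscaler` and the method name `reduce` are hypothetical.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class ImageDownscaler {

    /** Returns a copy of the image with width and height divided by the given factor. */
    static BufferedImage reduce(BufferedImage source, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        // Never go below one pixel in either dimension.
        int width = Math.max(1, source.getWidth() / factor);
        int height = Math.max(1, source.getHeight() / factor);

        // Opaque RGB target so the result can be written as JPEG afterwards.
        BufferedImage scaled = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setColor(Color.WHITE); // background for transparent pixels (an assumption)
            g.fillRect(0, 0, width, height);
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }
}
```

The scaled image can then be encoded at a lower JPEG quality before the bytes are written back to the shape.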

Could you please attach your input document and outputs produced by cloud and java versions of Aspose.Words? We will check the difference and provide you more information.

Compress-files.zip (3.0 MB)
In the attached Zip file I have added the required documents.

  1. OCR-Document.docx - The input DOCX file which needs to be compressed.
  2. oversized-files - The folder which contains the oversized (above 100KB) documents which were split from the input DOCX.
  3. working-compressed-docs-aspose-api - The folder which contains the compressed documents (using the Aspose Cloud API) that work in Translate.
  4. not-working-compressed-docs-aspose-jar - The folder which contains the compressed documents (using the Aspose JAR) that do not work in Translate.

Please let me know the difference between these two compression modes. Thanks!

@ikarthik25 Do you perform any other actions on the document in your code besides image resizing? I see that the measurement units in your JAR output document are points, while in the original they are EMUs. Also, tags and attribute values are renamed:
Original:

<w:tblPr>
  <w:jc w:val="left" />
  <w:tblLayout w:type="fixed" />
  <w:tblCellMar>
    <w:top w:w="0" w:type="dxa" />
    <w:left w:w="0" w:type="dxa" />
    <w:bottom w:w="0" w:type="dxa" />
    <w:right w:w="0" w:type="dxa" />
  </w:tblCellMar>
</w:tblPr>

Problematic:

<w:tblPr>
  <w:jc w:val="start" />
  <w:tblLayout w:type="fixed" />
  <w:tblCellMar>
    <w:top w:w="0pt" w:type="dxa" />
    <w:start w:w="0pt" w:type="dxa" />
    <w:bottom w:w="0pt" w:type="dxa" />
    <w:end w:w="0pt" w:type="dxa" />
  </w:tblCellMar>
</w:tblPr>

It looks like you have specified OoxmlCompliance.ISO_29500_2008_STRICT in OoxmlSaveOptions. Please try using the default value of OoxmlSaveOptions.Compliance. AWS Translate probably does not accept ISO_29500_2008_STRICT DOCX documents.
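
To see whether a given DOCX was saved in Strict mode without opening it in an editor, the main document part can be checked for the Strict WordprocessingML namespace. The sketch below uses only the JDK (Java 9 or later for `readAllBytes`) and assumes the main part is stored at `word/document.xml` and encoded as UTF-8, which is the usual layout but is not guaranteed. The class name `StrictDocxDetector` and the method name `looksStrict` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class StrictDocxDetector {

    // Namespace used by ISO/IEC 29500 Strict documents; Transitional documents use
    // http://schemas.openxmlformats.org/wordprocessingml/2006/main instead.
    static final String STRICT_NS = "http://purl.oclc.org/ooxml/wordprocessingml/main";

    /** Returns true if word/document.xml inside the DOCX mentions the Strict namespace. */
    static boolean looksStrict(byte[] docxBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    // readAllBytes stops at the end of the current ZIP entry.
                    String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                    return xml.contains(STRICT_NS);
                }
            }
            return false; // main part not found where this sketch expects it
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

For a file on disk, the bytes can be obtained with `Files.readAllBytes(Paths.get(path))` and passed to `looksStrict` before the document is sent for translation.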

You're right. The OOXML compliance setting was the issue; removing it worked. I will test a few more documents of a similar type, but I think I am good now. The issue is resolved. Thanks @alexey.noskov for all the help.
