Get embedded docx file inside docx file

Hi, I'm trying to extract an embedded DOCX file from a DOCX file.
The code below shows how I collect the shapes that hold the inner DOCX:

ArrayList<Shape> docxList = new ArrayList<>();
try
{
    // Get all shapes in the document
    NodeCollection<Shape> shapes = document.getChildNodes(NodeType.SHAPE, true);

    // Loop through the shapes
    for (Shape shape : shapes)
    {
        // Check if it's an OleObject
        if (shape.getOleFormat() != null)
        {
            // Check the file type and add to the corresponding list
            String progId = shape.getOleFormat().getProgId();
            if (progId.equals("Word.Document.12"))
            {
                docxList.add(shape);
            }
            else if (progId.equals("JPEG_PROGID"))
            { // Replace with actual ProgId for .jpeg
                jpegList.add(shape);
            }
            else if (progId.equals("PNG_PROGID"))
            { // Replace with actual ProgId for .png
                pngList.add(shape);
            }
        }
    }

    for (Shape docxShape : docxList)
    {
        try
        {
            trying_docx(document, analysis_data, docxShape);
        }
        catch (Exception e)
        {
            // Handle exceptions as needed
            e.printStackTrace();
        }
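    }
}
catch (Exception e)
{
    // Close the outer try block that was left open in the excerpt
    e.printStackTrace();
}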

After collecting the DOCX shapes, I pass each one to the function “trying_docx”. However, when I try to get the binary data of the shape, I get a java.lang.NullPointerException.

Here is the function:

public void trying_docx(Document document, JSONObject config, Shape oldShape) throws Exception {
    // Check if oldShape is null
    if (oldShape == null) {
        System.out.println("Old shape is null.");
        return;
    }

    // Load the embedded DOCX file from the shape if it has an OleFormat
    OleFormat oleFormat = oldShape.getOleFormat();
    if (oleFormat == null) {
        System.out.println("Shape does not have an OLE format.");
        return;
    }

    // Retrieve the embedded DOCX data
    String entryName = "\'x0001CompObj";
    byte[] embeddedDocxData = oleFormat.getOleEntry(entryName);

    // Handle the embedded DOCX data

    // Create a temporary file to hold the embedded DOCX data
    File tempFile = File.createTempFile("embedded_docx", ".docx");
    try (FileOutputStream fileOutputStream = new FileOutputStream(tempFile)) {
        fileOutputStream.write(embeddedDocxData);
    }

    // Sanitize the embedded DOCX file
    File sanitizedFile = recursive_disarmer(tempFile, config);

    // Create a new Shape with the sanitized DOCX content
    Shape newShape = new Shape(document, ShapeType.OLE_OBJECT);
    OleFormat newOleFormat = newShape.getOleFormat();
    newOleFormat.setProgId("Word.Document.12");
    newOleFormat.setSourceFullName(sanitizedFile.getAbsolutePath());

    // Set the dimensions and position of the new shape to match the old shape
    newShape.setWidth(oldShape.getWidth());
    newShape.setHeight(oldShape.getHeight());
    newShape.setLeft(oldShape.getLeft());
    newShape.setTop(oldShape.getTop());

    // Replace the old shape with the new shape
    Node parent = oldShape.getParentNode();
    if (parent instanceof CompositeNode) {
        ((CompositeNode<?>) parent).insertBefore(newShape, oldShape);
        oldShape.remove();
    } else {
        System.out.println("Parent node is not a CompositeNode.");
    }

    // Delete the temporary files
    tempFile.delete();
    sanitizedFile.delete();
}

Is there another way to get the shape's content as a DOCX file, or some other technique to get its binary data?
I know there is a way to get the binary data as an image, but that is not what I want.

@Gal10BS You should use the OleFormat.save method to get the embedded file from the OLE object. Please try the following code:

ByteArrayOutputStream tmpStream = new ByteArrayOutputStream();
oleFormat.save(tmpStream);
Document doc2 = new Document(new ByteArrayInputStream(tmpStream.toByteArray()));
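
Before handing the saved bytes to a Document constructor or a sanitizer, it can help to check that they are non-null and look like a DOCX package. A DOCX file is a ZIP container, so it normally starts with the ZIP local file header signature "PK\3\4". The sketch below uses only the Java standard library; the class and method names (DocxSignatureCheck, looksLikeZipPackage) are hypothetical and not part of Aspose.Words, and the check is a heuristic under that assumption, not a full validation:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aspose.Words: a cheap sanity check
// for bytes that are expected to hold an embedded DOCX package.
final class DocxSignatureCheck {

    // A DOCX file is a ZIP container, and a ZIP archive with at least one
    // entry begins with the local file header signature 'P' 'K' 0x03 0x04.
    static boolean looksLikeZipPackage(byte[] data) {
        // Null or too-short input is rejected instead of throwing.
        if (data == null || data.length < 4) {
            return false;
        }
        return data[0] == 0x50 && data[1] == 0x4B
                && data[2] == 0x03 && data[3] == 0x04;
    }

    public static void main(String[] args) {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] plainText = "not a docx".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikeZipPackage(zipLike));   // prints true
        System.out.println(looksLikeZipPackage(plainText)); // prints false
        System.out.println(looksLikeZipPackage(null));      // prints false
    }
}
```

With the code above, this would be called as looksLikeZipPackage(tmpStream.toByteArray()) before creating doc2, so that a shape with no usable embedded data is skipped rather than causing an exception later on.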

Thank you! You helped a lot.
