Get document details

Dear Team,

We need to get the equation count and image count for a particular Word document. How can we get these counts from the document?

For Example:
Equations in document : 20
Omath types in document : 24
Numbered Images in document : 5
Unnumbered images in document : 2

Please provide a solution for this using Aspose.Words for Java.
Input Document : inpt.zip (315.2 KB)

Thank you.

@ssvel

Thanks for your inquiry. The Shape class represents an object in the drawing layer, such as an AutoShape, textbox, freeform, OLE object, ActiveX control, or picture.

To identify an equation, you can use the Shape.OleFormat.ProgId property.

Please use the Document.getChildNodes() method as shown below to get the count of Shape nodes, which include images.

Document doc = new Document(MyDir + "input.doc");
System.out.print(doc.getChildNodes(NodeType.SHAPE, true).getCount());

Could you please explain what you mean by “Omath”, “Numbered Images” and “Un numbered” images?

@tahir.manzoor

Thanks for your comments. An image with a figure caption is a numbered image; an image without a figure caption is an unnumbered image.

We also need the count of all equations (OMath) in a document.

FYR : For your reference.zip (153.6 KB)

Please find the input document for your reference inpt.zip (315.2 KB)

Please provide a solution to get each count separately.

Thank you

@ssvel

Thanks for sharing the detail.

To count OLE equations, you can use the Shape.OleFormat.ProgId property as shown below.

Document doc = new Document(MyDir + "in.docx");
int count = 0;
NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>) shapes) {
    if (shape.getOleFormat() != null && shape.getOleFormat().getProgId().contains("Equation")) {
        count++;
    }
}

System.out.println(count);

To count numbered images, we suggest iterating over the paragraphs and getting their text using the Node.toString method to check whether it starts with “Fig”.

Document doc = new Document(MyDir + "in.docx");
int count = 0;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs) {
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")) {
        count++;
    }
}

System.out.println(count);
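
A plain prefix test on “Fig” also matches ordinary sentences such as “Figures are listed below”. The following is a minimal sketch of a stricter check, written in plain Java with no Aspose.Words dependency. The class name CaptionMatcher, the method isFigureCaption, and the caption pattern are illustrative assumptions, not part of the Aspose.Words API; the text passed in would be the value returned by paragraph.toString(SaveFormat.TEXT). Note that, unlike the prefix test, this sketch requires a number after “Fig” or “Figure”.

```java
import java.util.regex.Pattern;

class CaptionMatcher {
    // Assumed caption shape: "Fig" or "Figure", an optional dot,
    // optional whitespace, then at least one digit.
    private static final Pattern CAPTION = Pattern.compile(
            "^(Fig|Figure)\\.?\\s*\\d+.*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns true when the paragraph text looks like a numbered figure caption.
    static boolean isFigureCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        return CAPTION.matcher(paragraphText.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFigureCaption("Fig. 1 Absorption spectra"));
        System.out.println(isFigureCaption("Figures are listed below"));
    }
}
```

If the captions in your documents follow a different convention, the pattern would need to be adjusted accordingly.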

You can use the following code snippet to get the count of images whose parent paragraph’s next sibling does not start with “Fig”.

Document doc = new Document(MyDir + "in.docx");
int count = 0;
NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>) shapes) {
    Node node = shape.getParentParagraph().getNextSibling();
    
    if (!node.toString(SaveFormat.TEXT).trim().startsWith("Fig")) {
        count++;
    }
}

System.out.println(count);

@tahir.manzoor

Thanks for your valuable comments. It’s working fine for numbered and unnumbered images, but equations are not counted properly in some documents.

Please find the below input document for your reference. input.zip (3.0 MB)

Current Output :
Number of equations : 0
Numbered Image : 6

Thank you.

@ssvel

Thanks for your inquiry. The OfficeMath class represents an Office Math object such as a function, equation, matrix or the like. You can use the following code example to get the equation count.

Document doc = new Document(MyDir + "22_ICMEE18_P Sanjay .docm");
System.out.println(doc.getChildNodes(NodeType.OFFICE_MATH, true).getCount());

In your previous post, you shared a document that has OLE equations. You can use both code examples to get the count of equations.

@tahir.manzoor

Thanks for your reply. I need a clarification: the equation below is a single equation, but I’m getting an equation count of 7.

α = (2.303/t) log(1/T)

Please find the sample document for your reference: input.zip (2.7 MB). This document contains only two OMath equations, but when I run the code the output count is 11.

Please provide a solution for this scenario.

Thank you.

@ssvel

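Thanks for your inquiry. OfficeMath nodes are nested inside one another, so a single equation is made up of several OfficeMath nodes. Please use the following code example, which counts only the OfficeMath nodes that have no OfficeMath ancestor, to get the desired output.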

Document doc = new Document(MyDir + "3_ICMEE18_P. S. Rohitha .docm");
NodeCollection equations = doc.getChildNodes(NodeType.OFFICE_MATH, true);
int count = 0;
for (OfficeMath officeMath : (Iterable<OfficeMath>) equations) {
    // Count only top-level equations, not their nested parts.
    if (officeMath.getAncestor(NodeType.OFFICE_MATH) == null) {
        count++;
    }
}
System.out.println("officeMath count : "+count);
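
To illustrate why a deep traversal over-counts, here is a minimal sketch in plain Java. MathNode is a hypothetical stand-in for a nested math node, not an Aspose.Words type, and the 7-node shape of the sample equation is invented for illustration; the real nesting in your document may differ.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelCount {
    // Hypothetical stand-in for a nested math node; not an Aspose.Words type.
    static class MathNode {
        final MathNode parent;
        final List<MathNode> children = new ArrayList<>();

        MathNode(MathNode parent) {
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }
    }

    // Builds one equation as 7 nested nodes: a root, two parts, two leaves per part.
    static MathNode sampleEquation() {
        MathNode root = new MathNode(null);
        for (int i = 0; i < 2; i++) {
            MathNode part = new MathNode(root);
            new MathNode(part);
            new MathNode(part);
        }
        return root;
    }

    // Flattens a node and all of its descendants into one list (a deep traversal).
    static void collect(MathNode node, List<MathNode> out) {
        out.add(node);
        for (MathNode child : node.children) {
            collect(child, out);
        }
    }

    // Counts only the nodes that have no parent, i.e. the top-level equations.
    static int countTopLevel(List<MathNode> nodes) {
        int count = 0;
        for (MathNode node : nodes) {
            if (node.parent == null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<MathNode> all = new ArrayList<>();
        collect(sampleEquation(), all);
        System.out.println("all nodes: " + all.size());
        System.out.println("top-level: " + countTopLevel(all));
    }
}
```

In this sketch the deep traversal sees 7 nodes for one equation, while the top-level filter reports 1, which mirrors the ancestor check in the code above.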

@tahir.manzoor

Thank you for your support. It’s working fine now. For the unnumbered figure captions, I’m getting a null pointer exception. Please find the code below.

Document doc = new Document(MyDir + "in.docx");
int count = 0;
NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>) shapes) {
    Node node = shape.getParentParagraph().getNextSibling();

    if (node.toString(SaveFormat.TEXT).trim().startsWith("Fig")) {
        count++;
    }
}

System.out.println(count);

The expected output is the count of figures that do not have a figure caption.

The sample document for your reference. input.zip (773.4 KB)

@ssvel

Thanks for your inquiry. We have tested the scenario using the latest version of Aspose.Words for Java 18.11 and have not found the shared issue. Please upgrade to the latest version of Aspose.Words.

Moreover, you can modify the if condition as shown below to avoid the exception.

if (node != null && node.toString(SaveFormat.TEXT).trim().startsWith("Fig")) {
    count++;
}
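
For unnumbered images the condition is the inverse, and a missing next paragraph should count as “no caption”. The following is a minimal sketch of that logic in plain Java, without Aspose.Words. The class UncaptionedCounter and its method are hypothetical names; each list element stands for the text of the paragraph that follows one image, with null meaning there is no next sibling.

```java
import java.util.Arrays;
import java.util.List;

class UncaptionedCounter {
    // Counts images whose following paragraph is missing or does not start with "Fig".
    static int countUncaptioned(List<String> nextParagraphTexts) {
        int count = 0;
        for (String text : nextParagraphTexts) {
            if (text == null || !text.trim().startsWith("Fig")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample data: four images, two of them followed by a caption.
        List<String> following = Arrays.asList(
                "Fig. 1 Setup", null, "Results are shown above.", "Figure 2 Spectra");
        System.out.println(countUncaptioned(following)); // prints 2
    }
}
```

The same null-tolerant condition could be applied to the node returned by getNextSibling() in the loop over shapes.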

@tahir.manzoor

Thanks for your valuable comments. I’ve upgraded to Aspose.Words for Java 18.11, but we still didn’t get the proper output. We need a solution for the following:

  1. Get the total image count in the Word (input) document.
  2. Get the unnumbered (without figure captions) image count in the Word (input) document.

Note : We already get the proper result for the numbered (figure caption present) image count and the equation count.

Sample input document input.zip (807.5 KB)

Expected Output :
Total number of images in document : 6
Images without captions : 4

Thank you.

@ssvel

Thanks for your inquiry. The Shape class represents an object in the drawing layer, such as an AutoShape, textbox, freeform, OLE object, ActiveX control, or picture.

Could you please share the conditions you use to identify images and unnumbered images? Please also mark these shapes in your input document and share it here for our reference. We will then write the code according to the shared conditions and share it with you.

@tahir.manzoor

We have inserted comments marking the numbered and unnumbered images in the attached input document. We need to count picture shapes and drawing tool shapes, so the result should be the total count of images (picture tool images, drawing tool images, and table tool images).

Note : An image with a figure caption is a numbered image; an image without a figure caption is an unnumbered image.

Please find the attached input document and refer to the inserted comments.

Input : input.zip (812.0 KB)

I’d be grateful if you could provide a solution ASAP.

Thank you.

@ssvel

Thanks for sharing the details. You can achieve your requirement using the code example below. However, the Shape.ParentParagraph property returns a null value in this case, which is a bug. We have logged this problem in our issue tracking system as WORDSNET-17768. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

Document doc = new Document(MyDir + "in.docx");
int count = 0;
NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>) shapes) {
    Node node = shape.getParentParagraph().getNextSibling();
    
    if (!node.toString(SaveFormat.TEXT).trim().startsWith("Fig")) {
        count++;
    }
}

System.out.println(count);

@tahir.manzoor

Thanks for your support. We need one more clarification related to this. We already get the OMath/OLE equation counts; now we need to count picture tool type equations. I’ve attached a sample document and inserted a comment into it for your reference.

Input Document : 34.zip (396.2 KB)

Expected Output :

Picture tool eqn count : 1

If it is possible to get the count for this scenario, please provide a solution.

Thank you

@ssvel

Thanks for your inquiry. Please use the LoadOptions.ConvertShapeToOfficeMath property as shown below to get the desired output. This property gets or sets whether to convert shapes with EquationXML to Office Math objects.

LoadOptions options = new LoadOptions();
options.setConvertShapeToOfficeMath(true);
Document doc = new Document(MyDir + "34.doc", options);

NodeCollection equations = doc.getChildNodes(NodeType.OFFICE_MATH, true);
int count = 0;
for (OfficeMath officeMath : (Iterable<OfficeMath>) equations) {
    // Count only top-level equations, not their nested parts.
    if (officeMath.getAncestor(NodeType.OFFICE_MATH) == null) {
        count++;
    }
}
System.out.println("officeMath count : "+count);