Extraction issue using Java

e503824 · April 4, 2022, 4:41am

Dear team,

We are facing an image extraction issue with the input document below.

Input : Figures.docx (899.2 KB)

We are using the Java conditions below for image extraction:

for (Paragraph paragraph : (Iterable<Paragraph>)paragraphs)
{
    System.out.println("paragraph  6:" + paragraph.getText().toString());
    try
    {
        if ((paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
                || paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
                || paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
                || paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
                || paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung")
                    && paragraph.getNodeType() != NodeType.TABLE)
                //						//changes by pavi -starts check sample  D:\testing\AIE\Iteration 16_4 points\Document contains Duplicate figure captions\Revised-MANUSCRIPT
                && ((paragraph.getNextSibling() != null
                && paragraph.getNextSibling().getNodeType() != NodeType.TABLE)
                || paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
                        .matches(matches))

                //	&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
                //changes by pavi -end 
                && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
                && !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
                && !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)//duplicate caption by pavi
                && !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures") ||
                    !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))))
        {
            System.out.println("para  :" + paragraph.getText().toString());
            // supplymentry check sample: JCIS_SRE_2020_1_2nd_revision.docx
            if (AIE.supplymentryCheck(paragraph.toString(SaveFormat.TEXT).trim()))
            {
                AIE.insertBookmark(interimdoc, paragraph, AIE.fileName);
                continue;
            }

Please do the needful.

alexey.noskov · April 4, 2022, 5:26am

@e503824 You can use the same approach that was suggested in your other thread. The only thing you need to change is the pattern:

Pattern pattern = Pattern.compile("(Fig\\.\\s*\\d+)");
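
To see what this pattern does and does not match, here is a minimal sketch using only java.util.regex. The class name FigPatternDemo, the helper findLabel, and the sample strings are made up for illustration and are not taken from your document:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FigPatternDemo {
    // The pattern from this post: "Fig.", optional whitespace, then one or more digits.
    static final Pattern PATTERN = Pattern.compile("(Fig\\.\\s*\\d+)");

    // Hypothetical helper: returns the first "Fig. N" label found in the text, or null.
    static String findLabel(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findLabel("Fig. 3 Cross-section of the sample")); // Fig. 3
        System.out.println(findLabel("Fig.12 Overview"));                    // Fig.12
        System.out.println(findLabel("Figure 3 Cross-section"));             // null
    }
}
```

Note that the pattern requires the literal dot after "Fig", so captions written as "Figure 3", "Scheme 1" or "Abb. 2" would need their own alternatives in the pattern.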

e503824 · April 4, 2022, 6:27am

Hi team,

Please provide a condition in the style of my source code, so that I can update it.

alexey.noskov · April 4, 2022, 7:16am

@e503824 An if statement with 14 conditions is likely to be error prone and quite difficult to maintain. I would not suggest using this approach, which is why I suggested the approach with IReplacingCallback.
If the goal of your code is to extract figures with their captions, another possible approach would be to search for the figures first and then for the captions:

Document doc = new Document("C:\\Temp\\in.docx");

// Create a destination document.
Document dst = (Document)doc.deepClone(false);
dst.ensureMinimum();

// Get shapes in the document.
Iterable<Shape> shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape s :shapes)
{
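    // Get shape parent paragraph (skip the shape if it has none).
    Paragraph shapePara = s.getParentParagraph();
    if (shapePara == null)
        continue;

    // Get next sibling and check whether it is a caption.
    // The next sibling can be a table, so check the type before casting.
    Node nextNode = shapePara.getNextSibling();

    if (nextNode instanceof Paragraph)
    {
        Paragraph nextPara = (Paragraph)nextNode;
        String nextParaText = nextPara.toString(SaveFormat.TEXT).trim();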
        if (nextParaText.startsWith("Fig."))
        {
            dst.getFirstSection().getBody().appendChild(dst.importNode(shapePara, true, ImportFormatMode.USE_DESTINATION_STYLES));
            dst.getFirstSection().getBody().appendChild(dst.importNode(nextPara, true, ImportFormatMode.USE_DESTINATION_STYLES));
        }
    }
}

dst.save("C:\\Temp\\out.pdf");

e503824 · April 4, 2022, 12:12pm

Dear team,

I'm handling multiple scenarios in the same Java code for image extraction, and only this scenario is not working for me. Please help me solve this issue.

alexey.noskov · April 4, 2022, 4:03pm

@e503824 Have you tried using the suggested approaches? Both of them work with your document and are much easier to debug and maintain than an if statement with 14 conditions.
Also, I cannot test your code because the matches variable is missing from the provided snippet.
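
As an illustration of why such a long condition is hard to debug, consider only its last clause, !(startsWith("Figures") || !(startsWith("Figure Captions"))). By De Morgan's law this is the same as "does not start with Figures AND starts with Figure Captions", so it is false for an ordinary caption. The sketch below reproduces that clause on plain strings. The class and method names are hypothetical, and presumablyIntended is only my guess at what the clause was meant to do:

```java
class LastClauseCheck {
    // The final clause exactly as written in the posted condition.
    static boolean asWritten(String text) {
        return !(text.startsWith("Figures") || !(text.startsWith("Figure Captions")));
    }

    // Assumption: the intent was to reject both headings.
    static boolean presumablyIntended(String text) {
        return !text.startsWith("Figures") && !text.startsWith("Figure Captions");
    }

    public static void main(String[] args) {
        String caption = "Fig. 1. Sample caption"; // made-up caption text
        System.out.println(asWritten(caption));          // false
        System.out.println(presumablyIntended(caption)); // true
    }
}
```

If the guess about the intent is right, this clause alone is enough to stop every regular "Fig. ..." paragraph from entering the if block.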

e503824 · April 5, 2022, 6:16am

Yes, I have tried them, but then a few other scenarios stop working. We are facing the extraction issue only for the given file. Is it possible to provide a single condition for this issue?

e503824 · April 5, 2022, 6:18am

In this case the document contains duplicate figure caption names. When I remove the duplicate figure captions, the images are extracted.

alexey.noskov · April 5, 2022, 9:51am

@e503824 Could you please provide a compilable code example or a simple application that will allow us to debug your code? We will check the issue and provide you with more information.
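
In the meantime, if duplicate caption labels are what breaks the condition, it may help to locate them first. The following is only a sketch under the assumption that the paragraph texts have already been collected into a list of strings (for example with paragraph.toString(SaveFormat.TEXT)). The class DuplicateCaptionFinder and its method are hypothetical names, and only the "Fig. N" label form is handled:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateCaptionFinder {
    // Same label pattern as suggested earlier in this thread.
    private static final Pattern LABEL = Pattern.compile("(Fig\\.\\s*\\d+)");

    // Returns the labels that start more than one paragraph, in order of first appearance.
    static List<String> findDuplicateLabels(List<String> paragraphTexts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String text : paragraphTexts) {
            Matcher m = LABEL.matcher(text.trim());
            // lookingAt() only accepts a label at the very start of the paragraph.
            if (m.lookingAt()) {
                // Drop whitespace so that "Fig.1" and "Fig. 1" count as the same label.
                String label = m.group(1).replaceAll("\\s+", "");
                counts.merge(label, 1, Integer::sum);
            }
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }
}
```

Calling findDuplicateLabels on the texts of all body paragraphs gives the labels that occur more than once, which can then be handled separately from the main caption check.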

e503824 · April 6, 2022, 4:21am

Dear team,

We are facing an image extraction issue with the input document below.

Input document: Figures.docx (901.5 KB)

We are using the conditions below:

if ((paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
    || paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
    || paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
    || paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
    || paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung")
            && paragraph.getNodeType() != NodeType.TABLE)
    //						//changes by pavi -starts check sample  D:\testing\AIE\Iteration 16_4 points\Document contains Duplicate figure captions\Revised-MANUSCRIPT
    && ((paragraph.getNextSibling() != null
    && paragraph.getNextSibling().getNodeType() != NodeType.TABLE)
    || paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
            .matches(matches))
    //	&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
    //changes by pavi -end 
    && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
    && !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
    && !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)//duplicate caption by pavi
    && !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions")) ||
        !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures")))
{

Please do the needful.

alexey.noskov · April 6, 2022, 5:58am

@e503824 I have investigated your conditions once again and still think they are hard to maintain. You can refactor them to make them more readable and easier to debug. The following condition looks suspicious:

paragraph.getParentSection().getBody().getFirstParagraph().getText().trim().matches(matches)

It always checks the first paragraph of the section. Is this intended?

For example, the condition can be split into named checks like this:

String paraText = paragraph.toString(SaveFormat.TEXT).trim();

Node nextSibling = paragraph.getNextSibling();
Boolean nextSiblingIsNotTable = (nextSibling != null && nextSibling.getNodeType() != NodeType.TABLE);

Boolean paraHasShape = paragraph.getChildNodes(NodeType.SHAPE, true).getCount() != 0;

Node prevSibling = paragraph.getPreviousSibling();
Boolean previousSiblingHasShape = (prevSibling != null)
        && (prevSibling.getNodeType() == NodeType.PARAGRAPH)
        && ((Paragraph)prevSibling).getChildNodes(NodeType.SHAPE, true).getCount() != 0;

if (likelyCaption(paragraph)
        && !likelyCaption(paragraph.getNextSibling())
        && nextSiblingIsNotTable
        && !paraText.contains(AIE.docName)
        && !paraText.startsWith("Figure Captions")
        && !paraText.startsWith("Figures")
        && (previousSiblingHasShape || paraHasShape))
{
    System.out.println(paraText);
}

private static Boolean likelyCaption(Node node) throws Exception
{
    if(node == null)
        return false;

    String paraText = node.toString(SaveFormat.TEXT).trim();
    Boolean startsWithCaption = paraText.startsWith("Fig")
            || paraText.startsWith("Scheme")
            || paraText.startsWith("Plate")
            || paraText.startsWith("Abb")
            || paraText.startsWith("Abbildung");

    return startsWithCaption;
}
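
The text part of this check can also be tested without loading a document. Below is a minimal sketch that works on plain strings. The class name CaptionPrefixes and the list-based layout are my own choice and not part of the Aspose.Words API, and the sample inputs in the usage note are made up. "Abbildung" is left out of the list because any text starting with it also starts with "Abb":

```java
import java.util.Arrays;
import java.util.List;

class CaptionPrefixes {
    // Caption prefixes taken from the condition in this thread.
    private static final List<String> PREFIXES = Arrays.asList("Fig", "Scheme", "Plate", "Abb");

    // Headings that start with "Fig" but are not captions.
    private static final List<String> HEADINGS = Arrays.asList("Figure Captions", "Figures");

    // Returns true if the trimmed text starts with a caption prefix and is not a heading.
    static boolean likelyCaption(String paraText) {
        if (paraText == null) {
            return false;
        }
        String text = paraText.trim();
        for (String heading : HEADINGS) {
            if (text.startsWith(heading)) {
                return false;
            }
        }
        for (String prefix : PREFIXES) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With this layout, likelyCaption("  Fig. 1 Sample") returns true and likelyCaption("Figure Captions") returns false, and a new prefix or heading only needs one more list entry instead of one more clause in the if statement.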