Extract part figures

ssvel · October 9, 2018, 12:13pm

Dear Team,

We need to extract part figures from a Word document. Please find the attached input and expected output documents.

Input : input.zip (28.5 KB)

Current Output : Current OP.zip (28.4 KB)

Expected Output : Expected OP.zip (34.8 KB)

Please provide a solution for the above scenario.

Thank you.

tahir.manzoor · October 9, 2018, 6:08pm

@ssvel

Thanks for your inquiry. Please use the following code example to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "PartfigProblem.docx");
doc.updateListLabels();
int i = 1;
ArrayList<Node> nodes = new ArrayList<Node>();

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    // A paragraph whose text starts with "Figure" is treated as a figure caption.
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure"))
    {
        // Walk backwards and collect the text-free paragraphs above the caption
        // (in this scenario, the paragraphs that hold the figure shapes).
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0)
        {
            nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        if (nodes.size() > 0)
        {
            // The nodes were collected bottom-up; restore document order.
            Collections.reverse(nodes);

            // Import the collected paragraphs into a new document.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Node para : nodes)
            {
                Node newNode = importer.importNode(para, true);
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }
            // Remove the empty first paragraph that a new Document starts with.
            if (dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
                dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
            nodes.clear();
        }
    }
}

ssvel · October 11, 2018, 6:19am

@tahir.manzoor

Thanks for your quick update. It's working fine for some documents, but in other documents the part figures are not extracted. I have attached one such document for your reference. Please take a look and suggest a solution for this scenario.

Input : Revised_Manuscript.zip (4.0 MB)

Expected OP : Expected OP.zip (2.0 MB)

Thank you.

tahir.manzoor · October 11, 2018, 1:49pm

@ssvel

Thanks for your inquiry. We suggest you read the following articles:
Aspose.Words Document Object Model
Extract Selected Content Between Nodes

You can use the same approach shared in my previous post to get the desired output. In your case, you need to iterate over the paragraphs and, for each paragraph whose text starts with “Figure”, get its previous sibling nodes. Here is a description of the API used in the code example.

The Node.toString(SaveFormat.TEXT) method returns the text of a node.
The Node.PreviousSibling property (getPreviousSibling() in Java) returns the node immediately preceding this node.
The NodeImporter class efficiently performs repeated imports of nodes from one document to another. The NodeImporter.ImportNode method (importNode() in Java) imports a node from one document into another.

Please also check the Aspose.Words for Java API Reference.
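
To make the backward walk described above more concrete, here is a minimal, library-independent sketch. It does not use Aspose.Words at all: paragraphs are modelled with a hypothetical Para class (text plus a has-shape flag), and FigureWalk and collectFigureRuns are illustrative names, not part of any API. With the real library, the same loop would presumably rely on getPreviousSibling() and toString(SaveFormat.TEXT) as in the earlier example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Library-independent model of the backward walk. Para and FigureWalk are
// hypothetical names used for illustration; they are not Aspose.Words types.
final class FigureWalk {

    /** Stand-in for a paragraph: its visible text and whether it holds a shape. */
    static final class Para {
        final String text;
        final boolean hasShape;

        Para(String text, boolean hasShape) {
            this.text = text;
            this.hasShape = hasShape;
        }
    }

    /**
     * For every caption paragraph (text starting with captionPrefix), returns the
     * run of text-free, shape-bearing paragraphs directly above it, in document
     * order. Captions without such a run are skipped.
     */
    static List<List<Para>> collectFigureRuns(List<Para> body, String captionPrefix) {
        List<List<Para>> runs = new ArrayList<>();
        for (int i = 0; i < body.size(); i++) {
            if (!body.get(i).text.trim().startsWith(captionPrefix)) {
                continue;
            }
            List<Para> run = new ArrayList<>();
            // Walk upward while the previous paragraph is an image-only paragraph.
            for (int j = i - 1; j >= 0; j--) {
                Para prev = body.get(j);
                if (!prev.text.trim().isEmpty() || !prev.hasShape) {
                    break;
                }
                run.add(prev);
            }
            if (!run.isEmpty()) {
                Collections.reverse(run); // collected bottom-up, restore document order
                runs.add(run);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Para> body = Arrays.asList(
                new Para("Intro text", false),
                new Para("", true),
                new Para("", true),
                new Para("Figure 1. Two-part figure", false),
                new Para("Figure 2. Caption without images", false));
        System.out.println(collectFigureRuns(body, "Fig").size() + " figure run(s) found");
    }
}
```

Each returned run corresponds to the set of paragraphs that would be imported into one output document.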

ssvel · October 12, 2018, 12:27pm

@tahir.manzoor

Thanks for your support. Please share some sample code for this scenario.

tahir.manzoor · October 12, 2018, 6:45pm

@ssvel

Thanks for your inquiry. Please give us some time. We will write the code example for your scenario and share it here soon.

tahir.manzoor · October 15, 2018, 8:19am

@ssvel

Please use the following code example to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "Revised_Manuscript.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

UseCase3(doc, builder);
ExtractImages(doc, "uc1");

UseCase4(doc, builder);
ExtractImages(doc, "uc2");

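// Use case 3: for each caption paragraph (text starting with "Fig"), finds the run of
// text-free paragraphs containing shapes directly above it and marks the region,
// ending at the caption paragraph, with a bookmark.
public static void UseCase3(Document doc, DocumentBuilder builder) throws Exception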
{
    int bookmark = 1;
    int i = 1;
    NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
    for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
    {
        if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
        {
            Boolean bln = false;
            Node PreviousPara = paragraph.getPreviousSibling();
            while (PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH
                    && PreviousPara.toString(SaveFormat.TEXT).trim().length() == 0
                    && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
            {
                PreviousPara = PreviousPara.getPreviousSibling();
                bln = true;
            }

            if(!bln)
                continue;

            if(PreviousPara == null)
            {
                builder.moveToDocumentStart();
                builder.insertParagraph();
                builder.startBookmark("Bookmark" + bookmark);
                builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
                builder.endBookmark("Bookmark" + bookmark);
                bookmark++;
            }
            else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
            {
                Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
                builder.moveTo(node);
                builder.startBookmark("BookmarkUC1" + bookmark);
                builder.moveTo(paragraph);
                builder.endBookmark("BookmarkUC1" + bookmark);
                bookmark++;
            }
        }
    }
}

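// Use case 4: like UseCase3, but the paragraphs above the caption may be empty or
// contain sub-figure labels such as "(a)" or "(b)"; they are not required to contain a shape.
public static void UseCase4(Document doc, DocumentBuilder builder) throws Exception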
{
    int bookmark = 1;
    int i = 1;
    NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
    for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
    {
        if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
        {
            Boolean bln = false;
            Node PreviousPara = paragraph.getPreviousSibling();
            while (PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH  &&
                    (PreviousPara.toString(SaveFormat.TEXT).trim().length() == 0 ||
                            (
                                    PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                                            PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") ||
                                            PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)") ||
                                            PreviousPara.toString(SaveFormat.TEXT).trim().contains("(d)"))
                      )
                    )
            {
                if(PreviousPara.toString(SaveFormat.TEXT).trim().contains("Fig") == true)
                    break;
                System.out.println(PreviousPara.getText());
                PreviousPara = PreviousPara.getPreviousSibling();
                bln = true;
            }

            if(!bln)
                continue;

            if(PreviousPara == null)
            {
                builder.moveToDocumentStart();
                builder.insertParagraph();
                builder.startBookmark("Bookmark" + bookmark);
                //builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
                System.out.println(paragraph.getText());
                builder.moveTo(paragraph);
                builder.writeln(">>>");
                builder.endBookmark("Bookmark" + bookmark);
                bookmark++;
            }
            else
            if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
            {
                Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
                builder.moveTo(node);
                builder.startBookmark("BookmarkUC1" + bookmark);
                builder.moveTo(paragraph);
                builder.endBookmark("BookmarkUC1" + bookmark);
                bookmark++;
            }
        }
    }
}

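// Saves the content of every bookmark whose name starts with "Bookmark" as a separate
// document named after the first characters of the caption, then replaces the bookmarked
// content in the source document with a <Fig>...</Fig> marker.
// ExtractContents is not defined in this post; it appears to be the helper class from the
// "Extract Selected Content Between Nodes" article. The uc parameter is currently unused.
public static void ExtractImages(Document doc, String uc) throws Exception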
{
    int i = 1;
    for (Bookmark bm : doc.getRange().getBookmarks())
    {
        if(bm.getName().startsWith("Bookmark"))
        {
            ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
            Document dstDoc = ExtractContents.generateDocument(doc, nodes);

            PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
            dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
            dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
            dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

            dstDoc.updatePageLayout();
            if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
                dstDoc.getLastSection().getBody().getLastParagraph().remove();

            dstDoc.updatePageLayout();
            while(dstDoc.getFirstSection().getBody().getFirstParagraph()!= null && dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
                dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

            dstDoc.updatePageLayout();
            if(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() > 0)
            {
                String filename = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
                if(filename.trim().length() > 0)
                    dstDoc.save(MyDir + filename.substring(0, 7) + "_out.docx");
                i++;
            }

        }
    }

    for (Bookmark bm : doc.getRange().getBookmarks()) {
        if (bm.getName().startsWith("Bookmark")) {
            bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
        }
    }
    doc.updatePageLayout();
    for (Bookmark bm : doc.getRange().getBookmarks()) {

        if (bm.getName().startsWith("Bookmark")) {
            bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
            String figText = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
            if(figText.trim().length() > 0)
                bm.setText("<Fig>"+figText.trim().substring(0, 7)+"</Fig>" + ControlChar.PARAGRAPH_BREAK);
        }
    }
}
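
As a side note, two small pieces of the code above can be written more defensively. The sketch below is plain Java and independent of Aspose.Words. CaptionHelpers, hasSubFigureLabel and captionPrefix are hypothetical names introduced for illustration. It assumes sub-figure labels are single lowercase letters from a to d in parentheses, mirroring the labels checked in UseCase4, and it shows a caption prefix that does not fail when the caption is shorter than the requested length (substring(0, 7) throws for text shorter than 7 characters).

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Aspose.Words.
final class CaptionHelpers {

    // Matches "(a)", "(b)", "(c)" or "(d)" anywhere in the text.
    // Assumption: labels are limited to a-d; widen the character class if needed.
    private static final Pattern SUB_LABEL = Pattern.compile("\\([a-d]\\)");

    /** Returns true if the text contains a sub-figure label such as "(a)". */
    static boolean hasSubFigureLabel(String text) {
        return SUB_LABEL.matcher(text).find();
    }

    /** Returns at most maxLen leading characters of the trimmed caption. */
    static String captionPrefix(String caption, int maxLen) {
        String trimmed = caption.trim();
        return trimmed.substring(0, Math.min(maxLen, trimmed.length()));
    }

    public static void main(String[] args) {
        System.out.println(hasSubFigureLabel("(a) Top view"));
        System.out.println(captionPrefix("  Fig. 12 Results ", 7) + "_out.docx");
    }
}
```

Note that a pattern like this also matches text such as "see (b) below", so it is only as reliable as the chained contains() checks it replaces.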

ssvel · October 23, 2018, 6:18am

@tahir.manzoor

Thanks for your support. I will check and update. Before that, I used the previous extraction code, but it extracts some empty pages and paragraphs. Please provide a solution for this scenario.

Code : Part.zip (845 Bytes)

Input : input.zip (364.0 KB)

Current OP : Current_Output.zip (274.3 KB)

We need to remove empty pages and empty paragraphs in extracted output.

Thank you.

tahir.manzoor · October 23, 2018, 3:39pm

@ssvel

Thanks for your inquiry. Please check the following IF condition in the code example shared in my previous post. Please use it in your code to avoid empty documents.

if(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() > 0)
{
}
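
The same idea can be extended to empty paragraphs at the start and end of the extracted content. The following is only a rough, library-independent sketch: EmptyParagraphTrim and trimEmptyEnds are hypothetical names, and paragraphs are modelled as plain strings rather than Aspose.Words nodes. With the real API, the "is empty" test would presumably be the shape-count check shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration; not part of Aspose.Words.
final class EmptyParagraphTrim {

    /**
     * Returns a copy of the list without the leading and trailing elements
     * that satisfy isEmpty. Elements in the middle are kept untouched.
     */
    static <T> List<T> trimEmptyEnds(List<T> paragraphs, Predicate<T> isEmpty) {
        int start = 0;
        int end = paragraphs.size();
        while (start < end && isEmpty.test(paragraphs.get(start))) {
            start++;
        }
        while (end > start && isEmpty.test(paragraphs.get(end - 1))) {
            end--;
        }
        return new ArrayList<>(paragraphs.subList(start, end));
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList("", "[image]", "", "[image]", " ");
        List<String> kept = trimEmptyEnds(extracted, s -> s.trim().isEmpty());
        if (kept.isEmpty()) {
            System.out.println("nothing to save"); // skip saving an empty document
        } else {
            System.out.println("saving " + kept.size() + " paragraph(s)");
        }
    }
}
```

If nothing remains after trimming, the output document can be skipped entirely, which is what the IF condition achieves.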