Remove all blank lines from word document

priyadharshini · June 21, 2017, 11:43am

Problem.zip (462.4 KB)
Hi team,

I need a workaround to remove the blank lines between images and image captions in the Word document, and to remove blank lines from the entire document. I tried removing empty paragraphs, but that did not clear the lines. Due to time constraints, I need a solution as soon as possible.

Regards
Priya Dharshini J P

tahir.manzoor · June 21, 2017, 1:26pm

Hi Priya,

Thanks for your inquiry. In your expected output document, you are removing empty paragraphs and joining two paragraphs. The first paragraph contains the Shape (image) nodes and the second contains text that starts with “Figure”. Please use the following code example to get the desired output. Hope this helps you.

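// Assumes: import com.aspose.words.*; import java.util.ArrayList;
Document doc = new Document(MyDir + "Problem.docx");

ArrayList<Paragraph> nodes = new ArrayList<>();
// Iterate over a snapshot (toArray) because paragraphs are removed inside the loop.
for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray())
{
    Paragraph paragraph = (Paragraph) node;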
    if(paragraph.toString(SaveFormat.TEXT).trim().length() == 0 && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
    {
        paragraph.remove();
    }
    // Image-only paragraph: no text and at least one shape.
    else if(paragraph.toString(SaveFormat.TEXT).trim().length() == 0 && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0 )
    {
        nodes.add((paragraph));
    }
}

for (Paragraph  paragraph : (Iterable<Paragraph>) nodes)
{
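    // The image paragraph may be the last node or be followed by a table, so check the type before casting.
    Node next = paragraph.getNextSibling();
    if (!(next instanceof Paragraph)) continue;
    Paragraph nextPara = (Paragraph) next;
    if(nextPara.toString(SaveFormat.TEXT).trim().startsWith("Figure"))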
    {
        // Move all content from the nextPara paragraph into the first.
        while (nextPara.hasChildNodes())
            paragraph.appendChild(nextPara.getFirstChild());

        nextPara.remove();
    }
}
doc.save(MyDir + "output.docx");
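
As an illustration only, the two passes in the code above can be modelled in plain Java without Aspose.Words. The `CaptionMerge` class, its nested `Para` type, and the `tidy` method are hypothetical stand-ins invented for this sketch; they are not part of the library. The sketch assumes an image paragraph is one with no text and at least one image.

```java
import java.util.ArrayList;
import java.util.List;

final class CaptionMerge {
    /** Simplified stand-in for a paragraph; not an Aspose.Words type. */
    static final class Para {
        final String text;  // visible text, may be blank
        final int images;   // number of inline images

        Para(String text, int images) {
            this.text = text;
            this.images = images;
        }
    }

    /** Drops blank paragraphs and folds each "Figure" caption into the image paragraph before it. */
    static List<Para> tidy(List<Para> input) {
        // Pass 1: keep only paragraphs that carry text or at least one image.
        List<Para> kept = new ArrayList<>();
        for (Para p : input) {
            if (!p.text.trim().isEmpty() || p.images > 0) {
                kept.add(p);
            }
        }

        // Pass 2: merge an image-only paragraph with the caption that follows it.
        List<Para> out = new ArrayList<>();
        for (int i = 0; i < kept.size(); i++) {
            Para p = kept.get(i);
            boolean imageOnly = p.text.trim().isEmpty() && p.images > 0;
            boolean captionNext = i + 1 < kept.size()
                    && kept.get(i + 1).text.trim().startsWith("Figure");
            if (imageOnly && captionNext) {
                Para caption = kept.get(i + 1);
                out.add(new Para(caption.text.trim(), p.images + caption.images));
                i++; // the caption has been consumed
            } else {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Para> doc = new ArrayList<>();
        doc.add(new Para("Intro", 0));
        doc.add(new Para("", 0));
        doc.add(new Para("", 1));
        doc.add(new Para("  ", 0));
        doc.add(new Para("Figure 1: Sample", 0));
        for (Para p : tidy(doc)) {
            System.out.println(p.images + " image(s): " + p.text);
        }
    }
}
```

Blank paragraphs are dropped first so that an image paragraph and its caption become direct neighbours before the merge pass runs, which is the same order the Aspose code relies on.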

priyadharshini · June 22, 2017, 12:34am

But in the case of consecutive images, the space between them is not removed. Can you help me remove the space between grouped images in that document?

priyadharshini · June 22, 2017, 5:12am

We have an additional requirement: removing the blank space between the images. Due to time constraints, we need a reply as soon as possible.

tahir.manzoor · June 22, 2017, 12:25pm

Hi Priya,

Thanks for your inquiry.

Could you please share screenshots of the problematic sections of the output document? We will investigate this issue and provide you with more information.

Regarding your additional requirement to remove the blank space between the images, please share screenshots of this requirement along with the expected output document. We will then provide you with more information along with code.

Best Regards,
Tahir Manzoor

priyadharshini · June 22, 2017, 2:58pm

problem.zip (50.0 KB)
I have attached an example with consecutive images. Please help me extract all images up to the image caption, which is the text starting with “Fig”.

tahir.manzoor · June 22, 2017, 6:43pm

@priyadharshini

Thanks for sharing your requirement in detail. Please allow us some time to analyze your desired output. We will get back to you soon with a code example that meets your requirement.

Best Regards,
Tahir Manzoor

tahir.manzoor · June 23, 2017, 4:34pm

@priyadharshini

Thanks for your patience. Please use the following code example to achieve your requirement. Hope this helps you.

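// Assumes: import com.aspose.words.*; import java.util.ArrayList;
Document doc = new Document(MyDir + "Problem.docx");

ArrayList<Paragraph> nodes = new ArrayList<>();
// Iterate over a snapshot (toArray) because paragraphs are removed inside the loop.
for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray())
{
    Paragraph paragraph = (Paragraph) node;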
    // Remove paragraphs with no text and no shapes, but keep those that carry a page break.
    if(paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
            && !paragraph.getText().contains(ControlChar.PAGE_BREAK))
    {
        paragraph.remove();
    }
    // Image-only paragraph: no text and at least one shape.
    else if(paragraph.toString(SaveFormat.TEXT).trim().length() == 0 && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0 )
    {
        nodes.add((paragraph));
    }
}

for (Paragraph  paragraph : (Iterable<Paragraph>) nodes)
{
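    // The image paragraph may be the last node or be followed by a table, so check the type before casting.
    Node next = paragraph.getNextSibling();
    if (!(next instanceof Paragraph)) continue;
    Paragraph nextPara = (Paragraph) next;
    if(nextPara.toString(SaveFormat.TEXT).trim().startsWith("Figure"))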
    {
        // Move all content from the nextPara paragraph into the first.
        while (nextPara.hasChildNodes())
            paragraph.appendChild(nextPara.getFirstChild());

        nextPara.remove();

        // Shrink the paragraph-break font of the preceding image-only paragraphs
        // so that the gap between consecutive images collapses.
        Node previous = paragraph.getPreviousSibling();
        while (previous instanceof Paragraph
                && previous.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) previous).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            ((Paragraph) previous).getParagraphBreakFont().setSize(0.5);
            previous = previous.getPreviousSibling();
        }
    }
}
doc.save(MyDir + "output.docx");

priyanga · June 24, 2017, 5:13am

InputDocument.zip (1.2 MB)
ExpectedOutput.zip (1014.1 KB)
Hi team,
Thanks for your reply. I am using the code above, but I am still not able to get the expected output. I have attached the input document and the expected output document. I will wait for your reply. Please use the latest expected output document, ExpectedOutput.zip (1.3 MB), for reference.

tahir.manzoor · June 24, 2017, 8:13pm

@priyanga

Thanks for your inquiry. In this case, we suggest the following solution. Hope this helps you.

Document doc = new Document(MyDir + "test (4).DOC");
RemoveSectionBreaks(doc);

int i = 1;
DocumentBuilder builder = new DocumentBuilder(doc);

//Remove empty paragraphs
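// Iterate over a snapshot (toArray) because paragraphs are removed inside the loop.
for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray()) {
    Paragraph paragraph = (Paragraph) node;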
    if (paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0) {
        paragraph.remove();
    }
}
doc.updatePageLayout();

boolean hasImage = false;
//Get the paragraphs that start with "Fig".
for (Paragraph  paragraph : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
{
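    // startsWith matches only captions; contains would also match body text that merely mentions "Fig".
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))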
    {
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph)previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            previousPara = previousPara.getPreviousSibling();
            hasImage = true;
        }

        if(hasImage && previousPara != null)
        {
            builder.moveTo(((CompositeNode)previousPara).getFirstChild());
            builder.startBookmark("Bookmark"+i);
            builder.endBookmark("Bookmark"+i);

            builder.moveTo(paragraph.getRuns().get(0));
            builder.startBookmark("FigBookmark"+i);
            builder.endBookmark("FigBookmark"+i);
            i++;
        }
        hasImage = false;
    }
}
for(int b = 1 ; b < i ; b++)
{
    Node start = doc.getRange().getBookmarks().get("Bookmark" + b).getBookmarkStart();
    Node end = doc.getRange().getBookmarks().get("FigBookmark" + b).getBookmarkEnd();
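    // ExtractContents is the helper class from the Aspose.Words documentation example on
    // extracting content between nodes; it is not part of the library API.
    ArrayList images =  ExtractContents.extractContent(start, end, false);
    Document dstDoc = ExtractContents.generateDocument(doc, images);
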
    if(dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().length() > 0)
        for (Run  run : (Iterable<Run>)dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.RUN, true))
        {
            run.setText("");
        }

    dstDoc.getRange().replace(ControlChar.PAGE_BREAK, "", new FindReplaceOptions());

    dstDoc.save(MyDir + "Fig_"+ b + ".docx");
} 

private static void RemoveSectionBreaks(Document doc)
{
    // Loop through all sections starting from the section that precedes the last one
    // and moving to the first section.
    for (int i = doc.getSections().getCount() - 2; i >= 0; i--)
    {
        // Copy the content of the current section to the beginning of the last section.
        doc.getLastSection().prependContent(doc.getSections().get((i)));
        // Remove the copied section.
        doc.getSections().get(i).remove();
    }
}
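
As a rough sketch of the grouping step in the solution above, the backward walk from each caption over the preceding image-only paragraphs can be modelled in plain Java. The `FigureGroups` class, its nested `Para` type, and the `findGroups` method are hypothetical names made up for this illustration and do not exist in Aspose.Words. The sketch assumes a caption is any paragraph whose trimmed text starts with "Fig".

```java
import java.util.ArrayList;
import java.util.List;

final class FigureGroups {
    /** Simplified stand-in for a paragraph; not an Aspose.Words type. */
    static final class Para {
        final String text;
        final int images;

        Para(String text, int images) {
            this.text = text;
            this.images = images;
        }

        boolean imageOnly() {
            return text.trim().isEmpty() && images > 0;
        }
    }

    /** For each caption preceded by images, returns the index pair {firstImage, caption}. */
    static List<int[]> findGroups(List<Para> paras) {
        List<int[]> groups = new ArrayList<>();
        for (int i = 0; i < paras.size(); i++) {
            if (!paras.get(i).text.trim().startsWith("Fig")) {
                continue;
            }
            // Walk backwards while the previous paragraph holds only images.
            int first = i;
            while (first > 0 && paras.get(first - 1).imageOnly()) {
                first--;
            }
            // A caption with no images directly before it does not form a group.
            if (first < i) {
                groups.add(new int[] {first, i});
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Para> paras = new ArrayList<>();
        paras.add(new Para("Intro", 0));
        paras.add(new Para("", 1));
        paras.add(new Para("", 1));
        paras.add(new Para("Fig 1", 0));
        paras.add(new Para("Fig 2", 0));
        for (int[] g : findGroups(paras)) {
            System.out.println("Group: paragraphs " + g[0] + ".." + g[1]);
        }
    }
}
```

Each index pair plays the role of the bookmark pair in the Aspose code: it marks where a figure group starts and where its caption ends, so the group can be copied out as one unit.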

priyadharshini · June 29, 2017, 8:15am

Thank you, @tahir.manzoor.

Regards
Priya Dharshini J P

tahir.manzoor · June 29, 2017, 11:17am