How to check whether a document is empty

vke3 · February 27, 2019, 12:27pm

Hello,

I have an input DOCX file that has an empty header and footer and no text.
I want to check whether the document is empty (i.e. it contains no text, no image and no table).
I searched the Aspose forum and tried the solutions there. However, although there is no text in the document, the check still reports that a cell node is present. I tried the approach below (or similar):

if (doc.toString(SaveFormat.TEXT).trim() == "")
{
    if (doc.getChildNodes(NodeType.SHAPE, true).getCount() == 0 && doc.getChildNodes(NodeType.TABLE, true).getCount() == 0
            && doc.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 0 && doc.getChildNodes(NodeType.CELL, true).getCount() == 0){
        System.out.println("Document is empty");
    }else{
        System.out.println("Document is NOT empty");
    }
}
I am sharing my input DOCX file. Could you please check it and help me understand why the document is not reported as empty?
What is the preferred way to check whether a document is empty?
The input file is attached below.
File.zip (19.5 KB)

awais.hafeez · February 27, 2019, 12:47pm

@vke3,

In this case, please try using the following code:

Document doc = new Document("E:\\file\\Sample Document.docx");

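// Use equals() to compare string contents; == compares object references in Java.
if (doc.toString(SaveFormat.TEXT).trim().equals(""))
{
    // No text: also make sure there are no shapes, tables or cells.
    // PARAGRAPH nodes are not counted here, unlike in the code from the question.
    if (doc.getChildNodes(NodeType.SHAPE, true).getCount() == 0 && doc.getChildNodes(NodeType.TABLE, true).getCount() == 0
            && doc.getChildNodes(NodeType.CELL, true).getCount() == 0){
        System.out.println("Document is empty");
    }else{
        System.out.println("Document is NOT empty");
    }
}
else
{
    // The document contains text.
    System.out.println("Document is NOT empty");
}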

Hope, this helps.
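
One of the changes from the code in the question is `.equals("")` in place of `== ""`. The following is a minimal plain-Java sketch of why that matters. It makes no Aspose calls, and the class and method names are made up for illustration. `==` compares object references, so a string created at run time is generally not the same object as the literal `""` even when its content is equal.

```java
class EmptyTextCheck {
    // Returns true when the exported text contains nothing but whitespace.
    static boolean isBlankText(String exportedText) {
        return exportedText.trim().isEmpty();
    }

    public static void main(String[] args) {
        // new String("") always creates a distinct object from the literal "".
        String runtimeEmpty = new String("");
        System.out.println(runtimeEmpty == "");       // prints false (reference comparison)
        System.out.println(runtimeEmpty.equals(""));  // prints true (content comparison)
        System.out.println(isBlankText(" \r\n\t "));  // prints true (whitespace only)
    }
}
```

In the Aspose code above, the string passed to such a check would be the result of `doc.toString(SaveFormat.TEXT)`.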

vke3 · February 27, 2019, 1:07pm

@awais.hafeez,

Yes, it works.
It seems I made a small mistake.
Thanks a lot.
I have one more question. I am calling the doc.getRange().replace method with a callback that implements IReplacingCallback. While the first replace is being processed, I want to call another replace (i.e. a replace inside another replace). Is that feasible in Aspose?
I am currently trying this. The first (outer) replace works fine, but the inner one does not: the replacing method of the interface is never called for it.

awais.hafeez · February 27, 2019, 1:32pm

@vke3,

Please ZIP and upload your 1) simplified input Word document, 2) Aspose.Words generated output document showing the undesired behavior, 3) expected document (you can create it by using MS Word) and 4) source code to reproduce the same problem on our end here for testing. We will then investigate the issue on our end and provide you more information.

vke3 · February 28, 2019, 1:03pm

@awais.hafeez,

Thank you for the support. I solved that problem.
There is one more thing I am trying to do. Aspose provides find-and-replace methods. However, I only want to find some text, not replace it, i.e. I have a regular expression and I want to get the data from the document that matches it.
I couldn't find anything on the Aspose forum about reading the document line by line and getting the required data from it.

awais.hafeez · February 28, 2019, 1:26pm

@vke3,

I believe, you can meet this requirement by using the following code:

Document doc = new Document("E:\\temp\\input.docx");

FindReplaceOptions opts = new FindReplaceOptions();
opts.setDirection(FindReplaceDirection.BACKWARD);
opts.setReplacingCallback(new ReplacingCallback(Color.RED));

doc.getRange().replace(Pattern.compile("Replace1"), "" , opts);

doc.save("E:\\temp\\awjava-19.2.docx");

static class ReplacingCallback implements IReplacingCallback {
    public Color color;

    public ReplacingCallback(Color col){
        color = col;
    }

    public int replacing(ReplacingArgs e) throws Exception {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();

        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run) currentNode, e.getMatchOffset());

        ArrayList<Run> runs = new ArrayList<Run>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while ((remainingLength > 0) && (currentNode != null) && (currentNode.getText().length() <= remainingLength)) {
            runs.add((Run) currentNode);
            remainingLength = remainingLength - currentNode.getText().length();

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do {
                currentNode = currentNode.getNextSibling();
            } while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0)) {
            splitRun((Run) currentNode, remainingLength);
            runs.add((Run) currentNode);
        }

        // HERE YOU CAN WRITE YOUR OWN LOGIC
        // For example, we will change the font name and color of all runs in the sequence.
        for (Run run : runs)
        {
            run.getFont().setName("Arial");
            run.getFont().setColor(color);
        }

        // Signal to the replace engine to do nothing because we have already done all what we wanted.
        return ReplaceAction.SKIP;
    }

    /**
     * Splits text of the specified run into two runs. Inserts the new run just
     * after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception {
        Run afterRun = (Run) run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}

vke3 · February 28, 2019, 1:42pm

@awais.hafeez,

Yes, this code will help me find and replace text in the document. But I don't want to replace the text; I just want to get the matching text (find the text that matches my regex). I have some other operations to perform on the matched text. So how can I only find the required text?
Thanks.

awais.hafeez · February 28, 2019, 4:58pm

@vke3,

Please see near the end of the replacing event/method where you can write your own logic. The code just finds the text and changes the font name and color of the search string.
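
In the callback above, `e.getMatch().group()` is the matched text, and returning `ReplaceAction.SKIP` leaves the document unchanged, so the matched string can be collected at the "write your own logic" step. If the position of the match in the node tree does not matter, another option (not the approach from this reply, only a sketch) is to run `java.util.regex` over the exported plain text. The class name, sample text and pattern below are made up for illustration; in real code the text could come from `doc.toString(SaveFormat.TEXT)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindOnly {
    // Collects every substring of text that matches the pattern, without modifying anything.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for text exported from a document.
        String text = "Order A-100 shipped.\r\nOrder B-205 pending.\r\n";
        System.out.println(findAll(Pattern.compile("[A-Z]-\\d+"), text)); // prints [A-100, B-205]
    }
}
```

This only returns strings. It does not tell you which run or paragraph a match came from, which is what the callback approach is for.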

vke3 · March 1, 2019, 10:17am

@awais.hafeez,

Actually, I found that the issue is with my pattern, which contains a new line. I have the content below, which needs to be found using a pattern:
<RTE_Update_Summary_ContentData> Summary

${file:RTE_Update_Summary_ContentData.docx}

</RTE_Update_Summary_ContentData>

I searched with the pattern below:
\<.\>.\n\$\{file:.\}\n\</.\>
This works in plain Java code (Pattern/Matcher), but it does not work in Aspose code (in the replace method).
I tried many combinations, for example:
(\<RTE_._ContentData\>.(.|\n)\$\{file:.\}.(.|\n)\</.\>).
When I use the pattern \<.\>.* it works for the first line of the content (<RTE_Update_Summary_ContentData> Summary), which suggests that the new line is the problem.

I am sharing the DOCX file that has the content. Please suggest a pattern to search for that content.
File.zip (54.8 KB)

awais.hafeez · March 2, 2019, 12:47am

@vke3,

Please ZIP and attach a simplified Java application (source code without compilation errors) that helps us to reproduce this problem on our end. Thanks for your cooperation.

vke3 · March 4, 2019, 8:00am

@awais.hafeez,

I am sharing a Java file with the code, the input file and the expected output.
File.zip (109.2 KB)

The RTE_BP2S_PDFTemplate_ContentData.docx file is the one I already shared in the ZIP folder. The data that needs to match the pattern is below:
<RTE_Update_Summary_ContentData> Summary

${file:RTE_Update_Summary_ContentData.docx}

</RTE_Update_Summary_ContentData>

The same pattern is repeated in the DOCX file, and every occurrence needs to be replaced.

I tried to run it with the full pattern ("\<.\>.\n\$\{file:.\}\n\<.\>") and nothing was replaced; there was no error or exception.
Then I tried splitting the pattern. I first ran it with the pattern ("\<.\>"), and the matched string was replaced properly. Then I added the next part of the pattern, up to ("\<.\>.*"), and got this exception:
java.lang.UnsupportedOperationException: The match includes one or more special or break characters and cannot be replaced.
The documentation of the replace functions says: An exception is thrown if a captured or replacement string contain one or more special characters: paragraph break, cell break, section break, field start, field separator, field end, inline picture, drawing object, footnote.
If replace cannot handle a new line, what other way is there to replace this content?
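
For comparison, here is a plain-Java sketch of matching such a block in a String. It is not Aspose code and not a fix for the replace method: in the document each of these lines is a separate paragraph, and per the exception quoted above, replace rejects matches that include paragraph breaks. The class name and the pattern are assumptions made up for this example, not the pattern from the attached code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBlockRegex {
    // Hypothetical pattern: an opening <RTE_..._ContentData> tag, a ${file:...} token,
    // and a closing tag with the same name. DOTALL lets '.' cross line breaks.
    static final Pattern BLOCK = Pattern.compile(
            "<(RTE_\\w+_ContentData)>.*?\\$\\{file:([^}]+)\\}.*?</\\1>",
            Pattern.DOTALL);

    // Returns the file name inside ${file:...} of the first block, or null if no block matches.
    static String firstFileToken(String text) {
        Matcher m = BLOCK.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String text = "<RTE_Update_Summary_ContentData> Summary\n\n"
                + "${file:RTE_Update_Summary_ContentData.docx}\n\n"
                + "</RTE_Update_Summary_ContentData>\n";
        System.out.println(firstFileToken(text)); // prints RTE_Update_Summary_ContentData.docx
    }
}
```

Without `Pattern.DOTALL` the same pattern would not match this text, because `.` does not match line terminators by default.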

awais.hafeez · March 4, 2019, 10:10am

@vke3,

I am afraid we cannot write regex/pattern expressions for you. You will have to write and experiment with different expressions on your end.

However, one simple way to get the desired output is as follows:

Document doc = new Document("E:\\file\\RTE_BP2S_PDFTemplate_ContentData.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

BookmarkEnd end = null;
int i = 0;
for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {

    if (para.toString(SaveFormat.TEXT).startsWith("<RTE_Update_")) {
        builder.moveTo(para);
        BookmarkStart start = builder.startBookmark("bm_" + i);
        end = builder.endBookmark("bm_" + i);
        para.insertBefore(start, para.getFirstChild());

        i++;
    }

    if (para.toString(SaveFormat.TEXT).startsWith("</RTE_Update_")) {
        if (end != null) {
            para.insertAfter(end, para.getLastChild());
        }
    }
}

for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("bm_")) {
        bm.setText("Found --> " + bm.getName());
        bm.remove();
    }
}

doc.save("E:\\file\\awjava-19.2.docx");

Hope, this helps.
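
The idea behind the code above is to walk the paragraphs in order, open a range at each paragraph that starts with the opening tag, close it at the next paragraph that starts with the closing tag, and then replace the whole range. The same pairing logic can be sketched in plain Java, with a list of strings standing in for the document's paragraphs. The class and method names are hypothetical, no Aspose types are used, and the sketch assumes every opening line has a matching closing line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TagBlockCollapse {
    // Replaces each run of lines, from one starting with openPrefix through the next
    // one starting with closePrefix, by a single "Found --> bm_<n>" line.
    static List<String> collapse(List<String> paragraphs, String openPrefix, String closePrefix) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        int index = 0;
        for (String para : paragraphs) {
            if (!inside && para.startsWith(openPrefix)) {
                inside = true;                        // block starts here
                out.add("Found --> bm_" + index++);
            } else if (inside && para.startsWith(closePrefix)) {
                inside = false;                       // block ends here; the closing line is dropped too
            } else if (!inside) {
                out.add(para);                        // text outside any block is kept
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "Intro",
                "<RTE_Update_Summary_ContentData> Summary",
                "",
                "${file:RTE_Update_Summary_ContentData.docx}",
                "",
                "</RTE_Update_Summary_ContentData>",
                "Outro");
        // prints [Intro, Found --> bm_0, Outro]
        System.out.println(collapse(doc, "<RTE_Update_", "</RTE_Update_"));
    }
}
```

Working paragraph by paragraph avoids needing a regex that spans paragraph breaks.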

vke3 · March 5, 2019, 4:53am

@awais.hafeez,

Yes, this helped me to achieve my requirement. Thanks a lot for your support.

awais.hafeez · March 5, 2019, 6:49am

@vke3,

It is great that you were able to achieve what you were looking for. Please let us know any time you have any further queries.

vke3 · March 11, 2019, 9:50am

@awais.hafeez,

I am looking to extract the content of a bookmark (a bookmark may contain text, shapes, tables and comments).
I didn't find a way on the Aspose site to extract every kind of data from a bookmark. Could you please help with that? Currently I am able to extract text only, not tables or images (it seems only paragraph nodes are extracted).
I have shared one input file that contains a start token and an end token (the start token marks the start of the bookmark and the end token marks its end) and the bookmark content (text, table, image and comments) that I need to extract (the extracted content should keep all the styles of the input document).
I have also shared 4 output files in which the extracted data should be stored (the number of output files equals the number of bookmarks in the input file),
and a code snippet (not compiled, as it uses some different API) that contains the business logic.
File.zip (335.8 KB)

Thanks

awais.hafeez · March 11, 2019, 1:31pm

@vke3,

I believe, you can meet this requirement after reading the following article:
Extract Selected Content Between Nodes

vke3 · March 13, 2019, 6:14am

@awais.hafeez,

Yes, I referred to the same link and found a way to extract the bookmark content.
Thanks.

awais.hafeez · March 13, 2019, 6:36am

@vke3,

It is great that you were able to resolve this issue on your end. Please let us know any time you have any further queries.

vke3 · March 18, 2019, 10:02am

@awais.hafeez,

I have one query. In one of the replies above you showed a way to set a bookmark and remove it, using bm.setText(). The setText method sets the text of the bookmark. Now I want to remove the bookmark's line itself from the DOCX file. When I delete a full line in a DOCX file, the next line moves up into its place. Similarly, after deleting the bookmark, I want the next line to move up to where the bookmark's line was. This will solve my blank-line problem.

Thanks.

awais.hafeez · March 18, 2019, 3:01pm

@vke3,

Please try the following code:

Bookmark bm = doc.getRange().getBookmarks().get("SomeBookmark");
Paragraph para = (Paragraph)bm.getBookmarkStart().getAncestor(NodeType.PARAGRAPH);
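if (para != null){
    // Treat the paragraph as empty when it has no text and no shapes
    // (an assumption; adjust the condition to your documents), then remove it.
    boolean isEmpty = para.toString(SaveFormat.TEXT).trim().isEmpty()
            && para.getChildNodes(NodeType.SHAPE, true).getCount() == 0;
    if (isEmpty)
        para.remove();
}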

Hope, this helps.