How to check whether document is empty or not

vke3 · February 28, 2019, 1:42pm

Yes, this code will help me find and replace text in the document. But I don't want to replace the text; I just want to get the matching text (find the text that matches my regex). I have some other operations to perform on the matched text. So how can I find only the required text?
Thanks.

awais.hafeez · February 28, 2019, 4:58pm

@vke3,

Please see near the end of the replacing event/method where you can write your own logic. The code just finds the text and changes the font name and color of the search string.

vke3 · March 1, 2019, 10:17am

@awais.hafeez,

Actually, I found that the issue is with my pattern, which contains a new line. I have the content below, which needs to be found using a pattern:
<RTE_Update_Summary_ContentData> Summary

${file:RTE_Update_Summary_ContentData.docx}

</RTE_Update_Summary_ContentData>

I searched with the pattern below:
\<.*\>.*\n\$\{file:.*\}\n\</.*\>
This works in plain Java code (Pattern/Matcher); however, it does not work in Aspose code (in the replace method).
I tried many combinations, for example:
(\<RTE_.*_ContentData\>.*(.|\n)*\$\{file:.*\}.*(.|\n)*\</.*\>).*
When I use the pattern \<.*\>.* it works for the first line of the content (<RTE_Update_Summary_ContentData> Summary), which suggests that the problem is with the new line.

I am sharing the docx file that has the content. Please suggest a pattern to search for that content.
File.zip (54.8 KB)
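
For reference, the "find only, do not replace" part of the question can be sketched with plain java.util.regex on an ordinary string. This is a minimal sketch, not Aspose code: it assumes the document text has already been extracted into a String with \n between lines, and the class and method names are hypothetical. By default `.` does not match a line terminator, so a multi-line block needs the DOTALL flag (or an explicit alternative such as (.|\n)). Whether the same pattern works inside the Aspose replace method is a separate matter, because paragraph breaks are special characters there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds token blocks in plain text without replacing anything.
final class TokenBlockRegex {

    // DOTALL lets '.' cross line breaks; \1 forces the closing tag to repeat the opening tag name.
    // The lazy quantifiers (.*?) stop each match at the nearest ${file:...} and closing tag.
    static final Pattern BLOCK = Pattern.compile(
            "<(RTE_\\w+_ContentData)>.*?\\$\\{file:([^}]*)\\}.*?</\\1>",
            Pattern.DOTALL);

    // Returns the file name from every ${file:...} placeholder found inside a token block.
    static List<String> findFileNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "<RTE_Update_Summary_ContentData> Summary\n\n"
                + "${file:RTE_Update_Summary_ContentData.docx}\n\n"
                + "</RTE_Update_Summary_ContentData>\n";
        // prints [RTE_Update_Summary_ContentData.docx]
        System.out.println(findFileNames(text));
    }
}
```

Matcher.find() only reports matches, so any further processing can be done on group(0) (the whole block) or on the captured groups.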

awais.hafeez · March 2, 2019, 12:47am

@vke3,

Please ZIP and attach a simplified Java application (source code without compilation errors) that helps us to reproduce this problem on our end. Thanks for your cooperation.

vke3 · March 4, 2019, 8:00am

@awais.hafeez,

I am sharing a Java file with the code, the input file, and the expected output.
File.zip (109.2 KB)

I have already shared the RTE_BP2S_PDFTemplate_ContentData.docx file in the zip folder. The data that needs to be matched by the pattern is below:
<RTE_Update_Summary_ContentData> Summary

${file:RTE_Update_Summary_ContentData.docx}

</RTE_Update_Summary_ContentData>

The same pattern is repeated in the docx file, and every occurrence needs to be replaced.

I tried to execute it with the full pattern ("\<.*\>.*\n\$\{file:.*\}\n\<.*\>"), and nothing was replaced; there was no error or exception.
Then I tried splitting the pattern. I first executed it with the pattern ("\<.*\>"), and the matched string was replaced properly. When I extended the pattern to ("\<.*\>.*"), I got this exception:
java.lang.UnsupportedOperationException: The match includes one or more special or break characters and cannot be replaced.
The documentation of the replace functions states: "An exception is thrown if a captured or replacement string contain one or more special characters: paragraph break, cell break, section break, field start, field separator, field end, inline picture, drawing object, footnote."
If replace cannot handle a new line, what other way is there to replace this content?

awais.hafeez · March 4, 2019, 10:10am

@vke3,

I am afraid, we will not be able to write regex/pattern expressions. You have to write and experiment with different expressions on your end.

However, one simple way to get the desired output is as follows:

// Requires: import com.aspose.words.*;
Document doc = new Document("E:\\file\\RTE_BP2S_PDFTemplate_ContentData.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

// Wrap every <RTE_Update_...> ... </RTE_Update_...> block in a bookmark named bm_0, bm_1, ...
BookmarkEnd end = null;
int i = 0;
for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {

    // Opening token: create the bookmark and move its start to the beginning of this paragraph
    if (para.toString(SaveFormat.TEXT).startsWith("<RTE_Update_")) {
        builder.moveTo(para);
        BookmarkStart start = builder.startBookmark("bm_" + i);
        end = builder.endBookmark("bm_" + i);
        para.insertBefore(start, para.getFirstChild());

        i++;
    }

    // Closing token: move the pending bookmark end to the end of this paragraph
    if (para.toString(SaveFormat.TEXT).startsWith("</RTE_Update_")) {
        if (end != null) {
            para.insertAfter(end, para.getLastChild());
        }
    }
}

// Replace the content of each generated bookmark, then drop the bookmark itself
for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("bm_")) {
        bm.setText("Found --> " + bm.getName());
        bm.remove();
    }
}

doc.save("E:\\file\\awjava-19.2.docx");

Hope, this helps.
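
The pairing logic in the loop above can be tried out without Aspose. The following is only a sketch under simplifying assumptions: each paragraph is modelled as a plain String (roughly what para.toString(SaveFormat.TEXT) would return), and TokenRangeScanner and findRanges are hypothetical names, not part of any library. It records the paragraph index range from each start token to the next end token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the paragraph scan: paragraphs are plain strings.
final class TokenRangeScanner {

    // Returns {startIndex, endIndex} pairs (inclusive) for each start token
    // that is followed by an end token.
    static List<int[]> findRanges(List<String> paragraphs, String startPrefix, String endPrefix) {
        List<int[]> ranges = new ArrayList<>();
        int open = -1; // index of the most recent unmatched start token, or -1
        for (int i = 0; i < paragraphs.size(); i++) {
            String text = paragraphs.get(i).trim();
            if (text.startsWith(startPrefix)) {
                open = i;
            } else if (text.startsWith(endPrefix) && open >= 0) {
                ranges.add(new int[] {open, i});
                open = -1;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Intro",
                "<RTE_Update_Summary_ContentData> Summary",
                "",
                "${file:RTE_Update_Summary_ContentData.docx}",
                "",
                "</RTE_Update_Summary_ContentData>");
        for (int[] r : findRanges(paragraphs, "<RTE_Update_", "</RTE_Update_")) {
            System.out.println(r[0] + ".." + r[1]); // prints 1..5
        }
    }
}
```

Because matching happens one paragraph at a time, no pattern ever has to span a paragraph break, which is what the replace-based attempts ran into.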

vke3 · March 5, 2019, 4:53am

@awais.hafeez,

Yes, this helped me to achieve my requirement. Thanks a lot for your support.

awais.hafeez · March 5, 2019, 6:49am

@vke3,

It is great that you were able to achieve what you were looking for. Please let us know any time you have any further queries.

vke3 · March 11, 2019, 9:50am

@awais.hafeez,

I am looking to extract the content of a bookmark (the bookmark may contain text, shapes, tables, and comments).
I did not find a way on the Aspose site to extract every kind of data from a bookmark. Could you please help with that? Currently I am able to extract only text, not tables or images (it seems that only the paragraph nodes are extracted).
I have shared one input file that contains a start token and an end token (the start token marks the start of the bookmark and the end token marks its end) and the bookmark content (text, table, image, and comments) that I need to extract (the extracted content should keep all of the input document's styles).
4 output files in which the extracted data should be stored (the number of output files equals the number of bookmarks in the input file).
A code snippet (not compiled, as it uses a different API) that contains the business logic.
File.zip (335.8 KB)

Thanks

awais.hafeez · March 11, 2019, 1:31pm

@vke3,

I believe, you can meet this requirement after reading the following article:
Extract Selected Content Between Nodes

vke3 · March 13, 2019, 6:14am

@awais.hafeez,

Yes, I referred to the same link and found a way to extract the bookmark content.
Thanks.

awais.hafeez · March 13, 2019, 6:36am

@vke3,

It is great that you were able to resolve this issue on your end. Please let us know any time you have any further queries.

vke3 · March 18, 2019, 10:02am

@awais.hafeez,

I have one query. In one of the comments above, you showed a way to set a bookmark and remove it, using bm.setText(). The setText method sets the value of the bookmark. Now I want to remove the bookmark's line itself from the docx file. When I delete a full line in a docx file, the next line moves up into its place. Similarly, after deleting the bookmark, I want the next line to move up to where the bookmark was. This will solve my blank line problem.

Thanks.

awais.hafeez · March 18, 2019, 3:01pm

@vke3,

Please try the following code:

Bookmark bm = doc.getRange().getBookmarks().get("SomeBookmark");
Paragraph para = (Paragraph) bm.getBookmarkStart().getAncestor(NodeType.PARAGRAPH);
if (para != null) {
    // Check whether the paragraph is empty; if it is, remove it
    if (para.toString(SaveFormat.TEXT).trim().isEmpty()) {
        para.remove();
    }
}

Hope, this helps.

vke3 · March 29, 2019, 12:17pm

@awais.hafeez,

Hello,
I am facing a new issue. I have an input file to which I added bookmarks the way you suggested above.
I have several other docx files whose names match the bookmark names in the input file. I have to read those files one by one, find the bookmark with the same name as the file, read the content of the file, and add it inside that bookmark in the input file.
The input file has the bookmark's start token, one blank line (paragraph), and then the bookmark's end token, like this:
[StartBookmark]

[/EndBookmark]
I want to add the content at that blank line (paragraph); however, it gets added inside the bookmark like this:
[StartBookmark]
// Add content
[/EndBookmark]
Could you please have a look and comment on it?
I have shared all of the input files, the expected output file, the output I am getting, and the other files whose data I have to add to the input file. A code snippet is included, but it is not compiled, because it uses an API other than Java; I have kept it as it is so that you can follow the logic.
There is also an output.png file, a screenshot of output.docx, which gives a better idea of the issue.
File.zip (308.9 KB)

Thanks

awais.hafeez · March 30, 2019, 1:47am

@vke3,

If I understand you correctly, removing those empty paragraphs will resolve this issue. And you can use this code to detect empty paragraphs: para.toString(SaveFormat.TEXT).trim().equals(""). Please let me know if I can be of any further assistance.
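
A small plain-Java sketch of that emptiness check follows. It assumes the exported text of an empty paragraph contains nothing but whitespace and line-break characters; EmptyParagraphCheck and its methods are hypothetical names. One detail worth knowing is that String.trim() only strips characters up to U+0020, so a paragraph containing just a non-breaking space (U+00A0) is not reported as empty unless it is handled separately.

```java
// Hypothetical helper operating on the text exported from a paragraph.
final class EmptyParagraphCheck {

    // Same test as in the post: trim() removes spaces, tabs, CR and LF.
    static boolean isEmpty(String paragraphText) {
        return paragraphText.trim().isEmpty();
    }

    // Variant that also treats non-breaking spaces as blank.
    static boolean isEmptyIgnoringNbsp(String paragraphText) {
        return paragraphText.replace('\u00A0', ' ').trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isEmpty("\r\n"));                  // prints true
        System.out.println(isEmpty("HELLO WORLD\r\n"));       // prints false
        System.out.println(isEmpty("\u00A0\r\n"));            // prints false
        System.out.println(isEmptyIgnoringNbsp("\u00A0\r\n")); // prints true
    }
}
```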

vke3 · April 1, 2019, 6:02am

@awais.hafeez,

Which empty paragraph?
[StartBookmark]
**** empty paragraph?
[/EndBookmark]

Is this the same empty paragraph that I marked with * above?
If yes, then I can't remove it.
I have to add the content at that same empty paragraph and also keep the bookmark as it is.

awais.hafeez · April 1, 2019, 3:11pm

@vke3,

Please check the following simple code to remove unwanted/empty paragraph against “RTE_Update_EffectiveDate_ContentData.docx”.

Document srcdoc = new Document("E:\\File\\RTE_Update_EffectiveDate_ContentData.docx");
Document doc = new Document("E:\\File\\Input File.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
Paragraph para = doc.FirstSection.Body.Paragraphs[1];
builder.MoveTo(para);
builder.InsertDocument(srcdoc, ImportFormatMode.KeepSourceFormatting);
builder.CurrentParagraph.Remove();
doc.Save("E:\\File\\19.3.docx");

vke3 · April 2, 2019, 5:56am

@awais.hafeez,

Hello,
Well, I don't want to remove the paragraph from “RTE_Update_EffectiveDate_ContentData.docx”.
I want to add the content of “RTE_Update_EffectiveDate_ContentData.docx” to “Input File.docx”.

If you look at “Input File.docx”, it has the hidden token <RTE_Update_EffectiveDate_ContentData>. This is the start of my bookmark. After that start token there is one empty line, and after the empty line there is another hidden token, </RTE_Update_EffectiveDate_ContentData>, which is the end of the bookmark.

So the format of the multi-line bookmark is:
<RTE_Update_EffectiveDate_ContentData> -----> Start of bookmark
------> Empty line/paragraph
</RTE_Update_EffectiveDate_ContentData> ------> End of bookmark.

Whatever content the “RTE_Update_EffectiveDate_ContentData.docx” file has should be added to the respective bookmark in the input file. Suppose my “RTE_Update_EffectiveDate_ContentData.docx” file has the content HELLO WORLD. The expected output would then be:

<RTE_Update_EffectiveDate_ContentData>
HELLO WORLD
</RTE_Update_EffectiveDate_ContentData>

Currently, the code snippet I shared gives me the following result:
<RTE_Update_EffectiveDate_ContentData>

HELLO WORLD </RTE_Update_EffectiveDate_ContentData>

So could you please explain why this is happening and what changes are needed in the code to get the expected result? You can refer to the shared code snippet to check what I have implemented.
Thank you for your continued support.

awais.hafeez · April 2, 2019, 1:32pm

@vke3,

Please note that the DocumentBuilder.InsertDocument method mimics the MS Word behavior, as if CTRL+‘A’ (select all content) was pressed, then CTRL+‘C’ (copy selected into the buffer) inside one document and then CTRL+‘V’ (insert content from the buffer) inside another document. You can do the following steps to verify this:

Copy all content of ‘RTE_Update_EffectiveDate_ContentData.docx’ by using MS Word
Open ‘Input File.docx’ with MS Word
Move cursor to the first empty Paragraph location where you want to paste the copied content
Paste the content
Now, notice that even MS Word leaves that empty Paragraph as is

You need to write some logic to get rid of such empty Paragraphs like the one I shared in my previous post.