Extract Text from Multiple Pages

Hi,

I have a scenario where I need to extract the text between a starting text and an ending text. The start and end can be on different pages. Is there a way to do this in Aspose.Words?

Regards,

Hi Rajeev,
Yes, you can do that. Please check https://docs.aspose.com/words/java/extract-selected-content-between-nodes/ for more details and let us know if you see any issue.
Best Regards,

Hi Ijaz,

Thanks for your response. I tried the code and was able to extract text in a simple scenario, but I am facing the following challenges.

  1. How do I detect that there is no match for a search word? In my case there can be several possible end texts up to which I need to extract, and any one of them may not be found.
  2. How do I search for a hard return followed by text? I am trying to make sure that my starting search word begins a new paragraph.

Regards,

Hi Rajeev,

  1. As shared earlier, you can use the example from https://docs.aspose.com/words/java/find-and-replace/ to find a word. In this example, IReplacingCallback.Replacing is called for every match. You can declare a counter variable and increment it (counter++) in the Replacing event. If the counter is still less than 1 after the search completes, no match was found.
  2. Once the string is matched, you can use Run.ParentParagraph.Range.Text.StartsWith to check whether the search word or phrase is at the start of the paragraph.
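
The two checks above can be sketched on plain strings. This is only an illustration of the counter and "starts with" logic, not Aspose.Words code: the class MatchCounter and its methods are hypothetical names, and each line of the input string stands in for one paragraph of the document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCounter {
    /** Counts every occurrence of phrase in text. */
    public static int countMatches(String text, String phrase) {
        Matcher matcher = Pattern.compile(Pattern.quote(phrase)).matcher(text);
        int counter = 0;
        while (matcher.find()) {
            counter++; // plays the role of counter++ inside the Replacing callback
        }
        return counter;
    }

    /** Counts paragraphs (modelled here as lines) that begin with phrase. */
    public static int countParagraphStarts(String text, String phrase) {
        int counter = 0;
        for (String paragraph : text.split("\\R")) {
            if (paragraph.trim().startsWith(phrase)) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String text = "Summary of Starting Text usage\n"
                + "Starting Text appears here\n"
                + "No marker in this line";
        System.out.println(countMatches(text, "Starting Text"));         // 2
        System.out.println(countParagraphStarts(text, "Starting Text")); // 1
        // A counter below 1 means the phrase was never found.
        System.out.println(countMatches(text, "Ending Text") < 1 ? "no match" : "match");
    }
}
```

In a real document the same counter would be incremented inside the callback, and the paragraph check would run against the text of the matched run's parent paragraph.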
Best Regards,

Hi Ijaz,

Thanks for providing the solutions. Using the first approach, I have been bookmarking the start and end of the search text. In a scenario where there is no end text, I need to get 100 characters or words starting from the starting bookmark. How can I achieve that using Aspose.Words for Java?

Regards,

Hi Rajeev,
Can you please share your sample document and expected output string (after getting 100 characters or words)?
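
Until a sample document is available, the requested behaviour can be sketched on plain text. This is an assumption-laden illustration, not Aspose.Words code: BetweenMarkers and extract are hypothetical names, the document is treated as a single string, the result starts right after the start text, and the fallback counts characters rather than words.

```java
import java.util.Arrays;
import java.util.List;

class BetweenMarkers {
    /**
     * Returns the text after start up to the nearest of the candidate end texts.
     * If none of the end texts is found, returns up to fallbackLength characters.
     * Returns null when the start text itself is missing.
     */
    public static String extract(String text, String start, List<String> ends, int fallbackLength) {
        int startIdx = text.indexOf(start);
        if (startIdx < 0) {
            return null; // no start text, so nothing to extract
        }
        int from = startIdx + start.length();
        int endIdx = -1;
        for (String end : ends) {
            int idx = text.indexOf(end, from);
            if (idx >= 0 && (endIdx < 0 || idx < endIdx)) {
                endIdx = idx; // keep the earliest end text that follows the start
            }
        }
        if (endIdx < 0) {
            // No end text matched: fall back to a fixed number of characters.
            endIdx = Math.min(text.length(), from + fallbackLength);
        }
        return text.substring(from, endIdx);
    }

    public static void main(String[] args) {
        String text = "Intro. Starting Text: alpha beta gamma. Closing Text: done.";
        List<String> ends = Arrays.asList("Ending Text", "Closing Text");
        System.out.println(extract(text, "Starting Text:", ends, 100).trim());
        System.out.println(extract(text, "Starting Text:", Arrays.asList("Missing"), 10));
    }
}
```

The first call stops at "Closing Text" because "Ending Text" is absent; the second finds no end text and returns the 10 characters that follow the start text.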
Best Regards,

Hi Ijaz,

I am using the following code to add bookmarks to the existing document, and then I pass the starting and ending nodes to extractContent, but it does not seem to be working. After adding the bookmark I save the document, but when I open it I see the square bracket for the starting bookmark and not for the ending bookmark.

DocumentBuilder docBuilder = new DocumentBuilder(document);
// Move cursor to document start and insert bookmark
docBuilder.moveToDocumentStart();

NodeCollection paragraphs = summaryStatement.getChildNodes(NodeType.PARAGRAPH, true);
// Look through all paragraphs to find those that begin with the start or end text.
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Starting Text"))
    {
        docBuilder.startBookmark("BookMark1");
    }
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Ending Text"))
    {
        docBuilder.endBookmark("BookMark1");
        break;
    }
}

Bookmark bookmark1 = summaryStatement.getRange().getBookmarks().get("BookMark1");
ArrayList nodes = null;
Document newdoc = null;
nodes = extractContent(bookmark1.getBookmarkStart(), bookmark1.getBookmarkEnd(), true);
newdoc = generateDocument(summaryStatement, nodes);

When I print newdoc.getText(), it is blank. Even bookmark1.getText() prints blank.

Can you please help me understand if I am doing something wrong?

Regards,

Hi Rajeev,
It looks like you first move to the start of the document (docBuilder.moveToDocumentStart) and then start and end the bookmark at that same position, so the bookmark contains no text. You need to move the cursor to the intended start and end positions before starting and ending the bookmark, as in the following code.

DocumentBuilder docBuilder = new DocumentBuilder(doc);
docBuilder.moveToDocumentStart();
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);

// Look through all paragraphs to find those that begin with the start or end text.
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("StartText"))
    {
        docBuilder.moveTo(paragraph.getChildNodes(NodeType.RUN, true).get(0));
        docBuilder.startBookmark("BookMark1");
    }
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("EndText"))
    {
        docBuilder.moveTo(paragraph.getChildNodes(NodeType.RUN, true).get(0));
        docBuilder.endBookmark("BookMark1");
        break;
    }
}
System.out.print(doc.getRange().getBookmarks().get("BookMark1").getText());

Best Regards,