Extract Text from Multiple Pages

rajeevkrmathur · August 28, 2015, 1:54pm

Hi,

I have a scenario where I need to extract the text between a starting and an ending phrase. The start and end of the text can be on different pages. Is there a way to do this in Aspose.Words?

Regards,

muhammad.ijaz · August 31, 2015, 5:01am

Hi Rajeev,
Yes, you can do that. Please check https://docs.aspose.com/words/java/extract-selected-content-between-nodes/ for more details and let us know if you see any issue.
Best Regards,
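
A Word document is held as a continuous flow of paragraphs rather than as separate pages, so the same extraction logic should apply whether the start and end text sit on one page or several. The following is a minimal sketch of that idea in plain Java, with a list of strings standing in for the document's paragraphs. It does not use the Aspose.Words API, and the names BetweenMarkers and extractBetween are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BetweenMarkers {
    // Returns the paragraphs from the first one that starts with startText up to,
    // but not including, the first later one that starts with endText.
    // Returns an empty list if either marker is missing.
    static List<String> extractBetween(List<String> paragraphs, String startText, String endText) {
        int start = -1;
        for (int i = 0; i < paragraphs.size(); i++) {
            String p = paragraphs.get(i).trim();
            if (start < 0) {
                if (p.startsWith(startText)) {
                    start = i;
                }
            } else if (p.startsWith(endText)) {
                return new ArrayList<>(paragraphs.subList(start, i));
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        // The page on which each paragraph is rendered plays no role here.
        List<String> paragraphs = Arrays.asList(
            "Introduction",
            "Starting Text of the section",
            "Body that continues onto the next page",
            "Ending Text of the section",
            "Appendix");
        System.out.println(extractBetween(paragraphs, "Starting Text", "Ending Text"));
    }
}
```

In a real document the same loop would run over the document's paragraph nodes, and the two matching nodes would be passed to the extraction helper from the linked article.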

rajeevkrmathur · September 3, 2015, 4:19pm

Hi Ijaz,

Thanks for your response. I tried the code and was able to extract text in a simple scenario. I am facing the following challenges:

How do I detect that there is no match for a word? In my case there can be several candidate end texts up to which I need to extract, in case one of them is not found.
How do I search for a hard return followed by text? I want to make sure that my starting search word begins a new paragraph.

Regards,

muhammad.ijaz · September 7, 2015, 11:18am

Hi Rajeev,

As shared earlier, you can use the example from https://docs.aspose.com/words/java/find-and-replace/ to find a word. In this example, IReplacingCallback.Replacing is called for every match. You can declare a counter variable and increment it (counter++) in the Replacing event. If the counter is still less than 1 after the search, no match was found.
Once the string is matched, you can use Run.ParentParagraph.Range.Text.StartsWith to check whether the search word or phrase is at the start of the paragraph.
Best Regards,
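
The two suggestions above can be sketched in plain Java. This illustrates the logic only: it assumes the document or paragraph text is already available as a string, uses java.util.regex in place of the Aspose.Words replacing callback, and the names MatchCheck, countMatches, firstEndMarkerFound and startsParagraph are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCheck {
    // Counts literal occurrences of phrase in text; 0 means no match was found.
    static int countMatches(String text, String phrase) {
        Matcher m = Pattern.compile(Pattern.quote(phrase)).matcher(text);
        int counter = 0;
        while (m.find()) {
            counter++; // plays the role of counter++ in the replacing callback
        }
        return counter;
    }

    // Returns the first end marker, in the order given, that occurs in text,
    // or null if none of the candidates occurs.
    static String firstEndMarkerFound(String text, List<String> candidates) {
        for (String candidate : candidates) {
            if (countMatches(text, candidate) > 0) {
                return candidate;
            }
        }
        return null;
    }

    // True if the paragraph text begins with phrase, ignoring surrounding whitespace.
    static boolean startsParagraph(String paragraphText, String phrase) {
        return paragraphText.trim().startsWith(phrase);
    }

    public static void main(String[] args) {
        String text = "Starting Text ... some body ... Alternative End";
        System.out.println(countMatches(text, "Ending Text"));
        System.out.println(firstEndMarkerFound(text, Arrays.asList("Ending Text", "Alternative End")));
        System.out.println(startsParagraph("  Starting Text of the section", "Starting Text"));
    }
}
```

Checking the candidates in a fixed order gives a simple priority among the possible end texts; picking the one that occurs earliest in the document would need the match positions as well.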

rajeevkrmathur · September 7, 2015, 11:28am

Hi Ijaz,

Thanks for providing the solutions. Using the first approach, I have been bookmarking the start and end of the search text. When there is no end text, I need to get 100 characters or words beginning at the starting bookmark. How can I achieve that using Aspose.Words for Java?

Regards,

muhammad.ijaz · September 8, 2015, 10:25am

Hi Rajeev,
Can you please share your sample document and expected output string (after getting 100 characters or words)?
Best Regards,
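
One possible way to handle the missing-end case is sketched below in plain Java. It assumes the text that follows the starting bookmark has already been obtained as a string; it does not use the Aspose.Words API, and the names Fallback, firstChars and firstWords are hypothetical.

```java
import java.util.Arrays;

class Fallback {
    // Returns at most maxChars characters of text, beginning at index start.
    // Assumes 0 <= start <= text.length().
    static String firstChars(String text, int start, int maxChars) {
        int end = Math.min(text.length(), start + maxChars);
        return text.substring(start, end);
    }

    // Returns at most maxWords whitespace-separated words of text,
    // joined by single spaces.
    static String firstWords(String text, int maxWords) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        int n = Math.min(words.length, maxWords);
        return String.join(" ", Arrays.copyOfRange(words, 0, n));
    }

    public static void main(String[] args) {
        String afterStart = "Starting Text followed by the rest of the document body";
        System.out.println(firstChars(afterStart, 0, 100));
        System.out.println(firstWords(afterStart, 100));
    }
}
```

Both helpers return the whole remaining text when it is shorter than the limit, so neither throws on a short document.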

rajeevkrmathur · September 8, 2015, 9:45pm

Hi Ijaz,

I am using the following code to add bookmarks to the existing document, and then I pass the starting and ending nodes to extractContent, but it does not seem to work. After adding the bookmark I save the document, but when I open it I see the square bracket for the starting bookmark and not for the ending bookmark.

DocumentBuilder docBuilder = new DocumentBuilder(document);
// Move cursor to document start and insert bookmark
docBuilder.moveToDocumentStart();

NodeCollection paragraphs = summaryStatement.getChildNodes(NodeType.PARAGRAPH, true);
// Look through all paragraphs to find those that start with the marker text.
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Starting Text"))
    {
        docBuilder.startBookmark("BookMark1");
    }
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Ending Text"))
    {
        docBuilder.endBookmark("BookMark1");
        break;
    }
}

Bookmark bookmark1 = summaryStatement.getRange().getBookmarks().get("BookMark1");
ArrayList nodes = null;
Document newdoc = null;
nodes = extractContent(bookmark1.getBookmarkStart(), bookmark1.getBookmarkEnd(), true);
newdoc = generateDocument(summaryStatement, nodes);

When I print newdoc.getText(), it is blank. Printing bookmark1.getText() also gives a blank string.

Can you please help me understand whether I am doing something wrong?

Regards,

muhammad.ijaz · September 9, 2015, 9:31pm

Hi Rajeev,
It looks like you first move to the start of the document (docBuilder.moveToDocumentStart) and then start and end the bookmark at that same position, so the bookmark contains no text. You need to move the cursor to the intended start and end positions before starting and ending the bookmark, as in the following code.

DocumentBuilder docBuilder = new DocumentBuilder(doc);
docBuilder.moveToDocumentStart();
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);

// Look through all paragraphs to find those that start with the marker text.
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("StartText"))
    {
        docBuilder.moveTo(paragraph.getChildNodes(NodeType.RUN, true).get(0));
        docBuilder.startBookmark("BookMark1");
    }
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("EndText"))
    {
        docBuilder.moveTo(paragraph.getChildNodes(NodeType.RUN, true).get(0));
        docBuilder.endBookmark("BookMark1");
        break;
    }
}
System.out.print(doc.getRange().getBookmarks().get("BookMark1").getText());

Best Regards,