Get page numbers of single or multiple occurrences of search terms based on the offset or start-end range of the search terms in a Word document

Hi,

I am trying to find the page numbers of single or multiple occurrences of search terms in a Word document, based on the offset or start-end range of each search term.

I am able to find the page numbers of single or multiple occurrences of search terms in a Word document using Aspose.Words.Replacing (FindReplaceOptions), but I have a scenario in which this approach does not give me the expected results.

Scenario 1 - The search term, say “ASWPF”, occurs multiple times: once on page 2, once on page 4 and once on page 5. I have to map the page 2 occurrence to A, the page 4 occurrence to B and the page 5 occurrence to C. My custom class FindSearchTermAndGetPageNumber, which uses the Aspose.Words.Replacing (FindReplaceOptions) approach, fetches the page numbers of all occurrences, with no way to tell which occurrence is which.
It would be helpful if I could find the page number of a search term based on the offset or start-end range of that particular occurrence. My input will be the document content, the search term and the offset value (the start-end range value, or the position of the search term in the document).

Thanks.

@chetan85 you can use the following implementation to get the start and end page of each match of the search term:

Document doc = new Document("C:\\Temp\\input.docx");
var callback = new ReplaceForPageCounterCallBack(doc);
doc.Range.Replace("ASWPF ASWPF", string.Empty, new FindReplaceOptions(callback));

foreach(var item in callback.ElemPages)
{
    // EndsAt (not StartsAt) is the page on which the matched node ends.
    Console.WriteLine($"Node: {item.Node.NodeType} Search: {item.SearchTerm} Start Page: {item.StartsAt} End Page: {item.EndsAt}");
}

public class ReplaceForPageCounterCallBack : IReplacingCallback
{
    private readonly LayoutCollector _collector;

    private List<PagePosition> elemPages;

    public List<PagePosition> ElemPages { get => elemPages; }
    public ReplaceForPageCounterCallBack(Document doc)
    {
        elemPages = new List<PagePosition>();
        _collector = new LayoutCollector(doc);
    }

    public ReplaceAction Replacing(ReplacingArgs args)
    {
        var start = _collector.GetStartPageIndex(args.MatchNode);
        var end = _collector.GetEndPageIndex(args.MatchNode);

        elemPages.Add(new PagePosition()
        {
            Node = args.MatchNode,
            StartsAt = start,
            EndsAt = end,
            SearchTerm = args.Match.Value
        });

        return ReplaceAction.Skip;
    }
}

public class PagePosition
{
    public Node Node { get; set; }
    public int StartsAt { get; set; }
    public int EndsAt { get; set; }
    public string SearchTerm { get; set; }
}

You can learn more about LayoutCollector class following this link

Thanks for the reply. I had written almost the same code for my requirement, but it fails when the same search term occurs on multiple pages.

Example: let's say we have a List<string> in which each string is a search term present in the document.
Let's say we have the search terms below:
[0] - “aspose” - this is in first page in the document
[1] - “family” - this is in first page in the document
[2] - “eduardo” - this is in first page in the document
[3] - “aspose” - this is in second page in the document
[4] - “chetan” - this is in second page in the document
[5] - “aspose” - this is in second page in the document

In the above list, the search term “aspose” appears three times in the document, and with the code you shared, callback.ElemPages is populated as shown in the screenshot below:

image.png (41.6 KB)

I have only 6 search terms and need the page numbers of only those 6 terms. Instead, we get 12 items in the callback.ElemPages list, because the ReplacingCallback is called three times for each “aspose” entry in the list, as the word occurs three times in the document.

With the Microsoft.Office.Interop.Word libraries we were able to get the Range of a particular word or text in the document, that is, the position of the search term (“aspose”) in the document. The Range start values look like this: the first occurrence is at 244, the second at 1244 and the third at 2455.

I was looking for some way in Aspose to pass the range (position) of the particular search term and find the page number of it so that I would get the exact page number of search terms. I think we don’t have such option in Aspose. Please let me know if we have this.

My search term list is in the order of appearance in the document, so I tried the approach below.

For the first “aspose” term in the search list, the ReplacingCallback is called three times, since the term has 3 occurrences in the document. In the first call I find the page number of the first “aspose” node, insert a field with builder.MoveTo(currentNode); builder.InsertField("SkipSearch"); and return ReplaceAction.Stop, so that we skip the 2 remaining callbacks for the same search term and move on to the next item in the search list, searchlist[1] (“family”).

When it reaches searchlist[3], i.e. the second occurrence of “aspose”, I check whether the previous sibling is a field with if (currentNode.PreviousSibling.NodeType == NodeType.FieldEnd). This way I can skip the node whose page number has already been calculated.

But with this approach, inserting the field changes the page numbers of the search terms: when builder.InsertField("SkipSearch") is executed, the user's document is altered, and its pagination with it.

Please let me know which would be the best way to mark the node whose page number is already calculated and at the same time we need to make sure that page numbers of search terms are not getting altered due to adding something to the document.

I tried my best to explain the scenario, please let me know if you have any queries. Thanks.

I think the text “Error! Bookmark not defined” is added when I insert the field, and that is what shifts the pages.
I tried builder.InsertField("SkipSearch", null); instead of builder.InsertField("SkipSearch"), and no additional text is inserted into the document. So now I am getting the expected result.

Please share if you have any other simple approach to achieve my requirement. Thanks!

@chetan85 It looks like you are searching for the words one by one in a loop. In your case you should probably search for all the words in a single Range.Replace call. Then the page numbers of the words are calculated in the same order the words appear in the document. To achieve this you can use a regular expression:

string[] wordsToSearch = new string[] { "aspose", "family", "eduardo", "chetan" };

Document doc = new Document(@"C:\Temp\in.docx");

PageNumbersCollector collector = new PageNumbersCollector(doc);
Regex regex = new Regex(string.Join("|", wordsToSearch));

doc.Range.Replace(regex, "", new FindReplaceOptions() { ReplacingCallback = collector });

foreach (WordToPage wordToPage in collector.WordsToPages)
    Console.WriteLine("{0} - {1}", wordToPage.Word, wordToPage.Page);

private class PageNumbersCollector : IReplacingCallback
{
    public PageNumbersCollector(Document doc)
    {
        mCollector = new LayoutCollector(doc);
    }

    public ReplaceAction Replacing(ReplacingArgs args)
    {
        string word = args.Match.Value;
        int page = mCollector.GetStartPageIndex(args.MatchNode);
        WordsToPages.Add(new WordToPage(word, page));
        return ReplaceAction.Skip;
    }

    public List<WordToPage> WordsToPages = new List<WordToPage>();

    private LayoutCollector mCollector;
}

private class WordToPage
{
    public WordToPage(string word, int page)
    {
        Word = word;
        Page = page;
    }

    public string Word { get; set; }
    public int Page { get; set; }
}
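
Once all matches are collected in document order, they still have to be lined up with the ordered search list from the question (“aspose”, “family”, “eduardo”, “aspose”, ...). Below is a minimal sketch of one way to do that, assuming both the search list and the collected matches are in document order. SearchTermPageMapper and MapTermsToPages are hypothetical names, not part of Aspose.Words; the matches parameter stands for the (word, page) pairs copied from collector.WordsToPages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Words.
public static class SearchTermPageMapper
{
    // matches:      (word, page) pairs in document order, e.g. copied from
    //               collector.WordsToPages after a single Range.Replace call.
    // orderedTerms: search terms in the order they are expected to appear.
    // Returns one page number per term, or -1 if no remaining match fits.
    public static List<int> MapTermsToPages(
        IReadOnlyList<(string Word, int Page)> matches,
        IReadOnlyList<string> orderedTerms)
    {
        var pages = new List<int>(orderedTerms.Count);
        int cursor = 0; // index of the first match that has not been consumed yet

        foreach (string term in orderedTerms)
        {
            int found = -1;
            for (int i = cursor; i < matches.Count; i++)
            {
                if (string.Equals(matches[i].Word, term, StringComparison.Ordinal))
                {
                    found = i;
                    break;
                }
            }

            if (found >= 0)
            {
                pages.Add(matches[found].Page);
                cursor = found + 1; // later terms may only use later matches
            }
            else
            {
                pages.Add(-1); // term not found after the cursor; cursor stays put
            }
        }

        return pages;
    }
}
```

Because each term consumes the first unused match with the same text, the second “aspose” in the list is paired with the second “aspose” in the document, and nothing has to be inserted into the document to mark processed nodes. The comparison is case-sensitive (StringComparison.Ordinal); that is an assumption and should follow whatever matching options the regex uses.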

FYI @eduardo.canal

@chetan85 this issue is related to the regular expression itself, which is outside the scope of this forum. I recommend that you first evaluate your expression with a tool such as regex101.com (following the link you will find a base example). Also, if your expression includes special characters, make sure to escape them (\).

Thanks for the support, I will take it from here.


@chetan85 You can use the Regex.Escape method to escape the reserved characters in your strings:

string[] wordsToSearch = new string[] { "aspose|(", "family", "eduardo", "chetan" };
for (int i = 0; i < wordsToSearch.Length; i++)
    wordsToSearch[i] = Regex.Escape(wordsToSearch[i]);

Regex regex = new Regex(string.Join("|", wordsToSearch));
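
The same idea can be wrapped in a small helper. This is only a sketch: SearchPatternBuilder and Build are hypothetical names, and sorting longer terms first is an extra assumption on my side (it keeps a short term such as "asp" from matching inside a longer term such as "aspose", because .NET tries alternatives from left to right).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class SearchPatternBuilder
{
    // Builds one alternation regex that matches any of the given terms literally.
    public static Regex Build(IEnumerable<string> terms)
    {
        IEnumerable<string> escaped = terms
            .OrderByDescending(t => t.Length) // assumption: prefer the longest term
            .Select(Regex.Escape);            // '|', '(' etc. become literals

        return new Regex(string.Join("|", escaped));
    }
}
```

For example, SearchPatternBuilder.Build(new[] { "aspose|(", "family" }) produces a regex that matches the literal text aspose|( as well as family, whereas joining the unescaped strings gives a pattern with an unbalanced parenthesis, which the Regex constructor rejects with an ArgumentException.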

Thanks Alexey, that helped.
