Verify if Internal Hyperlinks go to Correct Bookmarked Page Locations in Word Document using C# .NET | Find Hyperlink Fields using Regex

knr · July 16, 2020, 5:08am

I have a Word document that contains a lot of internal hyperlinks in italic format; clicking one navigates to the exact page.

I have verified that the hyperlinks are italic, and I am getting the hyperlink sub-addresses, such as page1, page2, etc.

awais.hafeez · July 16, 2020, 5:48pm

@knr,

Please see the attached sample Word DOCX document (input 215819.zip (10.0 KB)). It contains two internal HYPERLINK fields. The first link points to a bookmark on the second page, while the second points to a bookmark on the last (4th) page.

Do you need to make sure that the referenced bookmark is located on the right page? If yes, then you should obtain the bookmark via its name and use the ‘document layout’ facility to determine on which page the bookmark’s start or end is located.

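// Requires: using System; using Aspose.Words; using Aspose.Words.Fields; using Aspose.Words.Layout;
Document doc = new Document("E:\\Temp\\input.docx");
LayoutCollector collector = new LayoutCollector(doc);

foreach (Field field in doc.Range.Fields)
{
    if (field.Type == FieldType.FieldHyperlink)
    {
        FieldHyperlink link = (FieldHyperlink)field;
        // SubAddress holds the bookmark name of an internal link; it is empty for external links.
        if (!string.IsNullOrWhiteSpace(link.SubAddress))
        {
            string name = link.SubAddress.Trim();
            Bookmark bm = doc.Range.Bookmarks[name];
            if (bm == null)
                Console.WriteLine("Bookmark {0} was not found", name);
            else
                Console.WriteLine("Bookmark {0} is on Page {1}", name, collector.GetStartPageIndex(bm.BookmarkStart));
        }
    }
}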
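
Once the page number is known, it still has to be compared with the page the link is supposed to reach. The following is only a rough sketch of that comparison. It assumes, purely for illustration, that the sub-address ends in the expected page number (as in the "page1", "page2" names mentioned above); the class and method names are hypothetical and are not part of Aspose.Words.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageNameCheck
{
    // Assumed convention (hypothetical): the bookmark name ends in the
    // expected page number, e.g. "page2" means page 2.
    // Returns null when the name carries no usable trailing number.
    public static int? ExpectedPage(string subAddress)
    {
        if (string.IsNullOrWhiteSpace(subAddress))
            return null;

        Match m = Regex.Match(subAddress.Trim(), @"(\d+)$");
        if (!m.Success)
            return null;

        return int.TryParse(m.Groups[1].Value, out int page) ? page : (int?)null;
    }

    // True only when the name yields a page number and it equals the page
    // reported for the bookmark (for example by GetStartPageIndex above).
    public static bool PointsToExpectedPage(string subAddress, int actualPage)
    {
        int? expected = ExpectedPage(subAddress);
        return expected.HasValue && expected.Value == actualPage;
    }
}
```

Inside the loop above, the call could look like PageNameCheck.PointsToExpectedPage(name, collector.GetStartPageIndex(bm.BookmarkStart)). If your bookmark names follow a different convention, the expected page would have to come from somewhere else, such as a lookup table.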

If this is not what you are looking for, please elaborate on your inquiry by providing complete details of your use case along with a simplified Word document. This will help us understand your scenario and put us in a better position to address your concerns.

knr · July 17, 2020, 10:39am

Thank you for providing the information.
But this code processes all the hyperlinks, and I don’t want all of them. I only want the ones inside square brackets. 1 INDICATIONS AND USAGE.zip (16.7 KB)

Please refer to the uploaded file:
It contains [see Clinical Studies(14.2)], so I need to verify that the text within the [ ] is italic and has a hyperlink.

Besides the hyperlinks in square brackets, the file also contains links such as 5.2, 3.1, www.fda.com, etc. I only need the ones in square brackets.

awais.hafeez · July 17, 2020, 4:28pm

@knr,

Unfortunately, we are unable to download your ZIP file; Windows Defender blocks the download. It seems that “1 INDICATIONS AND USAGE.zip” is infected with a virus. Could you please attach virus-free files here again for testing? We will then investigate the scenario on our side and provide you with more information.

knr · July 20, 2020, 6:03am

1 INDICATIONS AND USAGE.zip (16.7 KB)

awais.hafeez · July 20, 2020, 2:53pm

@knr,

Please try the following code:

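// Requires: System, System.Collections, System.Text.RegularExpressions,
// Aspose.Words, Aspose.Words.Fields, Aspose.Words.Replacing
Document doc = new Document("E:\\Temp\\1 INDICATIONS AND USAGE\\1 INDICATIONS AND USAGE.docx");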
// doc.UpdateFields();

FindAndReplace handler = new FindAndReplace();
FindReplaceOptions opts = new FindReplaceOptions();
opts.Direction = FindReplaceDirection.Backward;
opts.ReplacingCallback = handler;

string searchPattern = @"\[(.*?)\]";
doc.Range.Replace(new Regex(searchPattern), "", opts);

private class FindAndReplace : IReplacingCallback
{
    /// <summary>
    /// NOTE: This is a simplistic method that will only work well when the match
    /// starts at the beginning of a run.
    /// </summary>
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        #region pre process
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // This array is used to store all nodes of the match for further removing.
        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }
        #endregion

        // Determine if all text between [ and ] is italic
        bool isAllItalic = true;
        foreach (Run run in runs)
        {
            if (!string.IsNullOrEmpty(run.Text.Trim()) && !run.Font.Italic)
            {
                isAllItalic = false;
                break;
            }

        }

        // If it is all Italic then collect Hyperlink fields between [ and ]
        if (isAllItalic && runs.Count > 0)
        {
            Console.WriteLine("All text is Italic");

            ArrayList hyperlinks = new ArrayList();
            Node aNode = (Node)runs[0];
            while (aNode != null)
            {
                if (aNode == runs[runs.Count - 1])
                    break;

                if (aNode.NodeType == NodeType.FieldStart)
                {
                    if (((FieldStart)aNode).FieldType == FieldType.FieldHyperlink)
                    {
                        hyperlinks.Add(((FieldStart)aNode).GetField());
                    }
                }

                Node nextNode = aNode.NextPreOrder(aNode.Document);
                aNode = nextNode;
            }

            // Report the result; the collected fields can be cast to FieldHyperlink for further checks.
            Console.WriteLine("Hyperlink fields found between brackets: {0}", hyperlinks.Count);
        }

        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}
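
To see what the search pattern picks up before running it against a document, it can be tried on a plain string. This is just a minimal sketch using the standard .NET Regex class; the class and method names are made up for illustration, and it does not touch the document or Aspose.Words at all.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketPatternDemo
{
    // Same pattern as in the code above: a literal '[', a lazy capture
    // of any characters (except newlines), then the first ']' that follows.
    private static readonly Regex Pattern = new Regex(@"\[(.*?)\]");

    // Returns the text captured between each pair of square brackets,
    // in the order the matches occur in the input.
    public static List<string> FindBracketed(string text)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(text))
            results.Add(m.Groups[1].Value);
        return results;
    }
}
```

Because the quantifier is lazy, each match stops at the first closing bracket, so two bracketed references in one sentence are returned separately. Plain numbers such as 5.2 or 3.1 outside brackets are not matched, which is why the other hyperlinks in the document are skipped.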

knr · July 21, 2020, 9:56am

Excellent, it is working. Thank you for providing the information.