Getting a hyperlink the same way Word does on mouse click or copy-paste

monir.aittahar · April 28, 2021, 12:19pm

Dear all,

We want to extract links from Word documents with Aspose (and also from Excel worksheets and PowerPoint presentations, but that may be a separate topic), the same way Microsoft Word does when a user clicks on a link or copies it to the clipboard.

Indeed, we realized that Microsoft Word somehow decodes those links and, in particular, “fixes” them if they are (even badly) percent-encoded.

The sample below shows two Word hyperlinks to a remote location:

First Link {HYPERLINK "\\\\XYZ\\X2%20YYYéYZZZ\\loremipsum"}
Second Link { HYPERLINK "file:///\\\\XYZ\\X2%2520YYYéYZZZ\\loremipsum" }

Both links are modified on click or copy-paste to: \\XYZ\X2 YYYéYZZZ\loremipsum

This does not seem to be trivial processing. As we can see, both original links are mis-encoded:

é should be percent-encoded too (%c3%a9)
spaces are encoded once in the first link ( “ ” → %20) and twice in the second one ( “ ” → %20 → %2520 )
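
To make the two points above concrete, here is a minimal sketch of how a correctly encoded and a doubly encoded value come about, using the standard .NET `Uri.EscapeDataString` method. The class and method names (`EncodingDemo`, `EncodeTimes`) are hypothetical and only serve the illustration.

```csharp
using System;

public static class EncodingDemo
{
    // Percent-encodes a value the given number of times in a row.
    // Encoding twice is what turns " " into "%2520": the "%" of "%20"
    // is itself encoded as "%25" on the second pass.
    public static string EncodeTimes(string value, int times)
    {
        for (int i = 0; i < times; i++)
        {
            value = Uri.EscapeDataString(value);
        }
        return value;
    }
}
```

With this sketch, `EncodeTimes(" ", 1)` gives `%20`, `EncodeTimes(" ", 2)` gives `%2520`, and `EncodeTimes("é", 1)` gives `%C3%A9`, which is the form the é in the sample links would have if it had been encoded.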

Before we write code to handle those cases, and potentially deal with edge cases, does Aspose provide a feature to get those links the way Microsoft Word does?

A sample DOCX file containing these hyperlinks is attached.

Best regards,
Monir

hyperlinks_with_encoded_urls.docx (37.2 KB)

tahir.manzoor · April 28, 2021, 3:07pm

@monir.aittahar

Aspose.Words does not provide APIs to encode or decode the hyperlink text. With Aspose.Words, you can get the hyperlink value as shown in MS Word.

If you unzip your document and check document.xml, the hyperlink text is the same as that returned by Aspose.Words. So, Aspose.Words reads the hyperlink content correctly. Please check the attached images for details.

image.png
image.png

Could you please share some more detail about your requirement, along with the expected value of the hyperlink? We will then provide you with more information about your query.

monir.aittahar · April 28, 2021, 11:58pm

Hello @tahir.manzoor,

Thanks for your answer. I understand my question could be out of scope, since it focuses more on Word's behavior than on the file content itself.

Our client lists hyperlinks from a set of stored documents to check whether they are broken. If we extract those URLs “as is”, they won’t work, since they are badly encoded. Microsoft Word, on the other hand, seems to figure out that a remote path could have been encoded twice (albeit only partially, since the é character is not encoded) when a user clicks on the link or copies it, and therefore produces valid URLs in the given cases (it doesn’t seem to do this for HTTP links, though).

I was wondering if, with Aspose, we could mimic Word's behavior regarding link retrieval.

Best regards,
Monir

tahir.manzoor · April 29, 2021, 9:18am

@monir.aittahar

With Aspose.Words, you can extract content from a document. If the link you copy from MS Word works at your end, you can use Aspose.Words to extract the link from the document, and it will also work.

Could you please ZIP and attach the problematic output Word document generated by Aspose.Words, along with the expected output Word document? We will then provide you with more information on it.

monir.aittahar · April 29, 2021, 2:43pm

@tahir.manzoor,

Thank you for your answer. Actually, this is not about a Word document generated by Aspose, but rather about reproducing with Aspose what Word does when a user copies a link and pastes it into a browser, or clicks on it. As you mentioned, it is more a feature of Word than something tied to the DOCX format.

Steps to reproduce:

Create a new Word document
Add a field { HYPERLINK "file:///\\\\XYZ\\X2%2520YYYéYZZZ\\loremipsum" }
Hide field codes (Alt+F9)
Copy the link (right-click → Copy Hyperlink)
Paste the link into a Notepad window: the pasted link has been auto-corrected (\\XYZ\X2 YYYéYZZZ\loremipsum)

That said, the more we talk about it, the more obvious it seems to me that this is out of scope for Aspose.

Best regards,
Monir

tahir.manzoor · April 29, 2021, 4:49pm

@monir.aittahar

Aspose.Words returns \\XYZ\X2%20YYYéYZZZ\loremipsum as the hyperlink output with the following code example. This is the correct behavior of Aspose.Words. You can convert the string using the .NET String APIs.

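using System;
using Aspose.Words;
using Aspose.Words.Fields;

// MyDir is the folder that contains the sample document.
Document doc = new Document(MyDir + "hyperlinks_with_encoded_urls (1).docx");
foreach (Field field in doc.Range.Fields)
{
    // Only HYPERLINK fields expose an address.
    if (field.Type == FieldType.FieldHyperlink)
    {
        FieldHyperlink link = (FieldHyperlink)field;
        Console.WriteLine(link.Address);
    }
}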
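
As a rough illustration of the string conversion mentioned above, here is a minimal sketch. It is not an Aspose.Words feature, and it is not a confirmed description of what Word does internally. It assumes that, for UNC paths and file:/// links like the two samples in this thread, stripping the file:/// prefix and percent-decoding repeatedly until the string stops changing is enough. The names `HyperlinkNormalizer` and `Normalize`, and the cap of five decoding passes, are hypothetical choices made for the example.

```csharp
using System;

public static class HyperlinkNormalizer
{
    private const string FilePrefix = "file:///";
    private const int MaxPasses = 5; // arbitrary safety cap (assumption)

    // Returns a best-effort "decoded" form of a UNC or file:/// address.
    // Other addresses (for example HTTP links) are returned unchanged.
    public static string Normalize(string address)
    {
        if (string.IsNullOrEmpty(address))
            return address;

        bool isFileUri = address.StartsWith(FilePrefix, StringComparison.OrdinalIgnoreCase);
        bool isUncPath = address.StartsWith(@"\\", StringComparison.Ordinal);
        if (!isFileUri && !isUncPath)
            return address;

        if (isFileUri)
            address = address.Substring(FilePrefix.Length);

        // Decode until nothing changes, so "%2520" becomes "%20" and then " ".
        // Characters that were never encoded, such as "é", pass through as-is.
        for (int pass = 0; pass < MaxPasses; pass++)
        {
            string decoded = Uri.UnescapeDataString(address);
            if (decoded == address)
                break;
            address = decoded;
        }
        return address;
    }
}
```

Calling `HyperlinkNormalizer.Normalize(link.Address)` inside the loop above would be one way to apply it. Note that a file name that legitimately contains a literal sequence such as %20 would be altered by this approach, so the result should be checked against real data.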

monir.aittahar · April 29, 2021, 6:07pm

@tahir.manzoor,

I understand the point. Thanks for your answers.

Best regards,
Monir