Cannot extract a link from a given PDF


#1

Dear All,

I tried to extract a link from a PDF (see attached file below). I tried the following code (VB.NET):

 Private Function extractUrlsFromPDF(ByVal pdfPath As String) As String()
   Dim doc As Pdf.Document = New Pdf.Document(pdfPath)
   Dim urls As List(Of String) = New List(Of String)
   For Each page As Pdf.Page In doc.Pages
     Dim selector As Pdf.Annotations.AnnotationSelector =
     New Pdf.Annotations.AnnotationSelector(New Pdf.Annotations.LinkAnnotation(page, Aspose.Pdf.Rectangle.Trivial))
     page.Accept(selector)
     Dim list As IList(Of Pdf.Annotations.Annotation) = selector.Selected
     For Each linkAnnotation As Pdf.Annotations.LinkAnnotation In list
       If linkAnnotation.Action Is Nothing Then Continue For
       Dim url As String = vbNullString
       If linkAnnotation.Action.GetType() Is GetType(Pdf.Annotations.GoToRemoteAction) Then
         url = CType(linkAnnotation.Action, Pdf.Annotations.GoToRemoteAction).File.Name
       End If
       If linkAnnotation.Action.GetType() Is GetType(Pdf.Annotations.LaunchAction) Then
         url = CType(linkAnnotation.Action, Pdf.Annotations.LaunchAction).File
       End If
       If linkAnnotation.Action.GetType() Is GetType(Pdf.Annotations.GoToURIAction) Then
         url = CType(linkAnnotation.Action, Pdf.Annotations.GoToURIAction).URI
       End If
       If url <> vbNullString Then urls.Add(url)
     Next
   Next
   Return urls.OrderBy(Function(obj) obj).ToArray
 End Function

However, the URL is not detected as a GoToRemoteAction, a LaunchAction, or a GoToURIAction. Why? What code could I use to catch this URL? I wrote a VB.NET sample, but C# or any other .NET-compatible solution would be fine.

Regards.

Attached file: sample.pdf (64.6 KB)


#2

@monir.aittahar

Thank you for contacting support.

Please try the code snippet below in your environment and share your feedback with us.

using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;

// Load the PDF file (dataDir is assumed to hold the folder path of sample.pdf)
Document document = new Document(dataDir + "sample.pdf");
// Traverse all pages of the PDF
foreach (Aspose.Pdf.Page page in document.Pages)
{
    // Select the link annotations on this page
    AnnotationSelector selector = new AnnotationSelector(new Aspose.Pdf.Annotations.LinkAnnotation(page, Aspose.Pdf.Rectangle.Trivial));

    page.Accept(selector);
    // List holding all the selected links
    IList<Annotation> list = selector.Selected;
    // Iterate through each item in the list
    foreach (LinkAnnotation a in list)
    {
        // Handle only links whose action is a GoToURIAction
        if (a.Action is Aspose.Pdf.Annotations.GoToURIAction uriAction)
        {
            // Print the destination URL
            Console.WriteLine("\nDestination: " + uriAction.URI + "\n");
        }
    }
}

#3

@Farhan.Raza,

Thanks for your reply. I translated your code sample into VB.NET:

Private Sub extractUrlsFromPDF2(ByVal pdfPath As String)
  Dim document As Pdf.Document = New Pdf.Document(pdfPath)
  For Each page As Pdf.Page In document.Pages
    Dim selector As Pdf.Annotations.AnnotationSelector =
      New Pdf.Annotations.AnnotationSelector(New Aspose.Pdf.Annotations.LinkAnnotation(page, Pdf.Rectangle.Trivial))
    page.Accept(selector)
    Dim list As IList(Of Pdf.Annotations.Annotation) = selector.Selected
    For Each a As Pdf.Annotations.LinkAnnotation In list
      ' TypeOf ... Is evaluates to False when Action is Nothing, so no NullReferenceException is thrown
      If TypeOf a.Action Is Pdf.Annotations.GoToURIAction Then
        Console.WriteLine(vbNewLine & "Destination: " & CType(a.Action, Pdf.Annotations.GoToURIAction).URI)
      End If
    Next
  Next
End Sub

The Console.WriteLine line is never reached, i.e. no action of type GoToURIAction was found.
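
One way to narrow this down is to print the runtime type of each link annotation's action instead of testing for a single type. The following is a minimal C# sketch, not a confirmed fix: it assumes the Aspose.PDF for .NET assembly is referenced, the class name LinkActionDump and the default path "sample.pdf" are placeholders, and it walks page.Annotations directly rather than using AnnotationSelector.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;

class LinkActionDump
{
    static void Main(string[] args)
    {
        // Placeholder path: pass the real file as the first argument.
        string pdfPath = args.Length > 0 ? args[0] : "sample.pdf";
        Document document = new Document(pdfPath);

        foreach (Page page in document.Pages)
        {
            foreach (Annotation annotation in page.Annotations)
            {
                // Skip anything that is not a link annotation.
                LinkAnnotation link = annotation as LinkAnnotation;
                if (link == null) continue;

                // A link may carry an action, a destination, or neither.
                string actionType = link.Action == null
                    ? "(no action)"
                    : link.Action.GetType().FullName;
                string destinationType = link.Destination == null
                    ? "(no destination)"
                    : link.Destination.GetType().FullName;

                Console.WriteLine("Page {0}: action = {1}, destination = {2}",
                    page.Number, actionType, destinationType);

                // Print the URI when the action turns out to be a GoToURIAction.
                GoToURIAction uriAction = link.Action as GoToURIAction;
                if (uriAction != null)
                {
                    Console.WriteLine("  URI: " + uriAction.URI);
                }
            }
        }
    }
}
```

If no line is printed for the page that contains the link, the annotation is not being read at all. If a line is printed with a different action type, that type is the one to handle.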


#4

@monir.aittahar

The same code works on our side in both C# and VB.NET, and the link is printed to the console. Please make sure you are using Aspose.PDF for .NET 19.9, and share a narrowed-down sample application if you still notice the problem.

Moreover, if you do not have a valid license, please consider applying for a free 30-day temporary license so that you can test the API at its full capacity.