Hi, Support:
Could you please provide a full demo, based on VB.NET and V23.4, that extracts every type of audio and video embedded in a PDF, as well as images? The embedded audio/video may be an Annotation, ScreenAnnotation, RichMediaAnnotation, or EmbeddedFile. The demo should show how to get the type, name, size, x-y position, and width-height of each element to be extracted; how to detect whether such elements exist in the whole PDF or on a given page; and how to get the total count of each element type in the whole PDF or on a given page.
I hope you can help!
Thanks a lot!
@ducaisoft
Below is a sample code snippet that extracts rich media from a PDF:
' Requires: Imports System.IO, Aspose.Pdf, Aspose.Pdf.Annotations
Dim pdfDocument As New Document(dataDir & "Cannot_Extract_Audios.pdf")
For Each page As Page In pdfDocument.Pages
    For Each annotation As Annotation In page.Annotations
        If TypeOf annotation Is RichMediaAnnotation Then
            Dim ann As RichMediaAnnotation = CType(annotation, RichMediaAnnotation)
            ' Copy the media content to a file once. The file name is fixed,
            ' so a later rich media annotation overwrites an earlier one.
            Using fs As New FileStream(dataDir & "extractedaudio.mpa", FileMode.Create)
                ann.Content.CopyTo(fs)
            End Using
            ' Position and size of the annotation on the page
            Dim rect = ann.Rect
            Dim height = ann.Height
            Dim width = ann.Width
        End If
        If TypeOf annotation Is ScreenAnnotation Then
            Dim ann As ScreenAnnotation = CType(annotation, ScreenAnnotation)
            ' Walk from the annotation's action down to the embedded media clip
            Dim renditionaction As RenditionAction = CType(ann.Action, RenditionAction)
            Dim mediaRendition As MediaRendition = CType(renditionaction.Rendition, MediaRendition)
            Dim mediaclip As MediaClipData = CType(mediaRendition.MediaClip, MediaClipData)
            ' Save the clip under the name stored in the PDF
            Using fs As New FileStream(dataDir & mediaclip.Data.Name, FileMode.Create)
                mediaclip.Data.Contents.CopyTo(fs)
            End Using
        End If
    Next
Next
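The request also asks for a per-type count and for embedded files, which the snippet above does not cover. The following is a minimal sketch only, not a tested demo: the module name MediaInventory and the path "input.pdf" are placeholders, and it assumes that FileSpecification.Params can be missing for some embedded files.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Aspose.Pdf
Imports Aspose.Pdf.Annotations

Module MediaInventory
    Sub Main()
        ' "input.pdf" is a placeholder path.
        Using pdfDocument As New Document("input.pdf")
            Dim totals As New Dictionary(Of String, Integer)()

            For Each page As Page In pdfDocument.Pages
                For Each annotation As Annotation In page.Annotations
                    Dim isMedia As Boolean = TypeOf annotation Is RichMediaAnnotation OrElse
                                             TypeOf annotation Is ScreenAnnotation
                    If Not isMedia Then Continue For

                    ' Report type, name, position and size for each media annotation
                    Dim kind As String = annotation.GetType().Name
                    Dim r As Aspose.Pdf.Rectangle = annotation.Rect
                    Console.WriteLine($"Page {page.Number}: {kind} '{annotation.Name}' at ({r.LLX}, {r.LLY}), size {r.Width} x {r.Height}")

                    ' Tally the annotation under its type name
                    Dim seen As Integer = 0
                    totals.TryGetValue(kind, seen)
                    totals(kind) = seen + 1
                Next
            Next

            ' Embedded files are stored at document level, not per page
            For Each spec As FileSpecification In pdfDocument.EmbeddedFiles
                ' Assumption: Params may be Nothing, so guard before reading Size
                Dim sizeText As String = If(spec.Params IsNot Nothing, spec.Params.Size.ToString(), "unknown")
                Console.WriteLine($"Embedded file '{spec.Name}', size: {sizeText}")
            Next

            For Each entry As KeyValuePair(Of String, Integer) In totals
                Console.WriteLine($"{entry.Key}: {entry.Value}")
            Next
            Console.WriteLine($"EmbeddedFile: {pdfDocument.EmbeddedFiles.Count}")
        End Using
    End Sub
End Module
```

A type that never appears in the document has no entry in the dictionary, so a missing key means a count of zero. To restrict the check to one page, run the inner loop on that page only.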
To extract images, you can use the code snippet below:
Imports Aspose.Pdf
Imports Aspose.Pdf.Text
Imports System.IO

Module Module1
    Sub Main()
        ' Load the PDF document
        Dim pdfDocument As New Document("input.pdf")

        ' Initialize counters for images, audios, and videos.
        ' Note: this snippet only counts images; audioCount and videoCount stay at 0.
        Dim imageCount As Integer = 0
        Dim audioCount As Integer = 0
        Dim videoCount As Integer = 0

        ' Iterate through the pages
        For Each page As Page In pdfDocument.Pages
            ' Collect the image placements on this page
            Dim imagePlacementAbsorber As New ImagePlacementAbsorber()
            page.Accept(imagePlacementAbsorber)

            For Each imagePlacement As ImagePlacement In imagePlacementAbsorber.ImagePlacements
                ' Get image properties
                Dim imageType As String = imagePlacement.FileType
                Dim imageName As String = imagePlacement.Name
                ' Area (width x height), not the file size in bytes
                Dim imageSize As Long = imagePlacement.Width * imagePlacement.Height
                ' Rectangle coordinates are Double values
                Dim imageX As Double = imagePlacement.Rectangle.LLX
                Dim imageY As Double = imagePlacement.Rectangle.LLY
                Dim imageWidth As Double = imagePlacement.Rectangle.URX - imagePlacement.Rectangle.LLX
                Dim imageHeight As Double = imagePlacement.Rectangle.URY - imagePlacement.Rectangle.LLY

                ' Output image information
                Console.WriteLine($"Image Type: {imageType}")
                Console.WriteLine($"Image Name: {imageName}")
                Console.WriteLine($"Image Size: {imageSize} square points")
                Console.WriteLine($"Image Position (X, Y): ({imageX}, {imageY})")
                Console.WriteLine($"Image Size (Width x Height): {imageWidth} x {imageHeight}")
                Console.WriteLine()

                ' Count images
                imageCount += 1
            Next
        Next

        ' Output total counts
        Console.WriteLine($"Total Images: {imageCount}")
        Console.WriteLine($"Total Audios: {audioCount}")
        Console.WriteLine($"Total Videos: {videoCount}")

        ' Save the extracted images.
        ' The counter is declared outside the page loop so that images from
        ' later pages do not overwrite files written for earlier pages.
        Dim imageCounter As Integer = 1
        For Each page As Page In pdfDocument.Pages
            Dim imagePlacementAbsorber As New ImagePlacementAbsorber()
            page.Accept(imagePlacementAbsorber)

            For Each imagePlacement As ImagePlacement In imagePlacementAbsorber.ImagePlacements
                Using imageStream As Stream = imagePlacement.GetImageStream()
                    Using imageFileStream As New FileStream($"output_image_{imageCounter}.png", FileMode.Create)
                        imageStream.CopyTo(imageFileStream)
                        imageCounter += 1
                    End Using
                End Using
            Next
        Next
    End Sub
End Module
However, please note that extracting media files from a PDF can be quite complex because of how PDF files are structured. If you face any issues, please share your sample file with us. We will test the scenario in our environment and address it accordingly.
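The request also asked how to detect whether a given page contains images at all. A minimal sketch, assuming a placeholder path "input.pdf", a hypothetical module name PageImageCheck, and a page number chosen by the caller:

```vbnet
Imports System
Imports Aspose.Pdf

Module PageImageCheck
    Sub Main()
        ' "input.pdf" and the page number are placeholders.
        Dim pageNumber As Integer = 1 ' page numbers start at 1

        Using pdfDocument As New Document("input.pdf")
            ' Guard against a page number outside the document
            If pageNumber < 1 OrElse pageNumber > pdfDocument.Pages.Count Then
                Console.WriteLine($"Page {pageNumber} does not exist.")
                Return
            End If

            ' Collect the image placements on the requested page only
            Dim absorber As New ImagePlacementAbsorber()
            pdfDocument.Pages(pageNumber).Accept(absorber)

            Dim count As Integer = absorber.ImagePlacements.Count
            Console.WriteLine($"Page {pageNumber} has images: {count > 0} (count: {count})")
        End Using
    End Sub
End Module
```

Looping this check over every page and summing the counts gives the total for the whole document.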
Thanks for the help!
However, our developer reports a problem: “the ImageAbsorber class is not defined.”
Which version of the API DLL supports this feature? Does version 23.4 of Pdf.dll not support the ImageAbsorber class? If it does, how do I enable the ImageAbsorber class in Pdf.dll version 23.4?
@ducaisoft
The class is actually ImagePlacementAbsorber. We apologize for the confusion and have edited the code in our previous reply accordingly.