How to extract any type of audio and video embedded in a PDF, as well as images

ducaisoft · October 8, 2023, 11:40pm

Hi, Support:

Would you please provide a full demo, based on VB.NET and version 23.4, that extracts any type of audio and video embedded in a PDF, as well as images? The embedded audio/video may be an Annotation, ScreenAnnotation, RichMediaAnnotation, or EmbeddedFile. The demo should show how to get the type, name, size, x-y position, and width-height of each element to be extracted, how to detect whether such elements exist in the PDF or on a given page, and how to get the total count of each element type in the whole PDF or on a given page.

Hoping for your help.
Thanks a lot!

asad.ali · October 9, 2023, 12:40pm

@ducaisoft

Below is a sample code snippet that extracts rich media from a PDF:

' Requires: Imports Aspose.Pdf, Imports Aspose.Pdf.Annotations, Imports System.IO
' dataDir is the folder that holds the input PDF and receives the extracted files
Dim pdfDocument As New Document(dataDir & "Cannot_Extract_Audios.pdf")
Dim mediaIndex As Integer = 0
For Each page As Page In pdfDocument.Pages
    For Each annotation As Annotation In page.Annotations
        If TypeOf annotation Is RichMediaAnnotation Then
            Dim ann As RichMediaAnnotation = CType(annotation, RichMediaAnnotation)
            mediaIndex += 1
            ' Copy the content stream once, straight to disk; a second CopyTo on the
            ' same stream would start at its end and write an empty file
            Using fs As New FileStream(dataDir & $"extractedaudio_{mediaIndex}.mpa", FileMode.Create)
                ann.Content.CopyTo(fs)
            End Using
            ' Position and dimensions of the annotation on the page
            Dim rect = ann.Rect
            Dim height = ann.Height
            Dim width = ann.Width
        End If
        If TypeOf annotation Is ScreenAnnotation Then
            Dim ann As ScreenAnnotation = CType(annotation, ScreenAnnotation)
            Dim renditionaction As RenditionAction = CType(ann.Action, RenditionAction)
            Dim mediaRendition As MediaRendition = CType(renditionaction.Rendition, MediaRendition)
            Dim mediaclip As MediaClipData = CType(mediaRendition.MediaClip, MediaClipData)
            ' The stored name may include a path, so keep only the file name
            Using fs As New FileStream(dataDir & Path.GetFileName(mediaclip.Data.Name), FileMode.Create)
                mediaclip.Data.Contents.CopyTo(fs)
            End Using
        End If
    Next
Next
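
The original request also asks whether media elements exist on a page and how many there are of each type. The following is a minimal sketch of such an inventory, not a tested solution for version 23.4. It assumes the same RichMediaAnnotation and ScreenAnnotation classes and the Rect property used in the snippet above; "input.pdf" and the module name are placeholders.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Aspose.Pdf
Imports Aspose.Pdf.Annotations

Module MediaAnnotationInventory
    Sub Main()
        ' "input.pdf" is a placeholder path
        Dim pdfDocument As New Document("input.pdf")

        ' Document-wide count per annotation type
        Dim totals As New Dictionary(Of String, Integer)()
        Dim pageNumber As Integer = 0

        For Each page As Page In pdfDocument.Pages
            pageNumber += 1
            Dim foundOnPage As Integer = 0

            For Each ann As Annotation In page.Annotations
                ' Only the two media-carrying annotation types from the snippet above
                Dim typeName As String
                If TypeOf ann Is RichMediaAnnotation Then
                    typeName = "RichMediaAnnotation"
                ElseIf TypeOf ann Is ScreenAnnotation Then
                    typeName = "ScreenAnnotation"
                Else
                    Continue For
                End If

                Dim count As Integer = 0
                totals.TryGetValue(typeName, count)
                totals(typeName) = count + 1
                foundOnPage += 1

                ' Lower-left corner and size of the annotation rectangle, in points
                Dim rect As Aspose.Pdf.Rectangle = ann.Rect
                Console.WriteLine($"Page {pageNumber}: {typeName} at ({rect.LLX}, {rect.LLY}), size {rect.URX - rect.LLX} x {rect.URY - rect.LLY}")
            Next

            Console.WriteLine($"Page {pageNumber}: {foundOnPage} media annotation(s)")
        Next

        If totals.Count = 0 Then
            Console.WriteLine("No media annotations found in the document.")
        End If
        For Each entry As KeyValuePair(Of String, Integer) In totals
            Console.WriteLine($"Total {entry.Key}: {entry.Value}")
        Next
    End Sub
End Module
```

The annotation type alone does not say whether the clip is audio or video; that would have to be inferred from the extracted file name or its content.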

To extract images, you can use the code snippet below:

Imports Aspose.Pdf
Imports System.Drawing.Imaging
Imports System.IO

Module Module1
    Sub Main()
        ' Load the PDF document
        Dim pdfDocument As New Document("input.pdf")

        ' Running total of images; also keeps output file names unique across pages
        Dim imageCount As Integer = 0

        ' Iterate through the pages
        For Each page As Page In pdfDocument.Pages
            ' Collect the image placements on this page
            Dim imagePlacementAbsorber As New ImagePlacementAbsorber()
            page.Accept(imagePlacementAbsorber)

            For Each imagePlacement As ImagePlacement In imagePlacementAbsorber.ImagePlacements
                imageCount += 1

                ' Name and pixel dimensions are read from the underlying image resource
                ' (ImagePlacement.Image); the export format is chosen when saving
                Dim imageName As String = imagePlacement.Image.Name
                Dim pixelWidth As Integer = imagePlacement.Image.Width
                Dim pixelHeight As Integer = imagePlacement.Image.Height

                ' Position and displayed size on the page, in points
                Dim imageX As Double = imagePlacement.Rectangle.LLX
                Dim imageY As Double = imagePlacement.Rectangle.LLY
                Dim imageWidth As Double = imagePlacement.Rectangle.URX - imagePlacement.Rectangle.LLX
                Dim imageHeight As Double = imagePlacement.Rectangle.URY - imagePlacement.Rectangle.LLY

                ' Output image information
                Console.WriteLine($"Image Name: {imageName}")
                Console.WriteLine($"Image Size (pixels): {pixelWidth} x {pixelHeight}")
                Console.WriteLine($"Image Position (X, Y): ({imageX}, {imageY})")
                Console.WriteLine($"Image Size on Page (Width x Height): {imageWidth} x {imageHeight}")
                Console.WriteLine()

                ' Save the image as PNG
                Using imageFileStream As New FileStream($"output_image_{imageCount}.png", FileMode.Create)
                    imagePlacement.Image.Save(imageFileStream, ImageFormat.Png)
                End Using
            Next
        Next

        ' Output total count (this snippet handles images only)
        Console.WriteLine($"Total Images: {imageCount}")
    End Sub
End Module
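
The original question also lists EmbeddedFile as a possible container for audio and video. The following is a minimal sketch, not a tested solution for version 23.4. It assumes that attachments are exposed through the Document.EmbeddedFiles collection as FileSpecification objects, with the same Name and Contents members used for the screen annotation clip earlier. The path "input.pdf", the module name, and the fallback file name are placeholders.

```vbnet
Imports System
Imports System.IO
Imports Aspose.Pdf

Module EmbeddedFileExtraction
    Sub Main()
        ' "input.pdf" is a placeholder path
        Dim pdfDocument As New Document("input.pdf")

        ' Assumption: EmbeddedFiles lists the document-level attachments
        Console.WriteLine($"Embedded files: {pdfDocument.EmbeddedFiles.Count}")

        Dim index As Integer = 0
        For Each fileSpec As FileSpecification In pdfDocument.EmbeddedFiles
            index += 1
            If fileSpec.Contents Is Nothing Then Continue For

            ' The stored name may include a path, so keep only the file name
            Dim safeName As String = Path.GetFileName(fileSpec.Name)
            If String.IsNullOrEmpty(safeName) Then safeName = $"embedded_{index}.bin"

            Using output As New FileStream(safeName, FileMode.Create)
                fileSpec.Contents.CopyTo(output)
                Console.WriteLine($"{safeName}: {output.Length} bytes")
            End Using
        Next
    End Sub
End Module
```

Embedded files have no position on a page, so only the name and size are reported here.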

However, please note that extracting media files from a PDF can be quite complex because of the way PDF files are structured. If you face any issues, please share your sample file with us. We will test the scenario in our environment and address it accordingly.

ducaisoft · October 9, 2023, 2:30pm

Thanks for the help!
However, our developer reports a problem: “the ImageAbsorber class is not defined.”
Which version of the API DLL supports this class? Does version 23.4 of Pdf.dll not support the ImageAbsorber class? If it does, how do we enable it in version 23.4?

asad.ali · October 9, 2023, 8:04pm

@ducaisoft

The class is actually ImagePlacementAbsorber. We apologize for the confusion and have corrected the code in our previous reply.