Threading PDF pages: absorber and processing

I have no idea how we are going to do this via email, but I would like to iterate through all pages of a PDF document, with each page running in its own thread. I would like to cap the number of concurrent threads with a parameter. Nothing I try is working.

''Mase Woods
''10/25/2024
''Main Function to Process Schedules and Tags for ThreadQueue
Function ProcessPlanset(ByVal CompanyID As Integer, ByVal DocID As Integer, ByVal documentBytesArray As Byte()) As Boolean
Dim iReturn As Boolean = True
Dim pdfDocument As Aspose.Pdf.Document
Dim Page As Aspose.Pdf.Page

  Dim itPage As PageIteration
  Dim itTotal As New PageIteration
  Dim iPageReturn As Boolean

  Try


      Dim maxConcurrentThreads As Integer = 5 ' Set the maximum number of concurrent threads

      ' Initialize the semaphore with the maximum number of concurrent threads
      semaphore = New Semaphore(maxConcurrentThreads, maxConcurrentThreads)


      ' Attach event handler
      AddHandler PageProcessed, AddressOf OnPageProcessed



      pdfDocument = New Aspose.Pdf.Document(FullFileName)
      itTotal.FileName = pdfDocument.FileName
      Dim pageNum As Integer = 0
      For Each Page In pdfDocument.Pages
          pageNum = pageNum + 1
          itPage = New PageIteration
          itPage.PageNumber = Page.Number
          itPage.FileName = pdfDocument.FileName
          ' Initialize the countdown event with the number of pages
          countdown = New CountdownEvent(pdfDocument.Pages.Count)

          'iPageReturn = ProcessPage(Page)
          Try
              ThreadPool.QueueUserWorkItem(AddressOf ProcessPage, Page)

          Catch ex As Exception
              Stop
          End Try
          itPage.EndTime = Date.Now
          itPage.ElapsedTime = itPage.GetTimeDifference(itPage.StartTime, itPage.EndTime)
          itTotal.AllText = itTotal.AllText & (System.IO.Path.GetFileName(itPage.FileName) & " Page Number: " & itPage.PageNumber & ": " & itPage.ElapsedTime & vbCrLf)

      Next

      ' Wait for all threads to complete
      countdown.Wait()

      itTotal.EndTime = Date.Now
      itTotal.ElapsedTime = itTotal.GetTimeDifference(itTotal.StartTime, itTotal.EndTime)
      itTotal.AllText = itTotal.AllText & (System.IO.Path.GetFileName(itTotal.FileName) & "TOTAL TIME: " & itTotal.ElapsedTime)

  Catch ex As Exception
      iReturn = False
  End Try
  Stop


  Return iReturn

End Function

''Mase Woods
''10/25/2024
''Main Function to Process Page with Schedules and Tags
Function ProcessPage(ByVal Page As Aspose.Pdf.Page) As Boolean
' Wait until the semaphore is available
semaphore.WaitOne()

  Dim iReturn As Boolean = True
  'Find all tables on page

  Try
      Dim pageNumber As Integer = CInt(Page.Number)
      ' Your file processing logic here
      Console.WriteLine($"Processing page {pageNumber}")

      ' Create a lock object for synchronization
      Dim lockObject As New Object()

      SyncLock lockObject
          Dim absorber As New Aspose.Pdf.Text.TableAbsorber
          Try
              'absorber = getAbsorber(Page)
              absorber.Visit(Page)
          Catch ex As Exception
              Console.WriteLine(ex.Message)
          End Try


          If absorber.TableList.Count > 0 Then 'Page has at least one grid

              'Find grids that are schedules and return a list of schedules
              Dim ScheduleGrids As List(Of ScheduleGrid)
              ScheduleGrids = getSchedules(absorber)

              'Process schedules
              If ScheduleGrids.Count > 0 Then  'Has grids and schedules
                  'Process grids from the list of schedules
                  For Each grdTable In ScheduleGrids
                      Dim iprocess As Boolean
                      iprocess = processTags(grdTable)
                  Next
                  ScheduleGrids = Nothing
              End If

          End If

          RaiseEvent PageProcessed(Page)
      End SyncLock
  Catch ex As Exception
      iReturn = False
  Finally
      ' Release the semaphore
      semaphore.Release()
      ' Signal the countdown event
      countdown.Signal()
  End Try


  Return iReturn

End Function

''Mase Woods
''10/25/2024
''Function Raised on Thread Completion
Private Sub OnPageProcessed(page As Aspose.Pdf.Page)
Console.WriteLine($"Page {page.Number} has been processed.")
End Sub

@maseyo

To process PDF pages in parallel using Aspose.PDF, you can utilize threading, but it’s important to note that Aspose.PDF is designed to be thread-safe only when a single document is accessed by one thread at a time. This means that while you can use multiple threads, each thread should handle a separate PDF document, or you should ensure that only one thread processes a specific document at any given time.

Here’s a refined approach to your implementation, ensuring that you adhere to the thread-safety guidelines:

  1. Semaphore for Thread Limiting: You are already using a semaphore to limit the number of concurrent threads, which is a good practice.

  2. Processing Each Page: You can continue to use ThreadPool.QueueUserWorkItem to queue each page, but because every page belongs to the same document, the work inside ProcessPage must be synchronized so that only one thread touches the document at a time.

  3. Avoiding Parallel Processing of the Same Document: Make sure that no two threads are processing the same document simultaneously. You can achieve this with a lock that all threads share, or by assigning each document to a single thread.

Here’s a modified version of your ProcessPage function that includes proper synchronization:

Function ProcessPage(ByVal Page As Aspose.Pdf.Page) As Boolean
    ' Wait until the semaphore is available
    semaphore.WaitOne()

    Dim iReturn As Boolean = True
    Try
        Dim pageNumber As Integer = CInt(Page.Number)
        Console.WriteLine($"Processing page {pageNumber}")

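        ' lockObject must be one object shared by all threads (for example a
        ' class-level field). A lock object created inside this function is
        ' private to each call and synchronizes nothing.
        SyncLock lockObject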
            Dim absorber As New Aspose.Pdf.Text.TableAbsorber
            absorber.Visit(Page)

            If absorber.TableList.Count > 0 Then
                Dim ScheduleGrids As List(Of ScheduleGrid) = getSchedules(absorber)

                If ScheduleGrids.Count > 0 Then
                    For Each grdTable In ScheduleGrids
                        processTags(grdTable)
                    Next
                End If
            End If

            RaiseEvent PageProcessed(Page)
        End SyncLock
    Catch ex As Exception
        iReturn = False
    Finally
        ' Release the semaphore
        semaphore.Release()
        ' Signal the countdown event
        countdown.Signal()
    End Try

    Return iReturn
End Function
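
One detail on the calling side may also matter. In the posted ProcessPlanset loop, the CountdownEvent is created inside the For Each, so each iteration replaces it, and worker threads can signal an instance that countdown.Wait() is not waiting on. The sketch below is a minimal, library-free illustration of the same semaphore and countdown pattern with both objects created once, before any work is queued. The names BoundedWorkDemo, RunBounded and DoWork are hypothetical stand-ins, and DoWork only sleeps; it is not Aspose.PDF code.

```vbnet
Imports System
Imports System.Threading

Module BoundedWorkDemo
    ' Hypothetical stand-in for per-item work; not part of Aspose.PDF.
    Private Sub DoWork(itemNumber As Integer)
        Thread.Sleep(10)
    End Sub

    ' Runs itemCount work items with at most maxConcurrent running at once.
    ' Returns the number of items that completed without an exception.
    Function RunBounded(itemCount As Integer, maxConcurrent As Integer) As Integer
        Dim completed As Integer = 0

        ' Both primitives are created ONCE, before the loop.
        ' (Disposal is omitted here to keep the sketch short.)
        Dim gate As New SemaphoreSlim(maxConcurrent, maxConcurrent)
        Dim countdown As New CountdownEvent(itemCount)

        For i As Integer = 1 To itemCount
            Dim itemNumber As Integer = i ' copy the loop variable for the lambda
            ThreadPool.QueueUserWorkItem(
                Sub(state As Object)
                    gate.Wait() ' blocks while maxConcurrent items are running
                    Try
                        DoWork(itemNumber)
                        Interlocked.Increment(completed)
                    Catch ex As Exception
                        Console.WriteLine($"Item {itemNumber} failed: {ex.Message}")
                    Finally
                        gate.Release()
                        countdown.Signal() ' exactly one signal per queued item
                    End Try
                End Sub)
        Next

        ' Returns only after every queued item has signalled.
        countdown.Wait()
        Return completed
    End Function

    Sub Main()
        Console.WriteLine($"Completed: {RunBounded(20, 5)}")
    End Sub
End Module
```

This only bounds and awaits the work. It does not make concurrent access to a single Aspose.PDF document safe, which is why the SyncLock above is still needed when all pages come from one document.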

Important Considerations:

  • Thread Safety: Ensure that the ProcessPage function does not access shared resources without proper synchronization.
  • Error Handling: Implement robust error handling to manage exceptions that may arise during processing.
  • Testing: Test your implementation with various PDF documents to ensure that it behaves as expected, especially with documents containing right-to-left text, as these may introduce additional complexities.
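
On the error-handling point: a Boolean returned from a method queued with ThreadPool.QueueUserWorkItem is discarded, so a failure inside a worker is easy to miss. One possible approach, sketched below under the assumption that failures should be reported after all work finishes, is to record exceptions in a thread-safe collection. ErrorCollectionDemo, RunAndCollectErrors and DoWork are hypothetical names, and the failing condition is invented purely for illustration.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Threading.Tasks

Module ErrorCollectionDemo
    ' Hypothetical work item that fails for every multiple of 4.
    Private Sub DoWork(itemNumber As Integer)
        If itemNumber Mod 4 = 0 Then
            Throw New InvalidOperationException($"Item {itemNumber} failed")
        End If
    End Sub

    ' Runs items 1..itemCount in parallel and returns every exception raised.
    Function RunAndCollectErrors(itemCount As Integer) As ConcurrentQueue(Of Exception)
        Dim errors As New ConcurrentQueue(Of Exception)()

        ' The upper bound of Parallel.For is exclusive, hence itemCount + 1.
        Parallel.For(1, itemCount + 1,
            Sub(i As Integer)
                Try
                    DoWork(i)
                Catch ex As Exception
                    ' ConcurrentQueue is safe to call from many threads.
                    errors.Enqueue(ex)
                End Try
            End Sub)

        Return errors
    End Function

    Sub Main()
        Dim errors = RunAndCollectErrors(10)
        Console.WriteLine($"{errors.Count} item(s) failed")
        For Each ex In errors
            Console.WriteLine(ex.Message)
        Next
    End Sub
End Module
```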

By following these guidelines, you should be able to effectively process PDF pages in parallel while adhering to the thread-safety requirements of the Aspose.PDF library. If you encounter any specific issues or exceptions, feel free to ask for further assistance!

Are you saying that if I have many PDF files, I can process each file in its own thread, but I cannot process each page within one file in its own thread?

@maseyo

Yes, your understanding is correct.
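
As a rough illustration of that per-document pattern, the sketch below runs several files in parallel while each Aspose.Pdf.Document is opened, read, and disposed by a single thread. It assumes a project reference to Aspose.PDF for .NET. PerDocumentDemo, CountTables, ProcessFiles and maxConcurrentDocuments are hypothetical names, and counting tables is only a placeholder for the real schedule and tag processing.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Threading.Tasks

Module PerDocumentDemo
    ' Counts the tables in one PDF. Only the calling thread touches this Document.
    Function CountTables(pdfPath As String) As Integer
        Dim tableCount As Integer = 0
        Using pdfDocument As New Aspose.Pdf.Document(pdfPath)
            ' Pages are visited sequentially within the document.
            For Each page As Aspose.Pdf.Page In pdfDocument.Pages
                Dim absorber As New Aspose.Pdf.Text.TableAbsorber()
                absorber.Visit(page)
                tableCount += absorber.TableList.Count
            Next
        End Using
        Return tableCount
    End Function

    ' Processes many files, with at most maxConcurrentDocuments in flight.
    Sub ProcessFiles(pdfPaths As IEnumerable(Of String), maxConcurrentDocuments As Integer)
        Dim options As New ParallelOptions With {
            .MaxDegreeOfParallelism = maxConcurrentDocuments
        }
        Parallel.ForEach(pdfPaths, options,
            Sub(pdfPath As String)
                Dim tables As Integer = CountTables(pdfPath)
                Console.WriteLine($"{pdfPath}: {tables} table(s)")
            End Sub)
    End Sub
End Module
```

Here the concurrency limit applies to documents rather than pages, so a single large file still runs on one thread.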