Searching individual PDF pages not working

ityrrell · October 3, 2017, 9:37am

Hi

I want to split a PDF document into individual pages, except where I find “2PW” on a page. Where “2PW” is found, that page and the following page are to be output to a single file.
I am searching individual pages in the PDF document for “2PW”. The problem is that once the phrase is found on one page, every subsequent page also reports a match, even when the phrase isn’t present on those pages.
Please find example code and a TEST document attached. I have downloaded the most recent Aspose.Pdf for .NET DLL, 17.9.0.0.

    Dim licPDF As Aspose.Pdf.License = New Aspose.Pdf.License
    licPDF.SetLicense(clsF2F.Globals.PathLicences)
    Dim txtInputPath As String = "C:\Temp\In"
    Dim txtOutputPath As String = "C:\Temp\Out"

    Dim strFiles As String() = System.IO.Directory.GetFiles(txtInputPath, "*.pdf", IO.SearchOption.TopDirectoryOnly)

    For Each strFile As String In strFiles
        Dim pdfDoc As Aspose.Pdf.Document = Nothing
        pdfDoc = New Aspose.Pdf.Document(strFile)
        Dim textFragmentAbsorber As New Aspose.Pdf.Text.TextFragmentAbsorber("2PW")
        Dim pdfNewDoc As Aspose.Pdf.Document = Nothing
        Dim intPage As Integer = 0
        Dim int2PageWarrant As Integer = 0
        ' int2PageWarrant meaning: 0 = not in a 2-page warrant
        '                          1 = found it, on the first page
        '                          2 = on the second page
        For Each PdfP As Aspose.Pdf.Page In pdfDoc.Pages
            intPage += 1
            ' Search inside a page in the PDF for '2PW', which is the indicator that this is a 2-page Warrant document
            pdfDoc.Pages(intPage).Accept(textFragmentAbsorber)
            Dim textFragmentCollection As Aspose.Pdf.Text.TextFragmentCollection = Nothing
            textFragmentCollection = textFragmentAbsorber.TextFragments
            If Not IsNothing(textFragmentCollection) Then
                ' Loop through the fragments
                For Each textFragment As Aspose.Pdf.Text.TextFragment In textFragmentCollection
                    int2PageWarrant = 1
                Next
            End If

            Dim strOutputFile As String = ""
            If int2PageWarrant = 0 Then
                pdfNewDoc = New Aspose.Pdf.Document
                pdfNewDoc.Pages.Add(PdfP)
                strOutputFile = txtOutputPath & System.IO.Path.GetFileNameWithoutExtension(strFile) & "_" & intPage.ToString.PadLeft(4, "0") & ".pdf"
                pdfNewDoc.Save(strOutputFile)
            ElseIf int2PageWarrant = 1 Then
                pdfNewDoc = New Aspose.Pdf.Document
                pdfNewDoc.Pages.Add(PdfP)
                int2PageWarrant += 1
            ElseIf int2PageWarrant = 2 Then
                pdfNewDoc.Pages.Add(PdfP)
                strOutputFile = txtOutputPath & System.IO.Path.GetFileNameWithoutExtension(strFile) & "_" & intPage.ToString.PadLeft(4, "0") & "II.pdf"
                pdfNewDoc.Save(strOutputFile)
                int2PageWarrant = 0
            Else
                Throw New Exception("UnExpected Warrant Page count")
            End If

        Next
    Next

Many Thanks
Ian Tyrrell

asad.ali · October 3, 2017, 2:16pm

@ityrrell

Thanks for contacting support.

We have tested the scenario with Aspose.Pdf for .NET 17.9 and observed the behavior you described. The cause is that the TextFragmentCollection is not cleared on each iteration of the For Each loop. Once the specified string is found on page 5, the API adds it to the TextFragmentCollection of the TextFragmentAbsorber, and the collection stays filled until the end of the loop.

You can verify this at your end by reading the page number of each found TextFragment through the TextFragment.Page.Number property: in all subsequent iterations, the page number remains 5. To get only the text found on the current page, you need to initialize the TextFragmentAbsorber instance inside the loop, as follows.

    For Each PdfP As Aspose.Pdf.Page In pdfDoc.Pages
        Dim textFragmentAbsorber As New Aspose.Pdf.Text.TextFragmentAbsorber("2PW")
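
For illustration only, here is a minimal sketch of how the whole split could look with the absorber created per page. It is not the code from the original post: the module name, `PageContains`, and `SplitFile` are hypothetical, and the sketch assumes the output folder already exists and that the page following a “2PW” page is always its second page, so that page is not searched again.

```vb
Module SplitWarrantPagesSketch
    ' Hypothetical helper: True when the marker text occurs on this page.
    Function PageContains(pdfPage As Aspose.Pdf.Page, marker As String) As Boolean
        ' A new absorber per call, so TextFragments holds hits from this page only.
        Dim absorber As New Aspose.Pdf.Text.TextFragmentAbsorber(marker)
        pdfPage.Accept(absorber)
        Return absorber.TextFragments.Count > 0
    End Function

    ' Hypothetical routine: writes one file per page, or one file per 2-page warrant.
    Sub SplitFile(inputFile As String, outputFolder As String)
        Dim source As New Aspose.Pdf.Document(inputFile)
        Dim baseName As String = System.IO.Path.GetFileNameWithoutExtension(inputFile)
        ' Holds the first page of a 2-page warrant until its second page arrives.
        Dim pending As Aspose.Pdf.Document = Nothing

        For Each pdfPage As Aspose.Pdf.Page In source.Pages
            Dim outName As String = baseName & "_" & pdfPage.Number.ToString().PadLeft(4, "0"c)
            If pending IsNot Nothing Then
                ' Second page of a warrant: append it and save both pages together.
                pending.Pages.Add(pdfPage)
                pending.Save(System.IO.Path.Combine(outputFolder, outName & "II.pdf"))
                pending = Nothing
            ElseIf PageContains(pdfPage, "2PW") Then
                ' First page of a warrant: keep it until the next page.
                pending = New Aspose.Pdf.Document()
                pending.Pages.Add(pdfPage)
            Else
                ' Ordinary page: save it on its own.
                Dim onePage As New Aspose.Pdf.Document()
                onePage.Pages.Add(pdfPage)
                onePage.Save(System.IO.Path.Combine(outputFolder, outName & ".pdf"))
            End If
        Next

        ' Assumption: a "2PW" page with no following page is still written out.
        If pending IsNot Nothing Then
            pending.Save(System.IO.Path.Combine(outputFolder, baseName & "_last.pdf"))
        End If
    End Sub
End Module
```

A call such as `SplitFile("C:\Temp\In\TEST.pdf", "C:\Temp\Out")` would then process one file; the file name here is only an example. Using `System.IO.Path.Combine` avoids depending on whether the output folder string ends with a backslash.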

We have also tested the scenario using the above-mentioned approach, and the results were fine at our end. For your reference, we have shared the generated output. If you need any further assistance, please feel free to let us know.

TEST2_ 6II.pdf (84.9 KB)