Text fragment at coordinates

sjoshi · March 1, 2017, 8:23am

Does the Aspose .NET library allow fragmenting text at a specified coordinate location?
We have PDFs that need text to be erased at a specified coordinate location and replaced with spaces.
Currently we white-box the rectangular area and overlay it with new data downstream. This adds overhead and is not an efficient solution.

asad.ali · March 2, 2017, 5:38am

Hi There,

Thanks for contacting support.

To add text at specified coordinates inside a page, please check the following code snippet. You can use the TextBuilder class to append TextFragment objects at a specified location. I have also attached the sample input/output documents used in the snippet.

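// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
// dataDir is the folder that holds the sample files
Document doc = new Document(dataDir + "SampleText.pdf");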

// Create a TextFragment
TextFragment textFragment = new TextFragment(" ");
textFragment.Position = new Position(97, 605);

// Set text properties
textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.White);
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Black);

// Create a TextBuilder and append the TextFragment
TextBuilder textBuilder = new TextBuilder(doc.Pages[1]);
textBuilder.AppendText(textFragment);

// Save the modified PDF
doc.Save(dataDir + "SampleText_out.pdf");

You may also check the “Add Text in PDF” section in our API documentation. If you need any further assistance, please feel free to contact us.

Best Regards,

sjoshi · March 2, 2017, 1:27pm

Dim pdf As New Aspose.Pdf.Document(PDFfilename)
Dim TextFragmentAbsorberAddress As New Aspose.Pdf.Text.TextFragmentAbsorber()

' Set search options to limit to a specific rectangular area
TextFragmentAbsorberAddress.TextSearchOptions.LimitToPageBounds = True
TextFragmentAbsorberAddress.TextSearchOptions.Rectangle = New Aspose.Pdf.Rectangle(lx, ly, ux, uy)

' Accept the text fragments within the specified rectangle
pdf.Pages(1).Accept(TextFragmentAbsorberAddress)

' Loop through the found text fragments and clear their text content
For Each tf As Aspose.Pdf.Text.TextFragment In TextFragmentAbsorberAddress.TextFragments
    tf.Text = ""
Next

' Save the modified PDF
pdf.Save(PDFfilename)

Figured it out… the above code finds all text within the rectangular area and replaces it with an empty string.
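The `Rectangle(lx, ly, ux, uy)` call above takes the lower-left corner first and the upper-right corner second. If the coordinates come from an external source and the corner order is not guaranteed, they can be normalized before building the rectangle. The following is a minimal sketch; `RectCoords` and `Normalize` are hypothetical names for illustration and are not part of the Aspose API.

```csharp
using System;

// Hypothetical helper, not part of the Aspose API.
public static class RectCoords
{
    // Takes any two opposite corners and returns them ordered as
    // (lower-left x, lower-left y, upper-right x, upper-right y).
    public static (double Llx, double Lly, double Urx, double Ury) Normalize(
        double x1, double y1, double x2, double y2)
    {
        return (Math.Min(x1, x2), Math.Min(y1, y2),
                Math.Max(x1, x2), Math.Max(y1, y2));
    }
}

public static class RectCoordsDemo
{
    public static void Main()
    {
        // Corners supplied in the "wrong" order still yield a valid ordering.
        var r = RectCoords.Normalize(300, 700, 100, 650);
        Console.WriteLine($"{r.Llx}, {r.Lly}, {r.Urx}, {r.Ury}");
    }
}
```

The four returned values can then be passed, in order, to the rectangle constructor used in the snippet above.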

asad.ali · March 3, 2017, 8:12am

Hi There,

Thanks for your feedback and for sharing the code snippet. It will help others implement similar functionality. Please keep using our API, and if you have any other query, feel free to let us know. We will be more than happy to extend our support.

Best Regards,

sjoshi · June 18, 2018, 2:12pm

Hello Support,
I have noticed an issue with the Aspose DLL. I have code that extracts text within a rectangular area and replaces it with an empty string.
The code works fine on smaller PDFs. It crashes on a PDF that is larger than 20MB.

I do the following in code:
1) Open a CSV file with the name of the PDF document.
2) Open the PDF file specified in the CSV file.
3) Extract text at the location specified.
4) Run the code pasted below:

Dim License As New Aspose.Pdf.License

' Set the license file path correctly
License.SetLicense("C:\Path\To\Your\License\Aspose.Total.lic")

' Build the path once and reuse it for opening and saving
PDFfilename = Path.Combine(PDFPath, SequenceStep.Sequence.CurrentFile.GetFileNameWithoutExtension & ".pdf")
Dim pdf As New Aspose.Pdf.Document(PDFfilename)

While Not MyReader.EndOfData
    currentRow = MyReader.ReadFields() ' Row from the CSV file that gives the page number for extraction
    
    ' Check if the currentRow has enough fields before accessing currentRow(4)
    If currentRow.Length > 4 Then
        Dim pageNumber As Integer = CInt(currentRow(4))
        
        ' Ensure the page number is within the valid range
        If pageNumber >= 1 AndAlso pageNumber <= pdf.Pages.Count Then
            Dim TextFragmentAbsorberAddress As New Aspose.Pdf.Text.TextFragmentAbsorber()
            TextFragmentAbsorberAddress.TextSearchOptions.LimitToPageBounds = True
            TextFragmentAbsorberAddress.TextSearchOptions.Rectangle = New Aspose.Pdf.Rectangle(Specs(0), Specs(1), Specs(2), Specs(3))
            
            ' Accept the text fragments within the specified rectangle on the specified page
            pdf.Pages(pageNumber).Accept(TextFragmentAbsorberAddress)
            
            For Each tf As Aspose.Pdf.Text.TextFragment In TextFragmentAbsorberAddress.TextFragments
                If Not String.IsNullOrWhiteSpace(tf.Text) Then
                    tf.Text = "" ' Replace text with an empty string
                End If
            Next
        Else
            ' Handle invalid page number
            Console.WriteLine("Invalid page number: " & pageNumber)
        End If
    Else
        ' Handle rows with insufficient data
        Console.WriteLine("Row does not contain enough data")
    End If
End While

' Flatten the PDF and save the modified document
pdf.Flatten()
pdf.Save(PDFfilename)

Please advise on a solution, as this is holding up a crucial release.

Thanks,
Shilpa

asad.ali · June 18, 2018, 7:54pm

@sjoshi

Thanks for contacting support.

Would you please share your sample PDF document with us? We will test the scenario in our environment and address it accordingly.

Furthermore, please also share the values of the Specs variable in your code above. This would help us test the scenario. If your sample PDF document is larger than 3.0MB, you may upload it to Google Drive or Dropbox and share the link with us.

sjoshi · June 22, 2018, 2:40pm

Hello,
Is there any other way we can troubleshoot this problem? We will not be able to share the PDF due to confidentiality (HIPAA) regulations.

Thanks!

asad.ali · June 22, 2018, 3:14pm

@sjoshi

Thanks for writing back.

We have tested the scenario with one of our sample PDF documents (32MB) and were unable to notice any issue. Please note that an issue can sometimes be document-specific, and in order to replicate it, we need that specific document. If you cannot share the document here, you may send it in a private message. This way it will only be accessible to Aspose staff.

Please also confirm whether your program crashes while processing a single document of 20MB, or whether it processes more than one file and crashes at some specific PDF document. Please narrow down your use case and share only the problematic source document with us. If the issue is related to our API, we will definitely address it accordingly.

sjoshi · June 22, 2018, 3:34pm

Hello,
I am processing only one PDF at a time. The PDF is 39MB and has 8000+ pages. The first 4000 pages process in around 20 minutes, but the next 4000 crash my computer.
I am deleting text on every other page; the text that gets deleted is inside the rectangular area defined by Specs.
I have all the page numbers to be cleaned in a list and pass that list to the method above. The Aspose method takes the page numbers from the list (1, 3, 5, 7, 9…, as I am deleting content from every other page) and replaces the text on those pages with an empty string.
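A page list like the one described here can be split into fixed-size batches, so that a log line per batch shows roughly where processing stops. The following is a minimal sketch under assumptions: `PageBatcher`, `Split`, the batch size of 500, and the upper page bound of 8001 are all hypothetical and chosen for illustration; none of them come from the Aspose API or from the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the Aspose API.
public static class PageBatcher
{
    // Splits page numbers into consecutive batches of at most batchSize items,
    // preserving the input order.
    public static List<List<int>> Split(IEnumerable<int> pageNumbers, int batchSize)
    {
        if (pageNumbers == null) throw new ArgumentNullException(nameof(pageNumbers));
        if (batchSize < 1) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batches = new List<List<int>>();
        var current = new List<int>(batchSize);
        foreach (int page in pageNumbers)
        {
            current.Add(page);
            if (current.Count == batchSize)
            {
                batches.Add(current);
                current = new List<int>(batchSize);
            }
        }
        if (current.Count > 0) batches.Add(current); // final partial batch
        return batches;
    }
}

public static class PageBatcherDemo
{
    public static void Main()
    {
        // Odd pages 1, 3, 5, ... ; the upper bound 8001 is an assumption.
        var oddPages = Enumerable.Range(1, 8001).Where(p => p % 2 == 1);
        var batches = PageBatcher.Split(oddPages, 500);

        for (int i = 0; i < batches.Count; i++)
        {
            var b = batches[i];
            Console.WriteLine($"Batch {i + 1}: pages {b[0]}..{b[b.Count - 1]} ({b.Count} pages)");
            // The rectangle-limited text replacement for these pages would run here.
        }
    }
}
```

Logging before each batch gives a page range to report even when the document itself cannot be shared.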

asad.ali · June 22, 2018, 4:46pm

@sjoshi

Thanks for sharing further details.

This seems to be a document-specific issue, and we need that document along with the complete list of pages and coordinates where text needs to be replaced with an empty value. We will test the scenario in our environment and address it accordingly. We assure you that we do not disclose shared documents to anyone and that we erase/discard them soon after investigating the scenario.