Get text at a position in a PDF

bnodstr · March 27, 2012, 4:57pm

I’m trying to find an approach that supports two different kinds of PDF redaction. In some files an exact phrase is being replaced; in others I have a fixed phrase followed by a varying value.

The method outlined in “Replace Text in PDF | Aspose.PDF for .NET” works well when the string is a known value.

The other method needs to match a string and allow text in proximity to it to be modified for redaction. The example I’m working with is along the lines of the string “GPA:3.5”. I need to find “GPA” (which I can do currently) and redact the text to the right of its position. The value for GPA will vary in each of the documents being processed.

I’ve dug through the documentation and can’t find a technique to access a fragment based upon position.

I may be approaching the problem incorrectly in terms of how Aspose allows interaction with the text in a PDF. If there is an alternate approach, let me know.

Thanks.

rashid.ali · March 27, 2012, 11:54pm

Hi Brain,

Thank you for your interest in our products.

You may check the following documentation links for more details and code snippets about working with text in existing PDF documents.

Working with Text

Working with Text (Facades)

If you still face any problems, kindly share the sample source code and template documents you are using, or create a sample application that shows the issue. This will help us figure out the issue and reply to you sooner.

We apologize for the inconvenience.

Thanks & Regards,

bnodstr · March 28, 2012, 10:18am

Rashid,

Thanks for the reply. I’ve looked through the two links and the various examples. I don’t see how they would address my scenario.

I’ve attached an example file with test data. In the attached file I need to be able to redact the SAT, ACT, MCAT, and DAT data blocks. The blocks will vary in content and position as an application could have none, one, or many of the elements in question. Simple text extraction, pattern matching, or fixed position replacement isn’t sufficient in this case.

I was hoping I could match a fixed header string such as “Vrbl” in the SAT block, then get the position of the MCAT header and access and redact the text between those two positions. This approach may not be possible, so I’m open to any suggestions.

For fixed string values or even patterned data the redaction works great. I was considering using a wildcard pattern to get access to all the string fragments and then trying to figure out whether the one in question needed to be redacted, but this approach looks problematic.

Thanks for any assistance.

rashid.ali · March 29, 2012, 8:22am

Hi Brain,

Thank you for sharing the template document. As I understand it, the document contains SAT, ACT, MCAT, and DAT data blocks, and the number and position of these blocks can vary from one PDF document to the next. You want to replace the text "Vrbl" with, say, "Phy" when it belongs to the SAT block, and with, say, "Chm" when it belongs to the MCAT block. Kindly correct me if I am wrong. If I have understood correctly, then I am very sorry to say that there are currently no direct means available in Aspose.Pdf to fulfill your requirement.

Please feel free to contact support in case you need any further assistance.

Thanks & Regards,

bnodstr · March 29, 2012, 9:07am

Rashid,

You’ve almost got it. I’m wanting to find a known string like “Vrbl” and use its position to access and modify other string values in proximity. I thought it might be a long shot.

Thanks for looking into it.

rashid.ali · March 29, 2012, 12:28pm

Hi Brain,

Thanks for sharing the details. I have logged this request for further investigation with ID PDFNEWNET-33474 in our issue tracking system. Our development team is looking into this feature, and you will be updated on its status via this forum thread.

We apologize for the inconvenience.

Thanks & Regards,

bnodstr · March 29, 2012, 1:15pm

Rashid,

No worries and thanks for looking into it.

tilal.ahmad · August 31, 2015, 9:24am

Hi Brain,

Thanks for your patience. Please note that Aspose.Pdf does not provide a direct way to find and replace text based on position. However, each TextFragment has Position and Rectangle properties that let you map fragments on the page. You can also set the Rectangle property of the TextSearchOptions of a TextFragmentAbsorber to limit the zone in which text searching (or segment absorbing) is done.

You want to find a known string like “Vrbl” and use its position to access and modify other string values in proximity. There are at least two ways to do this using Aspose.Pdf:

First Way

It is possible to find a string like “Vrbl” and determine its position. Next, you should form a rectangle based on the position in the direction you need.
Next, you can perform a new search within the borders of the rectangle to find segments near the initial string.

In the following example, we search for the topmost “Vrbl” string on the first page and change the value of the first text fragment located directly below it:

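//required namespaces
using Aspose.Pdf;
using Aspose.Pdf.Text;

//open document (myDir is the folder that holds the sample PDF)
Document pdfDocument = new Document(myDir + "John_Doe_Profile.pdf");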

Page page = pdfDocument.Pages[1];

//create TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Vrbl");

//accept the absorber for the selected page only
page.Accept(textFragmentAbsorber);

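//get the extracted text fragments (Aspose collections are 1-based)
TextFragmentCollection initialCollection = textFragmentAbsorber.TextFragments;
if (initialCollection.Count == 0)
{
    throw new System.InvalidOperationException("'Vrbl' was not found on the page.");
}
TextFragment initFragment = initialCollection[1];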

//select top 'Vrbl' fragment
foreach (TextFragment textFragment in initialCollection)
{
    if (textFragment.Position.YIndent > initFragment.Position.YIndent)
    {
        initFragment = textFragment;
    }
}

//create rectangle below 'Vrbl' fragment
Aspose.Pdf.Rectangle rect = new Aspose.Pdf.Rectangle(
    initFragment.Rectangle.LLX,
    initFragment.Rectangle.LLY - initFragment.TextState.FontSize * 2,
    initFragment.Rectangle.URX,
    initFragment.Rectangle.LLY - 2
);

//recreate TextFragmentAbsorber
textFragmentAbsorber = new TextFragmentAbsorber();
TextSearchOptions options = new TextSearchOptions(rect);
textFragmentAbsorber.TextSearchOptions = options;

//accept the absorber for the selected page only, limited to the rectangle
page.Accept(textFragmentAbsorber);

//get the extracted text fragments
TextFragmentCollection newCollection = textFragmentAbsorber.TextFragments;
TextFragment fragmentBelow = null;

if (newCollection.Count > 0)
{
    fragmentBelow = newCollection[1];
}

//select the topmost fragment inside the rectangle, i.e. the one directly below 'Vrbl'
foreach (TextFragment fragment in newCollection)
{
    if (fragment.Position.YIndent > fragmentBelow.Position.YIndent)
    {
        fragmentBelow = fragment;
    }
}

//replace fragment text
if (fragmentBelow != null)
{
    fragmentBelow.Text = "777";
}

pdfDocument.Save(myDir + "33474_out.pdf");
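The same idea can be turned sideways for the “GPA:3.5” case from the original question, where the value sits to the right of a known label. The following is only a rough sketch under stated assumptions, not a verified solution: ProximitySketch, ZoneRightOf, and MaskRightOf are hypothetical names, the zone width and the mask string are values you would have to choose for your own documents, and how Aspose splits a run such as “GPA:3.5” into fragments inside the zone should be checked against your files.

```csharp
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

static class ProximitySketch
{
    // Hypothetical helper: a zone of the given width immediately to the right of
    // an anchor rectangle, on the same text line. Values are in page units.
    public static (double Llx, double Lly, double Urx, double Ury) ZoneRightOf(
        double anchorUrx, double anchorLly, double anchorUry, double width)
    {
        return (anchorUrx, anchorLly, anchorUrx + width, anchorUry);
    }

    // Replaces the text of every fragment found to the right of each anchor on the page.
    public static void MaskRightOf(Page page, string anchorText, double width, string mask)
    {
        // Step 1: find all anchors (for example "GPA") on this page.
        TextFragmentAbsorber anchorAbsorber = new TextFragmentAbsorber(anchorText);
        page.Accept(anchorAbsorber);

        // Step 2: record every search zone before changing any text, so that
        // replacements do not affect the positions we read from the anchors.
        List<Aspose.Pdf.Rectangle> zones = new List<Aspose.Pdf.Rectangle>();
        foreach (TextFragment anchor in anchorAbsorber.TextFragments)
        {
            var z = ZoneRightOf(anchor.Rectangle.URX, anchor.Rectangle.LLY,
                                anchor.Rectangle.URY, width);
            zones.Add(new Aspose.Pdf.Rectangle(z.Llx, z.Lly, z.Urx, z.Ury));
        }

        // Step 3: search each zone with an unrestricted absorber and mask what is found.
        foreach (Aspose.Pdf.Rectangle zone in zones)
        {
            TextFragmentAbsorber zoneAbsorber = new TextFragmentAbsorber();
            zoneAbsorber.TextSearchOptions = new TextSearchOptions(zone);
            page.Accept(zoneAbsorber);

            foreach (TextFragment neighbor in zoneAbsorber.TextFragments)
            {
                neighbor.Text = mask;
            }
        }
    }
}
```

A call might look like ProximitySketch.MaskRightOf(pdfDocument.Pages[1], "GPA", 60, "X.X"), followed by pdfDocument.Save(...). The width of 60 is an arbitrary illustration; if two anchors share a line, a replacement in the first zone could shift the text in the second, so the result is worth checking visually.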

Second Way

Try our new feature (TableAbsorber class) to find tables and table elements in an existing PDF document.
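As a starting point, a minimal sketch of walking the tables that TableAbsorber finds on the first page might look like the following. The class and method names TableSketch and DumpTables are hypothetical, and whether the score blocks in your file are actually detected as tables is an assumption you would need to verify.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

static class TableSketch
{
    // Prints the text of every cell of every table detected on page 1.
    public static void DumpTables(string inputPath)
    {
        Document pdfDocument = new Document(inputPath);

        // Scan the first page for table structures.
        TableAbsorber absorber = new TableAbsorber();
        absorber.Visit(pdfDocument.Pages[1]);

        foreach (AbsorbedTable table in absorber.TableList)
        {
            foreach (AbsorbedRow row in table.RowList)
            {
                foreach (AbsorbedCell cell in row.CellList)
                {
                    // Each cell exposes its content as ordinary text fragments,
                    // so fragment.Text can be read here or replaced as in the first way.
                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        Console.WriteLine(fragment.Text);
                    }
                }
            }
        }
    }
}
```

Once a header cell such as “Vrbl” is located this way, the cells in the same column of the following rows are the natural candidates for replacement.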

Please feel free to contact us for any further assistance.

Best Regards,

aspose.notifier · February 7, 2019, 4:46pm

The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan