Free Support Forum - aspose.com

Text substitution does not maintain the spacing

Hi,

I am trying to use Aspose.PDF for Java to find and replace the text.

Following is my code snippet.

Document pdfDocument = new Document(inputStream);
TextFragmentCollection textFragmentCollection = getTextFragmentCollection(pdfDocument);
for (TextFragment textFragment : textFragmentCollection) {
	textFragment.getTextState().setFont(FONT);
	textFragment.setText(clearText);
}
pdfDocument.save(outputStream);

But after replacement, the replaced text seems to be followed by trailing white space. I am running this on Ubuntu 12.04.
Original
Screen Shot 2017-08-06 at 8.29.07 AM.png (7.9 KB)
Modified
Screen Shot 2017-08-06 at 8.29.13 AM.png (3.5 KB)

I am attaching the following

  1. Original PDF
  2. Java Code
  3. Font used
  4. Output PDF

aspose_test.zip (221.7 KB)

Please let me know what could be wrong here.

@muthukrishnanm,
Please note that PDF is a fixed-layout format: text elements do not move to cover the blank area left after a text string is replaced. To achieve your goal, you can change the way the text is replaced. For example, in your scenario you can retrieve a complete line of text, modify that line as needed, and then redact that region of the page as described in this help topic: Redact certain page region with RedactionAnnotation

The OverlayText property of the RedactionAnnotation class lets you set the text shown in that region of the page. We hope this helps. Please let us know if anything is unclear or if you have further questions.

Best Regards,
Imran Rafique

Hi @imran.rafique, how do I get the rectangle dimensions dynamically when I redact? In my use case I can replace a given piece of text, but I cannot detect the white space it leaves behind. Can you please give me a code sample that reads the text line by line, changes it, and infers a rectangle dynamically for the white space?
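
One possible way to infer such a rectangle, assuming the bounding boxes of the text pieces on a line are already known, is to take the union of those boxes. The sketch below is plain Java and does not call Aspose.PDF; `Box` and `union` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class LineRectangle {
    /** Hypothetical box with lower-left (llx, lly) and upper-right (urx, ury) corners. Not an Aspose class. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /** Returns the smallest box that contains every box in the list. */
    static Box union(List<Box> boxes) {
        if (boxes.isEmpty()) {
            throw new IllegalArgumentException("at least one box is required");
        }
        double llx = Double.POSITIVE_INFINITY;
        double lly = Double.POSITIVE_INFINITY;
        double urx = Double.NEGATIVE_INFINITY;
        double ury = Double.NEGATIVE_INFINITY;
        for (Box b : boxes) {
            llx = Math.min(llx, b.llx);
            lly = Math.min(lly, b.lly);
            urx = Math.max(urx, b.urx);
            ury = Math.max(ury, b.ury);
        }
        return new Box(llx, lly, urx, ury);
    }

    public static void main(String[] args) {
        // Two made-up boxes on the same text line
        Box line = union(Arrays.asList(
                new Box(100, 700, 200, 712),
                new Box(250, 698, 400, 710)));
        System.out.println(line.llx + " " + line.lly + " " + line.urx + " " + line.ury);
    }
}
```

The resulting box could then serve as the region to redact, provided the input boxes really belong to one line.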

Hi @imran.rafique

I tried the option suggested by you. Below is my code.

Document pdfDocument = new Document(inputStream);
TextFragmentCollection textFragmentCollection = getTextFragmentCollection(pdfDocument);

for (TextFragment textFragment : textFragmentCollection) {
	textFragment.getTextState().setFont(FONT);
	RedactionAnnotation annot = new RedactionAnnotation(textFragment.getPage(), textFragment.getRectangle());
	annot.setOverlayText(clearText);
	annot.setRepeat(true);
	// annot.redact(); // If I uncomment this call, I get a NullPointerException
	textFragment.getPage().getAnnotations().add(annot);
}
pdfDocument.save(outputStream);

Now the PDF appears like this
Screen Shot 2017-08-07 at 8.59.09 AM.png (3.7 KB)
This is not what I intended.

When I call the redact() method, I get the following exception:

Exception in thread "main" java.lang.NullPointerException
	at com.aspose.pdf.AnnotationCollection.delete(Unknown Source)
	at com.aspose.pdf.Annotation.flatten(Unknown Source)
	at com.aspose.pdf.RedactionAnnotation.flatten(Unknown Source)
	at com.aspose.pdf.RedactionAnnotation.redact(Unknown Source)
	at com.shn.test.aspose.AsposeTester.substituteTokens(AsposeTester.java:65)
	at com.shn.test.aspose.AsposeTester.main(AsposeTester.java:104)

@muthukrishnanm,
We are testing your scenario in our environment and will get back to you soon.

Best Regards,
Imran Rafique

@muthukrishnanm,
Please try the following code example:

[Java]

Document pdfDocument = new Document(inputStream);
TextFragmentCollection textFragmentCollection = getTextFragmentCollection(pdfDocument);
for (TextFragment textFragment : textFragmentCollection) {
	textFragment.getTextState().setFont(FONT);
	for (int count = 1; count <= textFragment.getSegments().size(); count++)
	{
	    // Create a RedactionAnnotation for the page region covered by this segment
	    RedactionAnnotation annot = new RedactionAnnotation(pdfDocument.getPages().get_Item(1), textFragment.getSegments().get_Item(count).getRectangle());
	    annot.setFillColor(com.aspose.pdf.Color.getWhite());
	    annot.setColor(com.aspose.pdf.Color.getBlack());
	    // Do not repeat the overlay text across the redacted area
	    annot.setRepeat(false);
	    annot.setOverlayText(clearText);
	    // Add the annotation to the annotations collection of the first page
	    pdfDocument.getPages().get_Item(1).getAnnotations().add(annot);
	    // Flatten the annotation and redact the page contents
	    // (i.e. remove the text and images under the redaction annotation)
	    annot.redact();
	}
}
pdfDocument.save(outputStream);

Hi @imran.rafique

I tried the option you suggested. The output PDF is still not what we expect: it has blank space after the replacement. I also see that the replacement itself is not correct (see the trailing ‘D’ in the first replacement).

Screen Shot 2017-08-08 at 9.19.29 AM.png (2.0 KB)

The code that I used is also attached.

RedactTester.java.zip (1.2 KB)

Please let me know if I am missing anything here.

@muthukrishnanm,
Please extend your regular expression so that it retrieves the complete text of the target area. The current regular expression only retrieves strings like “9CA3510016EFCD82104F5000C39ADE9FB3AE0DD0”, so you can only modify that part of the page region. When you replace this string with “hello”, the rest of the region is left as a blank area. However, if your regular expression covers the complete text, for example “attr2: 9CA3510016EFCD82104F5000C39ADE9FB3AE0DD0, 0EDD3B7FE76D3109407B40EDBC4BBF8BC7281837, attr1: 1538E4D91116BCBC3466F12937CB6D74492A8112”, you can then apply a second regular expression to replace the formatted strings with the word “hello”. That way you replace the complete text and no blank area is left behind.
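
The second step (replacing the formatted strings inside the complete line) can be sketched in plain Java as follows. The pattern `[0-9A-F]{40}` is an assumption based on the sample strings quoted above, and `WholeLineReplace` and `replaceTokens` are hypothetical names; how the complete line is obtained from the PDF is outside this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeLineReplace {
    // Assumed token format: exactly 40 uppercase hexadecimal characters
    private static final Pattern TOKEN = Pattern.compile("[0-9A-F]{40}");

    /** Replaces every token in the complete line with clearText. */
    static String replaceTokens(String line, String clearText) {
        // quoteReplacement keeps '$' and '\' in clearText from being treated as group references
        return TOKEN.matcher(line).replaceAll(Matcher.quoteReplacement(clearText));
    }

    public static void main(String[] args) {
        String line = "attr2: 9CA3510016EFCD82104F5000C39ADE9FB3AE0DD0, "
                + "0EDD3B7FE76D3109407B40EDBC4BBF8BC7281837, "
                + "attr1: 1538E4D91116BCBC3466F12937CB6D74492A8112";
        System.out.println(replaceTokens(line, "hello"));
    }
}
```

This only produces the replacement text; the region it is written into still has to cover the whole line for the gaps to disappear.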

We hope this helps. Please let us know if anything is unclear or if you have further questions.

Best Regards,
Imran Rafique

Hi @imran.rafique

That is not possible in my case, because there will not always be static text (like attr1 or attr2) before the text my regex matches. The only option for me is to look for the 40-character alphanumeric strings. Can redaction not solve this completely? If not, can you please suggest a way to adjust the white space?

Thanks in advance
Muthu

@muthukrishnanm,
As a workaround, you can widen the rectangle of the text fragment so that it spans the whole line, and then perform the redact operation. Please try the following code:

[Java]

for (TextFragment textFragment : textFragmentCollection) 
{
	textFragment.getTextState().setFont(FONT);			
	Rectangle rect = textFragment.getSegments().get_Item(1).getRectangle();
	// Skip fragments that sit on the line handled in the previous iteration
	if (LLY == rect.getLLY() && URY == rect.getURY())
		continue;
	rect.setLLX(pdfDocument.getPages().get_Item(1).getPageInfo().getMargin().getLeft()/3);
	rect.setURX(pdfDocument.getPages().get_Item(1).getRect().getURX() - pdfDocument.getPages().get_Item(1).getPageInfo().getMargin().getRight()/3);
	LLY = rect.getLLY();
	URY = rect.getURY();
			
	// Retrieve the text of the complete line:
	// create a TextAbsorber restricted to the widened rectangle
	TextAbsorber absorber = new TextAbsorber();
	absorber.getTextSearchOptions().setLimitToPageBounds(true);
	absorber.getTextSearchOptions().setRectangle(rect);
	// Run the absorber on the first page
	pdfDocument.getPages().get_Item(1).accept(absorber);
	// Get the extracted text and replace the matched tokens
	String extractedText = absorber.getText();
	String actualString = extractedText.replaceAll("[0-9a-f]{64}+|[0-9A-F]{40}+", clearText);
			
	for (int count = 1; count <= textFragment.getSegments().size(); count++)
	{
	    // Create a RedactionAnnotation for the widened line rectangle
	    RedactionAnnotation annot = new RedactionAnnotation(pdfDocument.getPages().get_Item(1), rect);
	    annot.setFillColor(com.aspose.pdf.Color.getWhite());
	    // Do not repeat the overlay text across the redacted area
	    annot.setRepeat(false);
	    annot.setOverlayText(actualString);
	    // Add the annotation to the annotations collection of the first page
	    pdfDocument.getPages().get_Item(1).getAnnotations().add(annot);
	    // Flatten the annotation and redact the page contents
	    // (i.e. remove the text and images under the redaction annotation)
	    annot.redact();
	}
}
pdfDocument.save(outputStream);
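
The code above remembers the LLY and URY of the last line it handled so that several matches on one line do not each trigger their own redaction. A minimal plain-Java sketch of that bookkeeping, generalized to remember every line seen so far, is shown here. `Box` and `firstBoxPerLine` are hypothetical names, not Aspose types, and the rounding to 0.01 units is an assumption made to tolerate small floating-point differences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LineDeduplicator {
    /** Hypothetical stand-in for a fragment rectangle; only the vertical extent matters here. */
    static final class Box {
        final double lly, ury;

        Box(double lly, double ury) {
            this.lly = lly;
            this.ury = ury;
        }
    }

    /** Keeps the first box of every distinct line, where a line is identified by (lly, ury). */
    static List<Box> firstBoxPerLine(List<Box> boxes) {
        Set<String> seenLines = new HashSet<>();
        List<Box> result = new ArrayList<>();
        for (Box b : boxes) {
            // Round to 0.01 units so tiny coordinate differences do not split one line in two
            String key = Math.round(b.lly * 100) + ":" + Math.round(b.ury * 100);
            if (seenLines.add(key)) {
                result.add(b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Box> kept = firstBoxPerLine(Arrays.asList(
                new Box(700, 712), new Box(700, 712), new Box(680, 692)));
        System.out.println(kept.size() + " distinct lines");
    }
}
```

Tracking all lines in a set, rather than only the previous one, also covers fragments that are not returned in reading order.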

Please let us know in case of any confusion or questions.

Best Regards,
Imran Rafique

Thank you @imran.rafique. I still see white space in the replaced document, and the alignment of the PDF is lost as well.

Original
Screen Shot 2017-08-09 at 9.12.30 AM.png (15.0 KB)

Modified
Screen Shot 2017-08-09 at 9.12.40 AM.png (9.9 KB)

Notice that the date and time have moved to the left.

@muthukrishnanm,
Please manipulate the retrieved text string as below:

[Java]

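// Get the extracted text
String extractedText = absorber.getText().trim();
// Replace the tokens, then split the line into columns separated by two or more spaces
String[] str = extractedText.replaceAll("[0-9a-f]{64}+|[0-9A-F]{40}+", clearText).split("  +");
// Join every column except the last one
String actualString = "";
for (int i = 0; i < str.length - 1; i++)
    actualString = actualString + str[i];
// Right-align the last column so the line keeps its original character count
int numberWhitespaces = extractedText.length() - actualString.length();
actualString = actualString + String.format("%1$" + numberWhitespaces + "s", str[str.length - 1]);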
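
The padding idea in this snippet can be tried without a PDF. The following is a plain-Java variant, not the exact code above: it keeps everything before the last gap unchanged, and it guards against the case where no padding width is left (a zero width makes `String.format` throw). `PaddedLine` and `rebuild` are hypothetical names. Equal character counts only approximate equal widths when the font is proportional, so the result is best treated as a rough alignment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PaddedLine {
    // Text before the last gap, a gap of two or more spaces, then the last column
    private static final Pattern LAST_GAP = Pattern.compile("(.*\\S) {2,}(\\S.*)");

    /** Replaces tokens and right-aligns the last column to the original character count. */
    static String rebuild(String extractedText, String tokenRegex, String clearText) {
        String original = extractedText.trim();
        String replaced = original.replaceAll(tokenRegex, Matcher.quoteReplacement(clearText));
        Matcher m = LAST_GAP.matcher(replaced);
        if (!m.matches()) {
            return replaced; // no column gap found, nothing to align
        }
        String left = m.group(1);
        String last = m.group(2);
        int width = original.length() - left.length();
        if (width <= last.length()) {
            return left + " " + last; // replacement made the line longer; keep one separator
        }
        // Left-pad the last column with spaces so the total length matches the original
        return left + String.format("%" + width + "s", last);
    }

    public static void main(String[] args) {
        String line = "id: 9CA3510016EFCD82104F5000C39ADE9FB3AE0DD0    2017-08-06";
        System.out.println("[" + rebuild(line, "[0-9A-F]{40}", "hello") + "]");
    }
}
```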

Best Regards,
Imran Rafique