Replace text in PDF using Java with Aspose.PDF - multi line text replacement is not properly aligned

Amaran · June 4, 2020, 1:45pm

Hi Team,

I am working with the Aspose.PDF for Java library, version 19.12.
The search term is: “through learning. We put the learner at the centre of everything we do, because wherever learning flourishes, so do people. Find out more about how we can help you and your learners”
The replacement term is: “Savvas, ALWAYS LEARNING, INVESTIGATIONS, and Savvas REALIZE are exclusive trademarks owned by Savvas Education, Inc. tested this content is updated properly”

After replacement, the text is not properly aligned and it runs past the page margin.
I have attached the source file and the file produced after replacement.

Code used:

license.setLicense(ReplaceTextAspose.class.getResourceAsStream("/Aspose.Total.Java.lic"));
final FileInputStream fis = new FileInputStream(filePath);
Document pdfDocument = new Document(fis);
textFragmentAbsorber = new TextFragmentAbsorber(searchString);
// Accept the absorber for all pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);
// Get the extracted text fragments into a collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
TextReplaceOptions textReplaceOptions = textFragmentAbsorber.getTextReplaceOptions();
textReplaceOptions.setReplaceAdjustmentAction(TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation);
// Replace the text of every matched fragment
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
    textFragment.setText(replacement);
}
pdfDocument.save(filePath);

after_replaced.pdf (225.1 KB)
source_file.pdf (191.7 KB)

Thanks.

asad.ali · June 4, 2020, 7:22pm

@Amaran

Would you kindly share how you assign a value to the variable searchString? We tried searching the attached PDF for the phrase, but the API did not return any results.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("through learning. We put the learner at the centre of everything we do, because wherever\n" +
                "learning flourishes, so do people. Find out more about how we can help you and your learners", new TextSearchOptions(false));
pdfDocument.getPages().accept(textFragmentAbsorber);

Amaran · June 5, 2020, 12:31pm

We convert the search string into a regex pattern so that it still matches when the text contains line breaks.

The regex pattern is:
(?<![/])through\slearning.\sWe\sput\sthe\slearner\sat\sthe\scentre\sof\severything\swe\sdo,\sbecause\swherever\slearning\sflourishes,\sso\sdo\speople.\sFind\sout\smore\sabout\show\swe\scan\shelp\syou\sand\syour\s*learners(?!\b.(com|co)\b)

I have attached a screenshot of my regex pattern. Kindly check it.
screen shot for regex.png (29.7 KB)

asad.ali · June 5, 2020, 7:28pm

@Amaran

Thanks for sharing the details.

We tried a similar regular expression in our environment but could not get any results. Would you kindly share a sample console application that performs the replacement the way you do, so that we can replicate the issue in our environment?

noresult.png (33.9 KB)

Amaran · June 9, 2020, 1:15pm

Could you please modify your regex pattern as shown in the attached screenshot?
screen shot for regex.png (29.7 KB)

asad.ali · June 9, 2020, 8:03pm

@Amaran

The regex pattern in the screenshot is not fully visible.

We have tested the scenario with Aspose.PDF for Java 20.5 as well as 19.12, using the complete code snippet below. However, no text was returned for the regular expression that you shared. As per our observation, the regular expression is fine but the API is not processing it correctly.

final FileInputStream fis = new FileInputStream(dataDir + "source_file.pdf");
Document pdfDocument = new Document(fis);
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?<![/])through\\slearning.\\sWe\\sput\\sthe\\slearner\\sat\\sthe\\scentre\\sof\\severything\\swe\\sdo,\\sbecause\\swherever\\slearning\\sflourishes,\\sso\\sdo\\speople.\\sFind\\sout\\smore\\sabout\\show\\swe\\scan\\shelp\\syou\\sand\\syour\\s*learners(?!\\b.(com|co)\\b)", new TextSearchOptions(true));
pdfDocument.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments into a collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
System.out.println("Count : " + textFragmentCollection.size());
for (TextFragment textFragment : textFragmentCollection) {
    // Update the text of each matched fragment
    textFragment.setText("Savvas, ALWAYS LEARNING, INVESTIGATIONS, and Savvas REALIZE are exclusive trademarks owned by Savvas Education, Inc. tested this content is updated properly");
}
pdfDocument.save(dataDir + "20.5.pdf");

Would you kindly share a sample console application along with the details of the JDK version that you are using? We will again try to replicate the issue in our environment and address it accordingly.
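
As a rough check of the observation that the expression itself is fine, the pattern can be run through plain java.util.regex, outside the PDF API. The sketch below is only an illustration: the class name PostedRegexCheck and the sample text (the search phrase with a line break after "wherever") are assumptions, and it says nothing about how Aspose.PDF evaluates the pattern or how the text is actually laid out in source_file.pdf.

```java
import java.util.regex.Pattern;

class PostedRegexCheck {
    // The pattern from the snippet above, split across lines for readability.
    static final String POSTED_REGEX =
        "(?<![/])through\\slearning.\\sWe\\sput\\sthe\\slearner\\sat\\sthe\\scentre\\sof"
        + "\\severything\\swe\\sdo,\\sbecause\\swherever\\slearning\\sflourishes,\\sso\\sdo"
        + "\\speople.\\sFind\\sout\\smore\\sabout\\show\\swe\\scan\\shelp\\syou\\sand\\syour"
        + "\\s*learners(?!\\b.(com|co)\\b)";

    // Assumed sample text: the search phrase with one line break after "wherever".
    static final String SAMPLE =
        "through learning. We put the learner at the centre of everything we do, because wherever\n"
        + "learning flourishes, so do people. Find out more about how we can help you and your learners";

    // Returns true if the posted pattern is found anywhere in the given text.
    static boolean found(String text) {
        return Pattern.compile(POSTED_REGEX).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println("Plain text with line break: " + found(SAMPLE));
        // The lookbehind rejects a match that directly follows '/'.
        System.out.println("Preceded by '/':            " + found("/" + SAMPLE));
        // The lookahead rejects a match that is followed by ".com".
        System.out.println("Followed by \".com\":         " + found(SAMPLE + ".com"));
    }
}
```

If the pattern matches here but the absorber still returns zero fragments, that would point toward the way the API processes the pattern or the page text rather than toward the expression.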

Amaran · June 17, 2020, 12:40pm

Each space in the search string is replaced by \s*.

We use this function to build the regex pattern:

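/**
 * Builds the regex search pattern from a plain search string.
 * Note: regex metacharacters in the words (for example '.') are not escaped.
 *
 * @param searchString the plain text to search for
 *
 * @return the search string with each space replaced by \s* and wrapped
 *         in the lookbehind and lookahead guards
 */
public static String getEscapeString(String searchString) {
    String[] split = searchString.split(" ");
    // String.join is static; join the words with \s* so line breaks can match
    String searchPattern = String.join("\\s*", split);
    return "(?<![\\/])" + searchPattern + "(?!\\b.(com|co)\\b)";
}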
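
The function above leaves characters such as '.' unescaped, so they match any character. A minimal sketch of a variant that escapes them first is shown here. The names SearchPatternSketch, escapeWord and buildPattern are hypothetical and not part of the code in this thread, and the result has only been reasoned about against java.util.regex, not against the Aspose.PDF absorber.

```java
import java.util.regex.Pattern;

class SearchPatternSketch {
    // Prefixes each regex metacharacter in a word with a backslash.
    static String escapeWord(String word) {
        return word.replaceAll("[\\\\.\\[\\]{}()*+?^$|]", "\\\\$0");
    }

    // Splits on whitespace, escapes each word, and joins the words with \s*
    // so that a line break between two words still matches.
    static String buildPattern(String searchString) {
        String[] words = searchString.trim().split("\\s+");
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                joined.append("\\s*");
            }
            joined.append(escapeWord(words[i]));
        }
        // Same lookbehind and lookahead guards as in the function above.
        return "(?<![/])" + joined + "(?!\\b.(com|co)\\b)";
    }

    public static void main(String[] args) {
        String pattern = buildPattern("so do people. Find out");
        System.out.println(pattern);
        // A line break between words is still accepted.
        System.out.println(Pattern.compile(pattern).matcher("so do\npeople. Find out").find());
        // The escaped '.' no longer matches an arbitrary character.
        System.out.println(Pattern.compile(pattern).matcher("so do peopleX Find out").find());
    }
}
```

Escaping with a backslash rather than with Pattern.quote is a deliberate choice in this sketch, since it avoids depending on \Q...\E support in whichever regex engine ends up evaluating the pattern.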

asad.ali · June 17, 2020, 6:06pm

@Amaran

We were able to replicate the issue in our environment using the code snippet that you provided, with Aspose.PDF for Java 20.5. Hence, an issue has been logged as PDFJAVA-39505 in our issue management system for correction. We will look further into its details and keep you posted on the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

Amaran · August 6, 2020, 6:40pm

Hi Asad,

Can you please prioritize this issue?
My client is waiting for a solution.

Thanks,
Amaran.

asad.ali · August 7, 2020, 3:44pm

@Amaran

The issue has been logged under free support, where issues are investigated and resolved on a first-come, first-served basis. We have recorded your concerns and will inform you as soon as we have a definite update regarding the fix. Please give us some time.

We are sorry for the inconvenience.