Highlight has large gaps between words

Hi Aspose team,

We are using the PDF library, version 20.3. Here is the code we use to highlight PDF files:
image.png (90.3 KB)

We find that, in some files, the highlight has gaps between the words. Would you please have a look?
image.png (20.5 KB)

result.pdf (200.1 KB)
original.pdf (174.9 KB)

FYI, highlighting the same text with Adobe does not have this problem.
image.png (47.4 KB)

@Glority_Developer

We tested the scenario with Aspose.PDF for .NET 21.5 using the code snippet below and did not notice the issue. For your reference, the attached PDF document was generated with this code:

Document doc2 = new Document(dataDir + "original.pdf");
// Search with a regular expression (TextSearchOptions(true)); \s+ lets the phrase match across line breaks
TextFragmentAbsorber tfa = new TextFragmentAbsorber(@"The\s+information\s+contained\s+herein\s+is\s+believed\s+to\s+be\s+accurate\s+as\s+of\s+the\s+date\s+of\s+publication,\s+however,\s+none\s+of\s+the\s+Blu-ray\s+Disc\s+Association,\s+its\s+Members", new TextSearchOptions(true));
doc2.Pages.Accept(tfa);
// Add one highlight over the rectangle of the first match (Aspose.PDF collections are 1-based)
HighlightAnnotation ha = new HighlightAnnotation(doc2.Pages[2], tfa.TextFragments[1].Rectangle);
ha.Color = Color.Yellow;
doc2.Pages[2].Annotations.Add(ha);
doc2.Save(dataDir + "PDF_Highlighting_2.pdf");

PDF_Highlighting_2.pdf (176.0 KB)

Could you please try the latest version of the API and let us know if you still face the issue?
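The search pattern in the snippet above joins the words of the sentence with \s+, so it still matches when the sentence wraps across lines or contains repeated spaces. A minimal sketch of building such a pattern from a plain phrase follows. PatternBuilder and FromPhrase are hypothetical names, not part of Aspose.PDF, and the sketch assumes that the absorber's regular-expression mode accepts ordinary .NET regex syntax.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not an Aspose.PDF API.
public static class PatternBuilder
{
    // Splits the phrase on any whitespace, escapes regex metacharacters in
    // each word, and joins the words with \s+ so that line breaks and
    // repeated spaces between words still match.
    public static string FromPhrase(string phrase)
    {
        var words = phrase.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(@"\s+", words.Select(Regex.Escape));
    }
}
```

The returned string could then be passed as the first argument of TextFragmentAbsorber together with new TextSearchOptions(true), as in the snippet above.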

var rect = new Rectangle(leftTopPoint.X, leftTopPoint.Y, rightBottomPoint.X, rightBottomPoint.Y);
var textFragmentAbsorber = new TextFragmentAbsorber();
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(rect);
textFragmentAbsorber.Visit(page);
var textList = textFragmentAbsorber.TextFragments.Where(text => !string.IsNullOrEmpty(text.Text)).ToList();

We use the above code to split the text inside the rectangle. It comes from the documentation article "Using Text Annotation for PDF | Aspose.PDF for .NET". The problem is that, for some text, this method puts each letter in its own region. With some highlight colors the region borders are clearly visible, and the overlapping borders look very ugly, as in the picture below:
06B645FE-FBD0-4366-B53F-BD215DA22CD4.png (77.6 KB)

It seems that you have not read our previous description. You have already seen the text, so you can write it directly into the code, but that approach does not work for us. We only know which area of the page has been selected, and we need to split the text or paragraphs inside that area. Alternatively, can you tell us how to avoid the border on the highlight? It is ugly. Thank you.
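A side note on the rectangle built from leftTopPoint and rightBottomPoint above. PDF user space normally has its origin at the bottom left, and the four-argument Rectangle constructor is understood here to take the lower-left corner first and the upper-right corner second (an assumption worth checking against the API reference). If the two points really are a top-left and a bottom-right corner in page coordinates, passing them in that order would produce an inverted rectangle. The post does not say how the points were obtained, so this may not apply. The sketch below only shows how two opposite corners can be put into lower-left/upper-right order; CornerHelper and Normalize are hypothetical names.

```csharp
using System;

// Hypothetical helper for illustration; not an Aspose.PDF API.
public static class CornerHelper
{
    // Returns (llx, lly, urx, ury) for any two opposite corners,
    // regardless of the order in which they are supplied.
    public static (double Llx, double Lly, double Urx, double Ury) Normalize(
        double x1, double y1, double x2, double y2)
    {
        return (Math.Min(x1, x2), Math.Min(y1, y2),
                Math.Max(x1, x2), Math.Max(y1, y2));
    }
}
```

The four returned values could then be passed to the Rectangle constructor in that order.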

@Glority_Developer

We had, of course, read your first post carefully, where you shared a screenshot of the code snippet from the documentation. In that snippet, the text is searched with the same approach that we used in our suggested code.

Furthermore, please note that the documentation example extracts every segment inside a text fragment and adds a highlight annotation around each one. Text can be structured differently from one PDF document to another: sometimes a text fragment (a part of the text) holds a single character of a complete line, whereas in other cases it spans multiple lines.

Please do not add a highlight annotation around every segment of the searched text. Instead, use the obtained text fragment and add one annotation using its Rectangle property. Furthermore, please share the values of the leftTopPoint.X, leftTopPoint.Y, rightBottomPoint.X and rightBottomPoint.Y variables so that we can test the scenario in our environment and address the case accordingly.

The key to the problem is how to obtain the text fragment correctly, without fragmentation. Do you have a solution? Otherwise, please tell me how to eliminate the border of the highlight.

@Glority_Developer

To obtain the text fragment correctly, you can either search with a regular expression, as in the code snippet from our previous response, or extract text using a rectangle, which we believe you are already doing, as you mentioned in one of your replies.

Either way, you will have a text fragment. It is then up to you whether you add the highlight annotation around the whole text fragment or around each of its segments. For example, in the code below, we obtained a multi-line text from your sample PDF and added a highlight annotation around every segment of the obtained text fragment:

Document doc2 = new Document(dataDir + "original.pdf");
TextFragmentAbsorber tfa = new TextFragmentAbsorber(@"The\s+information\s+contained\s+herein\s+is\s+believed\s+to\s+be\s+accurate\s+as\s+of\s+the\s+date\s+of\s+publication,\s+however,\s+none\s+of\s+the\s+Blu-ray\s+Disc\s+Association,\s+its\s+Members", new TextSearchOptions(true));
doc2.Pages.Accept(tfa);

foreach (var textsegment in tfa.TextFragments[1].Segments)
{
    HighlightAnnotation ha = new HighlightAnnotation(doc2.Pages[2], textsegment.Rectangle);
    ha.Color = Color.Yellow;
    doc2.Pages[2].Annotations.Add(ha);
}

doc2.Save(dataDir + "PDF_Highlighting_2.pdf");

PDF_Highlighting_2.pdf (270.1 KB)

Please check the attached PDF; you will not notice any border in it either. It was generated using version 21.5 of the API. Also, as requested earlier, please do not use the code snippet from the documentation article, as it is meant for a particular scenario where you want to add an annotation around every character. Instead, please use the code snippet that we suggested in this forum thread.

Furthermore, if you face any issue while using our suggested code, please feel free to let us know. If we have still misunderstood your requirements, we apologize in advance and request that you share an expected output PDF along with the original code snippet you have implemented for adding a highlight annotation, so that we can proceed accordingly.

We use the same code as yours; only the color is different. As you can see in the pictures, there is a thin line between the characters, caused by the division of the text. What solutions do you have for this? The line is invisible when using yellow.
6FF8200C-ECEE-4779-8CBD-5A450F6BCDCF.png (75.6 KB)
10778ABC-5386-4346-AB4E-216437F7A7ED.png (132.2 KB)
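One possible workaround for the thin lines, assuming they appear where neighbouring per-character rectangles meet, is to merge the rectangles that sit on the same text line into one bounding box and add a single highlight per line. The sketch below shows only the geometry. Box, LineMerger, MergeByLine and the 0.5 overlap threshold are hypothetical and not part of Aspose.PDF, and the approach has not been verified against the sample file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a PDF rectangle: lower-left and upper-right
// corners in PDF user space (Y grows upward). Not an Aspose.PDF type.
public readonly struct Box
{
    public Box(double llx, double lly, double urx, double ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }
    public double Llx { get; }
    public double Lly { get; }
    public double Urx { get; }
    public double Ury { get; }
}

public static class LineMerger
{
    // Groups boxes whose vertical ranges overlap enough to count as the same
    // text line and returns one bounding box per line, top line first.
    public static List<Box> MergeByLine(IEnumerable<Box> boxes, double minOverlapRatio = 0.5)
    {
        var lines = new List<Box>();
        foreach (var b in boxes.OrderByDescending(x => x.Ury).ThenBy(x => x.Llx))
        {
            int i = lines.FindIndex(l => SameLine(l, b, minOverlapRatio));
            if (i < 0) lines.Add(b);               // first box of a new line
            else lines[i] = Union(lines[i], b);    // grow the existing line box
        }
        return lines;
    }

    // Two boxes share a line when their vertical overlap covers at least
    // minOverlapRatio of the shorter box's height.
    private static bool SameLine(Box a, Box b, double minOverlapRatio)
    {
        double overlap = Math.Min(a.Ury, b.Ury) - Math.Max(a.Lly, b.Lly);
        double shorter = Math.Min(a.Ury - a.Lly, b.Ury - b.Lly);
        return shorter > 0 && overlap / shorter >= minOverlapRatio;
    }

    // Smallest box that contains both inputs.
    private static Box Union(Box a, Box b)
    {
        return new Box(Math.Min(a.Llx, b.Llx), Math.Min(a.Lly, b.Lly),
                       Math.Max(a.Urx, b.Urx), Math.Max(a.Ury, b.Ury));
    }
}
```

Each input Box would be filled from a segment's Rectangle (its LLX, LLY, URX and URY values), and each merged Box turned back into a Rectangle for one HighlightAnnotation, in place of the per-segment loop shown earlier in this thread. Note that merging also covers the gaps between words on a line, which may or may not be wanted.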

@Glority_Developer

We tried the code below to remove the overlapping borders, but it did not help.

Document doc2 = new Document(dataDir + "original.pdf");
TextFragmentAbsorber tfa = new TextFragmentAbsorber(@"The\s+information\s+contained\s+herein\s+is\s+believed\s+to\s+be\s+accurate\s+as\s+of\s+the\s+date\s+of\s+publication,\s+however,\s+none\s+of\s+the\s+Blu-ray\s+Disc\s+Association,\s+its\s+Members", new TextSearchOptions(true));
doc2.Pages.Accept(tfa);
foreach (var textsegment in tfa.TextFragments[1].Segments)
{
    HighlightAnnotation ha = new HighlightAnnotation(doc2.Pages[2], textsegment.Rectangle);
    ha.Color = Color.FromArgb(255, 196, 212, 167);
    // Below code did not remove the border effect
    ha.Border = new Border(ha);
    ha.Border.Style = BorderStyle.Inset;
    ha.Border.Width = 0;
    ha.Border.Effect = BorderEffect.None;
    doc2.Pages[2].Annotations.Add(ha);
}

doc2.Save(dataDir + "PDF_Highlighting_2.pdf");

Furthermore, we noticed that the annotation does not match the selected text (it extends to the end of the second line) when using the code below:

Document doc2 = new Document(dataDir + "original.pdf");
TextFragmentAbsorber tfa = new TextFragmentAbsorber(@"The\s+information\s+contained\s+herein\s+is\s+believed\s+to\s+be\s+accurate\s+as\s+of\s+the\s+date\s+of\s+publication,\s+however,\s+none\s+of\s+the\s+Blu-ray\s+Disc\s+Association,\s+its\s+Members", new TextSearchOptions(true));
doc2.Pages.Accept(tfa);
HighlightAnnotation ha = new HighlightAnnotation(doc2.Pages[2], tfa.TextFragments[1].Rectangle);
ha.Color = Color.FromArgb(255, 196, 212, 167);
doc2.Pages[2].Annotations.Add(ha);

Hence, we have logged an issue as PDFNET-50051 in our issue tracking system for the sake of further investigation and rectification. We will look into its details and let you know once the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

Thanks. We also tried setting the width of the border to 0, but it did not help. Good luck with the investigation.