Hi,
Is there any way to find a specific word by its text index while searching?
By text index I mean the position of each character. E.g. in “Install a Windows system”, the text index of “Windows” is from 10 to 16.
I have to implement a function that highlights one specific word on a single page (e.g. the ‘Windows’ among the last words of the second line). A page usually contains the same word many times, though, so every ‘Windows’ is highlighted when I search the text of the PDF page.
A more detailed example is below:
(article text)
line 1: "Windows 95 Users"
line 2: "… within Windows 95 using their Internet Explorer …Install your Windows."
line 3: "The same general rules and limitations exist for Windows users…"
more lines…
As you can see, the text contains several “Windows” words; I need to highlight only the “Windows” in line 2 marked in red.
I know the text index of the keyword before searching, but I have no idea how to make it work with the Aspose APIs. Do you have any suggestions?
Hi Doris,
Thanks for your inquiry. Please check the following sample code snippet, which updates one chosen instance of the text; hopefully it will help you accomplish your requirements. Please also check the documentation for more details on working with text using Aspose.Pdf for .NET.
using Aspose.Pdf;
using Aspose.Pdf.Text;

// open document
Document pdfDocument = new Document("input.pdf");
// create a TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Windows");
// accept the absorber for a specific page (page numbers start at 1)
pdfDocument.Pages[1].Accept(textFragmentAbsorber);
// to search all pages instead, use this call in place of the one above:
// pdfDocument.Pages.Accept(textFragmentAbsorber);
// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// get the third occurrence of the text (the collection is indexed from 1)
TextFragment textFragment = textFragmentCollection[3];
// update the required properties
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 22;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Red);
// save updated PDF
pdfDocument.Save("output.pdf");
Please feel free to contact us for any further assistance.
Best Regards,
I’m afraid it doesn’t work for me.
I don’t know in advance which index of TextFragments to use, so TextFragment textFragment = textFragmentCollection[3] is not suitable in my case. What I do have before searching is the text index of “Windows” (e.g. from 113 to 120). The key point is how to make sure that the fragment I pick, e.g. textFragmentCollection[3], is the one located at that text index (from 113 to 120).
Therefore, it would be great if there were Aspose APIs that work with a text index.
Hi Doris,
Is this functionality available in the latest version of Aspose.PDF?
Hi Rupali,
The issues you have found earlier (filed as PDFNEWNET-36022) have been fixed in Aspose.Pdf for .NET 16.10.0.
Hi Rupali,
Thanks for your patience. We have investigated the issue and would like to suggest the following solution. Please note that one of the key features of the PDF format is that it has no definition of a “line”, so “index in the line” is meaningless. PDF operates with text segments (text show operators) that are positioned absolutely, anywhere on the page. Therefore, we can operate only with the index of the fragment (e.g., a word) within the physical PDF text segment.
In most cases a PDF text segment represents one line of text, but sometimes it represents only part of a line. So the index of the fragment in the physical text segment will often also be its index in the line, but this is not guaranteed when you deal with third-party documents.
We have added two read-only properties, StartCharIndex and EndCharIndex, to the TextSegment class. They get the starting/ending character index of the current segment within the show text operator (Tj, TJ) segment. Please consider the following code:
// path to the source PDF (placeholder value)
string inFile = "input.pdf";
TextFragmentAbsorber absorber = new TextFragmentAbsorber("Windows");
Document doc = new Document(inFile);
// search the first page only
doc.Pages[1].Accept(absorber);
foreach (TextFragment textFragment in absorber.TextFragments)
{
    // Segments is indexed from 1; read the start index of the first segment of the match
    int position = textFragment.Segments[1].StartCharIndex;
    Console.WriteLine(String.Format("Starting position of '{0}' word is {1} in the text show operator.", textFragment.Text, position));
}
If you find a document where a ‘line’ consists of several show text operators, we recommend the following workflow to get the text index. However, if you face any issue, please share the problematic document. We will look into it and guide you accordingly.
- Absorb all text operator segments as fragments using TextFragmentAbsorber with no parameters;
- Iterate through the fragments that were found;
- Sum the lengths of all fragments with the same Y-indent, in ascending order of X-indent, until you reach the fragment whose text contains the (second) occurrence of the ‘Windows’ word;
- Add the StartCharIndex value to the sum.
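The arithmetic of this workflow can be sketched without the library, as a rough illustration only. Everything named below is hypothetical and not part of Aspose.Pdf: Piece stands in for one absorbed fragment (its X, Y and Text would be read from the fragment's position and text), and String.IndexOf stands in for the StartCharIndex value, which is an assumption that holds only when each fragment is exactly one text show operator. The tolerance for "same Y-indent" is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one absorbed fragment (one text show operator).
// X and Y play the role of the fragment's X-indent and Y-indent on the page.
public record Piece(double X, double Y, string Text);

public static class LineIndexSketch
{
    // Returns the character index, within the visual line at height y, of the
    // occurrence-th (1-based) match of word, or -1 if there is no such match.
    public static int IndexInLine(IEnumerable<Piece> pieces, double y, string word,
                                  int occurrence, double tolerance = 0.5)
    {
        if (string.IsNullOrEmpty(word) || occurrence < 1)
            return -1;

        // Keep only the pieces on the requested line, ordered left to right.
        var line = pieces.Where(p => Math.Abs(p.Y - y) <= tolerance)
                         .OrderBy(p => p.X);

        int lengthBefore = 0; // summed length of the pieces already passed
        int seen = 0;         // matches counted so far on this line

        foreach (var piece in line)
        {
            // IndexOf here is the stand-in for StartCharIndex (assumption).
            int at = piece.Text.IndexOf(word, StringComparison.Ordinal);
            while (at >= 0)
            {
                seen++;
                if (seen == occurrence)
                    return lengthBefore + at; // sum of lengths + index inside the piece
                at = piece.Text.IndexOf(word, at + word.Length, StringComparison.Ordinal);
            }
            lengthBefore += piece.Text.Length;
        }
        return -1;
    }
}
```

For instance, with a line at Y = 680 made of the two pieces "within Windows 95 using " (X = 10) and "Install your Windows." (X = 150), IndexInLine(pieces, 680, "Windows", 2) gives 24 + 13 = 37: the length of the first piece plus the position of the word inside the second one.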
Best Regards,