Hi,
Is there any way of finding specific words by text index while searching?
Text index means the position of each character. For example, in “Install a Windows system”, the text index of “Windows” is from 10 to 16.
I have to implement a function that highlights one specific occurrence of a word on a single page (e.g. only the last ‘Windows’ in the second line of the example below). However, a page usually contains many occurrences of ‘Windows’, so all of them are highlighted when I search the text of the PDF page.
A more detailed example is below:
(article text)
line 1: "Windows 95 Users"
line 2: "… within Windows 95 using their Internet Explorer …Install your Windows."
line 3: "The same general rules and limitations exist for Windows users…"
more lines…
As you can see, it contains several occurrences of “Windows”; I need to highlight the “Windows” in line 2 that is marked in red.
I have the text index of the keyword before searching, but I have no idea how to make it work with the Aspose APIs. Do you have any suggestions?
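A minimal sketch of the text index as defined above, assuming zero-based, inclusive positions. `TextIndexDemo` and `FindRange` are hypothetical names used only for illustration; the code relies on the plain .NET string API and not on Aspose:

```csharp
using System;

public static class TextIndexDemo
{
    // Returns the zero-based, inclusive (start, end) character range of the
    // first occurrence of word in text, or (-1, -1) when there is none.
    public static (int Start, int End) FindRange(string text, string word)
    {
        if (string.IsNullOrEmpty(word)) return (-1, -1);
        int start = text.IndexOf(word, StringComparison.Ordinal);
        if (start < 0) return (-1, -1);
        return (start, start + word.Length - 1);
    }
}
```

For “Install a Windows system” and “Windows”, `FindRange` returns the range 10 to 16, matching the example above.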
Hi Doris,
using Aspose.Pdf;
using Aspose.Pdf.Text;

//open document
Document pdfDocument = new Document("input.pdf");
//create TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Windows");
//accept the absorber for a specific page (page 1 here)
pdfDocument.Pages[1].Accept(textFragmentAbsorber);
//or, to search all pages instead, use:
//pdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//get the required occurrence of the text (index 3 here)
TextFragment textFragment = textFragmentCollection[3];
//update the required properties
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 22;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Red);
//save updated PDF
pdfDocument.Save("output.pdf");
Please feel free to contact us for any further assistance.
Best Regards,
I’m afraid it doesn’t work for me.
I do not know in advance which index of TextFragments I should take, so TextFragment textFragment = textFragmentCollection[3] is not suitable in my case. What I have before searching is the text index of “Windows” (e.g. from 113 to 120); the key point is how to ensure that the text index of textFragmentCollection[3] equals that text index (from 113 to 120).
Therefore, it would be great if there were Aspose APIs that work with text index.
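One possible workaround, sketched under assumptions: if the known text index refers to the same page text that the search walks through, the index can be converted into an occurrence number (first, second, third ...), which could then be used to pick a fragment from the collection. `OccurrenceLocator` and `OrdinalAt` are hypothetical names, the code uses only the .NET string API, and whether the absorber returns fragments in the same order as the extracted text is an assumption that needs to be checked against the actual document:

```csharp
using System;

public static class OccurrenceLocator
{
    // Returns the 1-based ordinal of the occurrence of word that starts at
    // knownStart in pageText, or -1 if no occurrence starts at that index.
    public static int OrdinalAt(string pageText, string word, int knownStart)
    {
        if (string.IsNullOrEmpty(word)) return -1;
        int ordinal = 0;
        int pos = pageText.IndexOf(word, StringComparison.Ordinal);
        while (pos >= 0)
        {
            ordinal++;
            if (pos == knownStart) return ordinal;
            if (pos > knownStart) break; // passed the target index without a match
            // continue searching after the current (non-overlapping) occurrence
            pos = pageText.IndexOf(word, pos + word.Length, StringComparison.Ordinal);
        }
        return -1;
    }
}
```

The returned ordinal is 1-based so that it lines up with an index such as textFragmentCollection[3] in the reply above; that the collection is indexed this way is also an assumption to verify.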
Hi Doris,
Is this functionality available in the latest version of Aspose.PDF?
Hi Rupali,
The issues you have found earlier (filed as PDFNEWNET-36022) have been fixed in Aspose.Pdf for .NET 16.10.0.
This message was posted using Notification2Forum from Downloads module by Aspose Notifier.
Hi Rupali,
Thanks for your patience. We have investigated the issue and would like to suggest the following solution. Please note that one of the key features of the PDF document format is that it contains no definition of a “line”, so an “index in the line” is meaningless. PDF operates with text segments (text show operators) that are absolutely positioned anywhere on the page. Therefore, we can operate only with the index of a fragment (e.g. a word) in the PDF (physical) text segment.
Furthermore, a PDF text segment usually represents one line of text, but sometimes it represents only part of a line. Therefore, the index of a fragment in the physical text segment is often also its index in the line, but this is not guaranteed when you deal with third-party documents.
We have added two read-only properties, StartCharIndex and EndCharIndex, to the TextSegment class. They get the starting and ending character index of the current segment in the show text operator (Tj, TJ) segment. Please consider the following code:
TextFragmentAbsorber absorber = new TextFragmentAbsorber("Windows");
Document doc = new Document(inFile);
doc.Pages[1].Accept(absorber);
foreach (TextFragment textFragment in absorber.TextFragments)
{
    int position = textFragment.Segments[1].StartCharIndex;
    Console.WriteLine(String.Format("Starting position of '{0}' word is {1} in the text show operator.", textFragment.Text, position));
}
If you find a document where a 'line' consists of several show text operators, we recommend the workflow below to get the text index. If you face any issue with it, please share the problematic document; we will look into it and guide you accordingly.
1. Absorb all text operator segments as fragments using TextFragmentAbsorber with no parameters.
2. Iterate over the fragments that were found.
3. Sum the lengths of all fragments with the same Y-indent, in ascending order of X-indent, up to the fragment whose text contains the required (e.g. second) occurrence of the 'Windows' word.
4. Add the StartCharIndex value to the sum.
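The four steps above can be sketched roughly as follows. This is not Aspose code: `SegmentInfo` is a hypothetical stand-in for an absorbed fragment (its text plus X/Y position), `LineIndexCalculator` and the `yTolerance` parameter are likewise assumptions, and the sketch assumes that spaces between words are part of the fragment texts:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an absorbed text fragment: its text and page position.
public sealed class SegmentInfo
{
    public string Text { get; }
    public double X { get; }
    public double Y { get; }
    public SegmentInfo(string text, double x, double y) { Text = text; X = x; Y = y; }
}

public static class LineIndexCalculator
{
    // Index of a word within its visual line: the total length of the segments
    // that share the target's Y position and lie to its left, plus the word's
    // start index inside the target segment (the StartCharIndex value).
    public static int IndexInLine(IEnumerable<SegmentInfo> segments, SegmentInfo target,
                                  int startCharIndex, double yTolerance = 0.5)
    {
        int preceding = segments
            .Where(s => Math.Abs(s.Y - target.Y) <= yTolerance && s.X < target.X)
            .Sum(s => s.Text.Length);
        return preceding + startCharIndex;
    }
}
```

Because a sum does not depend on order, the sketch filters by X position instead of sorting. The tolerance on Y is there because fragments on the same visual line may not share exactly the same coordinate.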
Best Regards,
Hi Support,
Hi there,