How to calculate text index from TextFragment item

Hi,

Is there any way of finding specific words by text index while searching?

Text index means the position of each character. E.g. in “Install a Windows system”, the text index of “Windows” runs from 10 to 16.

I have to implement a function that highlights one specific occurrence of a word on a single page (e.g. only the last ‘Windows’ in the second line). However, a page usually contains the same word many times, so every ‘Windows’ gets highlighted when I search the text of the PDF page.

More detailed example is as below:
(article text)
line 1: "Windows 95 Users"
line 2: "… within Windows 95 using their Internet Explorer …Install your Windows."
line 3: "The same general rules and limitations exist for Windows users…"
more lines…

As you can see, it contains several “Windows” words; I need to highlight only the “Windows” in line 2 that is marked in red.

I have the text index of the keyword before searching, but I’ve no idea how to make it work with the Aspose APIs. Do you have any suggestions?

Hi Doris,


Thanks for your inquiry. Please check the following sample code snippet, which replaces the text at a given index (occurrence); hopefully it will help you accomplish your requirements. Please also check the documentation link for more details on working with text using Aspose.Pdf for .NET.

using Aspose.Pdf;
using Aspose.Pdf.Text;

// open document
Document pdfDocument = new Document("input.pdf");

// create TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Windows");

// accept the absorber for a specific page only
pdfDocument.Pages[1].Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// pick one occurrence by its position in the collection (here index 3)
TextFragment textFragment = textFragmentCollection[3];

// update the required properties
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 22;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Red);

// save updated PDF
pdfDocument.Save("output.pdf");

Please feel free to contact us for any further assistance.

Best Regards,

I’m afraid it doesn’t work for me.

I cannot know in advance which index of TextFragments I should take, so TextFragment textFragment = textFragmentCollection[3] is not suitable in my case. What I do have before searching is the text index of “Windows” (e.g. from 113 to 120). The key point is how to make sure that the fragment I pick, such as TextFragmentCollection[3], is the one whose text index matches that range.

Therefore, it would be great if there were Aspose APIs that work with the text index.

Hi Doris,


Thanks for sharing more details. I’m afraid the requested functionality is not available at the moment. However, we’ve logged an enhancement request as PDFNEWNET-36022 in our issue tracking system for further investigation and resolution. We will keep you updated about the issue's progress via this forum thread.

Best Regards,

Is this functionality available in the latest version of Aspose.PDF?

I need to use the TextIndex of the TextFragment.

Hi Rupali,


Thanks for your inquiry. I am afraid the reported issue is still not resolved due to its low priority, as the product team is working on other high-priority tasks. However, I have raised the issue's priority and requested our product team to complete the investigation and share an ETA/solution at their earliest. We will update you as soon as we get feedback.

We are sorry for the inconvenience.

Best Regards,

The issues you have found earlier (filed as PDFNEWNET-36022) have been fixed in Aspose.Pdf for .NET 16.10.0.



Hi Rupali,


Thanks for your patience. We have investigated the issue and would like to suggest the following solution. Please take into account that one of the key features of the PDF document format is that it contains no definition of a “line”, so an “index in the line” is meaningless. PDF operates with text segments (show text operators) that are absolutely positioned anywhere on the page. Therefore we can operate only with the index of the fragment (e.g. a word) within the physical PDF text segment.

Furthermore, a PDF text segment usually represents one line of text, but sometimes it represents only part of a line. So the index of a fragment in the physical text segment will often also be its index in the line, but this is not guaranteed when you deal with third-party documents.

We have added two read-only properties, StartCharIndex and EndCharIndex, to the TextSegment class. They get the starting and ending character index of the current segment within the show text operator (Tj, TJ) segment. Please consider the following code:


TextFragmentAbsorber absorber = new TextFragmentAbsorber("Windows");
Document doc = new Document(inFile);
doc.Pages[1].Accept(absorber);

foreach (TextFragment textFragment in absorber.TextFragments)
{
    int position = textFragment.Segments[1].StartCharIndex;
    Console.WriteLine(String.Format("Starting position of '{0}' word is {1} in the text show operator.", textFragment.Text, position));
}
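For the original highlighting question, a rough sketch along the following lines might work: pick the fragment whose first segment starts at a character index you already know, and highlight only that one. The file names, the `wantedStart` value and the use of a highlight annotation are assumptions made for illustration. The index is relative to the show text operator, so it equals the index in the line only when the whole line is a single operator.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class HighlightByCharIndex
{
    static void Main()
    {
        // Assumed values: replace with your own file and known start index.
        const string inFile = "input.pdf";
        const int wantedStart = 10;

        Document doc = new Document(inFile);
        Page page = doc.Pages[1];

        // Find every occurrence of the word on the first page.
        TextFragmentAbsorber absorber = new TextFragmentAbsorber("Windows");
        page.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Skip occurrences that do not start at the expected index
            // within their show text operator.
            if (fragment.Segments[1].StartCharIndex != wantedStart)
                continue;

            // Highlight only the matching occurrence.
            HighlightAnnotation highlight = new HighlightAnnotation(page, fragment.Rectangle);
            highlight.Color = Color.Yellow;
            page.Annotations.Add(highlight);
        }

        doc.Save("output.pdf");
    }
}
```

Occurrences on different lines can share the same start index, so in practice you would probably also compare the fragment's vertical position (for example `fragment.Position.YIndent`) to narrow the match to one line.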


If you find a document where a 'line' consists of several show text operators, we recommend the following workflow to get the text index. If you face any issue, please share the problematic document; we will look into it and guide you accordingly.


1. absorb all text operator segments as fragments using TextFragmentAbsorber with no parameters;
2. iterate over the fragments found;
3. sum the lengths of all fragments with the same Y-indent, in ascending order of X-indent, until you reach the fragment whose text contains the (second) occurrence of the 'Windows' word;
4. add the StartCharIndex value to the sum.
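The steps above can be sketched without any PDF library, which may make the arithmetic easier to follow. In this sketch the `Piece` record, the `LineIndex.IndexInLine` helper and the Y tolerance are hypothetical names and values, not part of Aspose.Pdf. With the real API the X, Y and text values would presumably come from `TextFragment.Position.XIndent`, `TextFragment.Position.YIndent` and `TextFragment.Text`, and `IndexOf` here stands in for `StartCharIndex`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one absorbed fragment: its position and text.
public record Piece(double X, double Y, string Text);

public static class LineIndex
{
    // Returns the 0-based index of the n-th (1-based) occurrence of `word`
    // in the visual line at height `y`, or -1 if there is no such occurrence.
    public static int IndexInLine(IEnumerable<Piece> pieces, double y,
                                  string word, int occurrence, double tolerance = 0.5)
    {
        int offset = 0; // total length of the pieces already passed on this line
        int seen = 0;   // occurrences of `word` counted so far

        // Keep only pieces on the same line, ordered left to right.
        foreach (Piece p in pieces.Where(p => Math.Abs(p.Y - y) <= tolerance)
                                  .OrderBy(p => p.X))
        {
            int at = p.Text.IndexOf(word, StringComparison.Ordinal);
            while (at >= 0)
            {
                seen++;
                if (seen == occurrence)
                    return offset + at; // sum of earlier lengths + index in this piece
                at = p.Text.IndexOf(word, at + word.Length, StringComparison.Ordinal);
            }
            offset += p.Text.Length;
        }
        return -1;
    }
}

public static class Program
{
    public static void Main()
    {
        var pieces = new List<Piece>
        {
            new Piece(180, 700, "Install your Windows."),
            new Piece(72, 720, "Windows 95 Users"),   // different line, ignored
            new Piece(72, 700, "within Windows 95 "),
        };

        // Second "Windows" on the line at Y = 700.
        Console.WriteLine(LineIndex.IndexInLine(pieces, 700, "Windows", 2)); // prints 31
    }
}
```

Summing text lengths this way ignores any gap between operators that is produced by positioning rather than by space characters, so the result may differ from an index counted in text extracted by other means.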


Best Regards,

Hi Support,


This Forum is very useful for my requirement.

I want to insert blank text after a particular word using Aspose.PDF.

Please help with this task.

Hi there,


Thanks for your inquiry. You may search for the particular word and replace it with the same word plus some spaces. For example, replace "text" with "text ". Please check the following documentation link on replacing text in a PDF document. Hopefully it will help you accomplish the task.

Please feel free to contact us for any further assistance.

Best Regards,