How to calculate text index from TextFragment item

dyuen · November 6, 2013, 3:11am

Hi,

Is there any way of finding specific words by text index while searching?

By text index I mean the character position within the text. For example, in “Install a Windows system”, the text index of “Windows” runs from 10 to 16.
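To make the terminology concrete, the range in this example can be reproduced with plain string operations. This is only a minimal sketch: it assumes zero-based, inclusive indices (which is what matches the 10 to 16 range quoted above), and the `TextIndexDemo` class and `FindRange` helper are hypothetical names, not part of Aspose.Pdf.

```csharp
using System;

public static class TextIndexDemo
{
    // Hypothetical helper (not an Aspose.Pdf API).
    // Returns the zero-based, inclusive (Start, End) character range of the
    // n-th occurrence (1-based) of word in text, or null if it does not exist.
    public static (int Start, int End)? FindRange(string text, string word, int occurrence = 1)
    {
        int from = 0;
        for (int i = 1; i <= occurrence; i++)
        {
            int start = text.IndexOf(word, from, StringComparison.Ordinal);
            if (start < 0) return null;                 // fewer occurrences than requested
            if (i == occurrence) return (start, start + word.Length - 1);
            from = start + word.Length;                 // continue after this match
        }
        return null;                                    // occurrence was less than 1
    }

    public static void Main()
    {
        var range = FindRange("Install a Windows system", "Windows");
        Console.WriteLine(range); // prints (10, 16)
    }
}
```

Passing an occurrence number lets the same helper pick out a later match, for example the second “Windows” in a longer line.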

I have to implement a function that highlights a specific word on a single page (for example, the last “Windows” in the second line of the example below). However, a page often contains the same word many times, so when I search the PDF page for the text, every “Windows” is highlighted.

A more detailed example follows:
(article text)
line 1: "Windows 95 Users"
line 2: "… within Windows 95 using their Internet Explorer …Install your Windows."
line 3: "The same general rules and limitations exist for Windows users…"
more lines…

As you can see, the text contains several occurrences of “Windows”. I need to highlight only the “Windows” in line 2 that is marked in red.

I have the text index of the keyword before searching, but I have no idea how to make it work with the Aspose APIs. Do you have any suggestions?

tilal.ahmad · November 7, 2013, 1:32am

Hi Doris,

Thanks for your inquiry. Please check the following sample code snippet, which updates a particular instance of the text; hopefully, it will help you to accomplish your requirements. Please also check the documentation link for more details on working with text using Aspose.Pdf for .NET.

using Aspose.Pdf;
using Aspose.Pdf.Text;

// open the document
Document pdfDocument = new Document("input.pdf");

// create a TextFragmentAbsorber to find all instances of the search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Windows");

// accept the absorber for a specific page (pages are numbered from 1);
// to search every page instead, call pdfDocument.Pages.Accept(textFragmentAbsorber)
pdfDocument.Pages[1].Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// get the third occurrence of the text (the collection index is 1-based)
TextFragment textFragment = textFragmentCollection[3];

// update the required properties
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 22;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Red);

// save the updated PDF
pdfDocument.Save("output.pdf");

Please feel free to contact us for any further assistance.

Best Regards,

dyuen · November 7, 2013, 1:53am

I’m afraid it doesn’t work for me.

I do not know in advance which index of TextFragments I should take, so TextFragment textFragment = textFragmentCollection[3] is not suitable in my case. Instead, I have the text index of “Windows” (e.g. from 113 to 120) before searching. The key point is how to ensure that the text index of textFragmentCollection[3] equals the text index of “Windows” that I already have (from 113 to 120).

Therefore, it would be great if there were Aspose APIs that work with the text index.

tilal.ahmad · November 7, 2013, 11:42pm

Hi Doris,

Thanks for sharing more details. I’m afraid the requested functionality is not available at the moment. However, we’ve logged an enhancement request as PDFNEWNET-36022 in our issue tracking system for further investigation and resolution. We will keep you updated on its progress via this forum thread.

Best Regards,

desR · August 26, 2016, 8:27am

Is this functionality available in the latest version of Aspose.PDF?

I need to use the text index of the TextFragment.

tilal.ahmad · August 29, 2016, 1:42am

Hi Rupali,

Thanks for your inquiry. I am afraid the reported issue is still not resolved due to its low priority, as the product team is working on other, higher-priority tasks. However, I have raised the issue's priority and asked our product team to complete the investigation and share an ETA or solution at their earliest. We will update you as soon as we get feedback.

We are sorry for the inconvenience.

Best Regards,

aspose.notifier · October 6, 2016, 9:52am

The issues you have found earlier (filed as PDFNEWNET-36022) have been fixed in Aspose.Pdf for .NET 16.10.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

tilal.ahmad · October 26, 2016, 8:29pm

Hi Rupali,

Thanks for your patience. We have investigated the issue and would like to suggest the following solution. Please note that one of the key features of the PDF document format is that it contains no definition of a “line,” and accordingly, “index in the line” is meaningless. PDF operates with text segments (text show operators) that can be absolutely positioned anywhere on the page. Therefore, we can operate only with the index of the fragment (e.g., a word) in the physical PDF text segment.

Furthermore, a PDF text segment usually represents one line of text, but sometimes it represents only part of a line. The index of the fragment in the physical text segment will therefore often also be the index in the line, but this is not guaranteed when you deal with third-party documents.

We have added two read-only properties, StartCharIndex and EndCharIndex, to the TextSegment class. They return the starting and ending character index of the current segment within the show text operator (Tj, TJ) segment. Please consider the following code:

// inFile is the path to the source PDF
Document doc = new Document(inFile);

// search the first page for every occurrence of "Windows"
TextFragmentAbsorber absorber = new TextFragmentAbsorber("Windows");
doc.Pages[1].Accept(absorber);

foreach (TextFragment textFragment in absorber.TextFragments)
{
    // Segments is 1-based; StartCharIndex is relative to the containing show text operator
    int position = textFragment.Segments[1].StartCharIndex;
    Console.WriteLine("Starting position of '{0}' word is {1} in the text show operator.", textFragment.Text, position);
}

If you find a document where a ‘line’ consists of several show text operators, we recommend the following workflow to get the text index:

1. Absorb all text operator segments as fragments, using a TextFragmentAbsorber with no parameters.
2. Iterate through the fragments that were found.
3. Sum the lengths of all fragments with the same Y-indent, in ascending order of X-indent, up to the fragment whose text contains the (second) occurrence of the word ‘Windows’.
4. Add the StartCharIndex value to the sum.

If you face any issue, please share the problematic document. We will look into it and guide you accordingly.
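The arithmetic of the workflow above can be sketched in plain C#. This is an illustration under stated assumptions, not Aspose code: `Fragment` is a hypothetical stand-in for an absorbed text fragment (its text plus X and Y indents), `IndexInLine` and its `tolerance` parameter are invented for the example, and `String.IndexOf` plays the role that StartCharIndex would play for a real fragment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineIndexSketch
{
    // Hypothetical stand-in for one absorbed fragment; not an Aspose.Pdf type.
    public sealed class Fragment
    {
        public string Text { get; }
        public double X { get; }
        public double Y { get; }
        public Fragment(string text, double x, double y) { Text = text; X = x; Y = y; }
    }

    // Returns the zero-based index, within the line at yIndent, of the n-th
    // occurrence (1-based) of word, or -1 if the line has no such occurrence.
    public static int IndexInLine(IEnumerable<Fragment> fragments, double yIndent,
                                  string word, int occurrence, double tolerance = 0.5)
    {
        int sum = 0;   // total length of the fragments already passed on this line
        int seen = 0;  // occurrences of word counted so far

        // Same Y-indent (within an assumed tolerance), ascending X-indent.
        var line = fragments
            .Where(f => Math.Abs(f.Y - yIndent) <= tolerance)
            .OrderBy(f => f.X);

        foreach (var f in line)
        {
            // IndexOf stands in for StartCharIndex of the match inside this fragment.
            int start = f.Text.IndexOf(word, StringComparison.Ordinal);
            while (start >= 0)
            {
                seen++;
                if (seen == occurrence) return sum + start;
                start = f.Text.IndexOf(word, start + word.Length, StringComparison.Ordinal);
            }
            sum += f.Text.Length;
        }
        return -1;
    }

    public static void Main()
    {
        var fragments = new List<Fragment>
        {
            new Fragment("Install your Windows.", 200, 700), // right-hand part of line 2
            new Fragment("Windows 95 Users", 72, 720),       // line 1
            new Fragment("within Windows 95 ", 72, 700),     // left-hand part of line 2
        };
        Console.WriteLine(IndexInLine(fragments, 700, "Windows", 2)); // prints 31
    }
}
```

In the example, the first fragment on the line contributes 18 characters and the word starts at offset 13 in the second fragment, which gives 31. With real documents the Y-indents of fragments on one visual line may differ slightly, which is why the sketch compares them with a tolerance.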

Best Regards,

npulipati · November 21, 2016, 5:12am

Hi Support,

This forum thread is very useful for my requirement.

I want to add blank space after a particular word using Aspose.PDF.

Please help with this task.

tilal.ahmad · November 22, 2016, 2:07am

Hi there,

Thanks for your inquiry. You may search for the particular word and replace it with the same word plus some spaces. For example, replace “text” with “text ”. Please check the following documentation link on replacing text in a PDF document. Hopefully it will help you to accomplish the task.

Replace text in PDF documents.

Please feel free to contact us for any further assistance.

Best Regards,