Hi All,
I am using Aspose.PDF in my desktop application for text extraction, and I populate the extracted text into a word index (like an appendix listing the position of each word in the PDF).
The issue I am facing is with TextFragmentAbsorber: it fragments the PDF content in such a way that, while looping through the fragments, I see that the last word of some fragments is split across two fragments.
For example, the PDF text is: “Plaintiff GIL R. BOWER provides the following written responses, including objections, to the”
First fragment is “Plaintiff GIL R. BOWER provides the following writ”
Second fragment is “ten responses, including objections, to the”.
Here we can see that the word “written” is split, so there are two entries in the wordIndex, 1. “writ” and 2. “ten”, which is incorrect.
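To make the problem concrete, here is a simplified sketch of what happens when each fragment is split into words on its own. This is not my actual indexing code: `WordIndexSketch` and `IndexPerFragment` are made-up names for illustration, and it uses plain strings instead of Aspose objects.

```csharp
using System;
using System.Collections.Generic;

public static class WordIndexSketch
{
    // Hypothetical helper (illustration only): splits every fragment
    // on spaces independently and collects the resulting words.
    public static List<string> IndexPerFragment(IEnumerable<string> fragments)
    {
        var words = new List<string>();
        foreach (string fragment in fragments)
        {
            // A word cut at a fragment boundary ends up as two separate entries.
            words.AddRange(fragment.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        }
        return words;
    }
}
```

Calling `IndexPerFragment` with the two fragments above gives a list that contains “writ” and “ten” as separate words and never contains “written”, which matches what I see in my wordIndex.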
image.jpg (102.2 KB)
image.png (960 Bytes)
The first image shows all the fragments extracted from the PDF, with the issue highlighted.
The second image shows the wordIndex, where we see an “includ” entry. This entry is invalid, as there is a separate entry in the wordIndex for “ing”.
Code Snippet:
    public bool Aspose_GetBoundedSegment(int segmentIndex, int pageID, double left, double top, double right, double bottom, out int nSegCharStart, out int nSegCharsNum)
    {
        nSegCharStart = 0;
        nSegCharsNum = 0;

        // Check if the page exists in the dictionary
        if (!pageDictionary.TryGetValue(pageID, out var pageTuple))
        {
            Console.WriteLine("Page not found for the given page ID.");
            return false;
        }
        Aspose.Pdf.Page page = pageTuple.Item2;

        // Check if the TextFragmentAbsorber exists in the dictionary
        if (!textAbsorberDictionary.TryGetValue(pageID, out var textAbsorber))
        {
            Console.WriteLine("TextFragmentAbsorber not found for the given page ID.");
            return false;
        }

        TextFragmentCollection textFragments = textAbsorber.TextFragments;
        int segmentCount = 0;
        int charIndex = 0;

        // Iterate through the text fragments to find the bounded segment
        foreach (TextFragment fragment in textFragments)
        {
            foreach (TextSegment segment in fragment.Segments)
            {
                Aspose.Pdf.Rectangle segmentRect = segment.Rectangle;

                // Check if the segment is within the specified rectangle
                if (segmentRect.LLX >= left && segmentRect.URY <= top && segmentRect.URX <= right && segmentRect.LLY >= bottom)
                {
                    if (segmentCount == segmentIndex)
                    {
                        nSegCharStart = charIndex;
                        nSegCharsNum = segment.Text.Length;
                        return true;
                    }
                    segmentCount++;
                }
                charIndex += segment.Text.Length;
            }
        }

        // Ensure out parameters are assigned before returning false
        nSegCharStart = -1;
        nSegCharsNum = 0;
        return false;
    }
Program.7z (14.0 KB)
Any help would be greatly appreciated.
Thanks in advance.
Ramya.B