Aspose.PDF TextFragmentAbsorber function issue

Hi All,

I am using Aspose.PDF in my desktop application for text extraction, and I populate the extracted text into a word index (like an appendix listing the position of each word in the PDF).

The issue I am facing is with TextFragmentAbsorber, which fragments the PDF content in such a way that, while looping through the fragments, I see the last word of some fragments getting split.

For example, the PDF text is: “Plaintiff GIL R. BOWER provides the following written responses, including objections, to the”

First fragment is “Plaintiff GIL R. BOWER provides the following writ”
Second fragment is “ten responses, including objections, to the”.

Here we can see that the word “written” is split, so there are two entries in wordIndex, 1. “writ” and 2. “ten”, which is incorrect.

image.jpg (102.2 KB)

image.png (960 Bytes)

The first image shows all the fragments extracted from the PDF; the highlight marks the issue.

Second image shows the wordIndex where we see the “includ” entry, which is invalid as there is another entry in wordIndex for “ing”.

Code Snippet:

public bool Aspose_GetBoundedSegment(int segmentIndex, int pageID, double left, double top, double right, double bottom, out int nSegCharStart, out int nSegCharsNum)
{
    nSegCharStart = 0;
    nSegCharsNum = 0;

    // Check if the page exists in the dictionary
    if (!pageDictionary.TryGetValue(pageID, out var pageTuple))
    {
        Console.WriteLine("Page not found for the given page ID.");
        return false;
    }
    Aspose.Pdf.Page page = pageTuple.Item2;

    // Check if the TextFragmentAbsorber exists in the dictionary
    if (!textAbsorberDictionary.TryGetValue(pageID, out var textAbsorber))
    {
        Console.WriteLine("TextFragmentAbsorber not found for the given page ID.");
        return false;
    }
    TextFragmentCollection textFragments = textAbsorber.TextFragments;
    int segmentCount = 0;
    int charIndex = 0;

    // Iterate through the text fragments to find the bounded segment
    foreach (TextFragment fragment in textFragments)
    {
        foreach (TextSegment segment in fragment.Segments)
        {
            Aspose.Pdf.Rectangle segmentRect = segment.Rectangle;
            // Check if the segment is within the specified rectangle
            if (segmentRect.LLX >= left && segmentRect.URY <= top && segmentRect.URX <= right && segmentRect.LLY >= bottom)
            {
                if (segmentCount == segmentIndex)
                {
                    nSegCharStart = charIndex;
                    nSegCharsNum = segment.Text.Length;
                    return true;
                }
                segmentCount++;
            }
            charIndex += segment.Text.Length;
        }
    }
    // Ensure out parameters are assigned before returning false
    nSegCharStart = -1;
    nSegCharsNum = 0;
    return false;
}

Program.7z (14.0 KB)

Any help would be greatly appreciated.

Thanks in advance.

Ramya.B

@Ramya_kalicharan

It seems you are encountering an issue with the TextFragmentAbsorber in Aspose.PDF, where the text fragments are being split incorrectly, leading to invalid entries in your word index. This can happen due to how the PDF content is structured and how the text segments are extracted.

To address this issue, you can consider the following approaches:

  1. Post-Processing of Fragments: After extracting the text fragments, you can implement a post-processing step to check for split words. For example, if a fragment ends with a substring that is less than a certain length (like “writ”), you can check the next fragment to see if it starts with the continuation of that word (like “ten”). If so, you can merge them into a single entry in your word index.

  2. Adjusting Text Extraction Logic: You might want to refine your logic for how you are iterating through the text segments. Ensure that you are checking the end of one segment and the beginning of the next segment to handle cases where words are split across segments.

Here is a simplified example of how you might implement the merging logic:

List<string> wordIndex = new List<string>();
string previousFragment = string.Empty;

foreach (TextFragment fragment in textFragments)
{
    foreach (TextSegment segment in fragment.Segments)
    {
        string currentText = segment.Text;

        // Check if the previous fragment ends with a split word
        // (the "writ"/"ten" pair is hard-coded here purely for illustration)
        if (previousFragment.Length > 0 && previousFragment.EndsWith("writ") && currentText.StartsWith("ten"))
        {
            // Merge the fragments into the last entry
            string mergedText = previousFragment + currentText;
            wordIndex[wordIndex.Count - 1] = mergedText;
            previousFragment = mergedText; // Carry the merged text forward
        }
        else
        {
            wordIndex.Add(currentText); // Add new entry
            previousFragment = currentText; // Update previous fragment
        }
    }
}

  3. Using Regular Expressions: If your text extraction allows for it, you could use regular expressions to identify and correct split words based on common patterns.

  4. Reviewing Documentation and Examples: It may also be beneficial to review the official documentation and examples provided by Aspose for handling text extraction. This can provide insights into best practices and potential configurations that might help mitigate the issue.

If you continue to face challenges, consider sharing a minimal reproducible example of your code along with a sample PDF that exhibits the problem. This can help others in the community provide more targeted assistance.

Feel free to reach out if you have further questions or need additional help!

  1. Post-Processing of Fragments: After extracting the text fragments, you can implement a post-processing step to check for split words. For example, if a fragment ends with a substring that is less than a certain length (like “writ”), you can check the next fragment to see if it starts with the continuation of that word (like “ten”). If so, you can merge them into a single entry in your word index.

Reply: Identifying that a word has been split is itself the problem. We can’t compare against the original content here.

  2. Adjusting Text Extraction Logic: You might want to refine your logic for how you are iterating through the text segments. Ensure that you are checking the end of one segment and the beginning of the next segment to handle cases where words are split across segments.

Reply: As I mentioned above, identifying the word split is the problem.

  3. Using Regular Expressions: If your text extraction allows for it, you could use regular expressions to identify and correct split words based on common patterns.

Reply: I have tried this, but it didn’t work.
Code snippet:
public bool Aspose_GetBoundedSegment1(int segmentIndex, int pageID, double left, double top, double right, double bottom, out int nSegCharStart, out int nSegCharsNum)
{
    nSegCharStart = 0;
    nSegCharsNum = 0;

    // Check if the page exists in the dictionary
    if (!pageDictionary.ContainsKey(pageID))
    {
        return false;
    }

    // Get the page from the dictionary
    var page = pageDictionary[pageID].Item2;
    if (page == null)
    {
        return false;
    }

    // Create TextAbsorber object to extract text
    TextAbsorber textAbsorber = new TextAbsorber();

    // Accept the absorber for the current page
    page.Accept(textAbsorber);

    // Get the extracted text
    string extractedText = textAbsorber.Text;

    // Split the text into lines
    string[] lines = extractedText.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

    // Iterate through the lines to find the bounded segment
    int currentCharIndex = 0;
    foreach (string line in lines)
    {
        if (string.IsNullOrEmpty(line))
        {
            currentCharIndex += 1; // Account for the newline character
            continue;
        }
        // Create a TextFragmentAbsorber that searches the page for this line's text
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(line);

        page.Accept(textFragmentAbsorber);
        TextFragmentCollection textFragments = textFragmentAbsorber.TextFragments;

        foreach (TextFragment textFragment in textFragments)
        {
            Aspose.Pdf.Rectangle rect = textFragment.Rectangle;
            if (rect.LLX >= left && rect.LLY >= bottom && rect.URX <= right && rect.URY <= top)
            {
                if (segmentIndex == 0)
                {
                    nSegCharStart = currentCharIndex;
                    nSegCharsNum = line.Length;
                    return true;
                }
                segmentIndex--;
            }
        }
        currentCharIndex += line.Length + 1; // +1 for the newline character
    }
    return false;
}

  4. Reviewing Documentation and Examples: It may also be beneficial to review the official documentation and examples provided by Aspose for handling text extraction. This can provide insights into best practices and potential configurations that might help mitigate the issue.

Reply: I have checked all the posts and documentation I could find and still haven’t had a breakthrough on this.

You can just call the function below instead of using the complete code shared earlier.

public bool Aspose_GetBoundedSegment(int segmentIndex, int pageID, double left, double top, double right, double bottom, out int nSegCharStart, out int nSegCharsNum)
{
    nSegCharStart = -1;
    nSegCharsNum = 0;

    var page = //initialize with a PDF page 
    var textAbsorber = new TextFragmentAbsorber();
    page.Accept(textAbsorber);

    var textFragments = textAbsorber.TextFragments;
    List<TextFragment> mergedFragments = MergeTextFragments(textFragments);

    int currentSegmentIndex = 0;
    int charIndex = 0; // Running character offset across all segments

    foreach (TextFragment fragment in mergedFragments)
    {
        foreach (TextSegment segment in fragment.Segments)
        {
            var segmentBBox = segment.Rectangle;
            if (segmentBBox.LLX >= left && segmentBBox.LLY >= bottom && segmentBBox.URX <= right && segmentBBox.URY <= top)
            {
                if (currentSegmentIndex == segmentIndex)
                {
                    nSegCharStart = charIndex; // Start offset, not the segment length
                    nSegCharsNum = segment.Text.Length;
                    return true;
                }
                currentSegmentIndex++;
            }
            charIndex += segment.Text.Length;
        }
    }

    return false;
}

This is the code that fragments the PDF content and loops through each fragment to get its text and coordinates.
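
MergeTextFragments is called in the function above but is not shown anywhere in this thread. The following is only a minimal sketch of one way such a helper might work, assuming that a split word leaves two fragments on the same line whose boxes nearly touch, with a letter or digit on both sides of the break. Frag, FragmentMerger, maxGap and lineTolerance are hypothetical names; Frag stands in for the Text and Rectangle (LLX, LLY, URX, URY) values of a TextFragment so that the sketch runs without Aspose.PDF. It is not the fix tracked by Aspose, and the tolerances would need tuning per document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a fragment's text and bounding box (PDF points).
public record Frag(string Text, double LLX, double LLY, double URX, double URY);

public static class FragmentMerger
{
    // Assumes fragments arrive in reading order, as the absorber returned them.
    public static List<Frag> Merge(IEnumerable<Frag> fragments,
                                   double maxGap = 1.0,        // assumed horizontal tolerance
                                   double lineTolerance = 2.0) // assumed vertical tolerance
    {
        var merged = new List<Frag>();
        foreach (var current in fragments)
        {
            int last = merged.Count - 1;
            if (last >= 0 && IsContinuation(merged[last], current, maxGap, lineTolerance))
            {
                var previous = merged[last];
                // Join the texts and grow the box to cover both fragments.
                merged[last] = new Frag(
                    previous.Text + current.Text,
                    Math.Min(previous.LLX, current.LLX),
                    Math.Min(previous.LLY, current.LLY),
                    Math.Max(previous.URX, current.URX),
                    Math.Max(previous.URY, current.URY));
            }
            else
            {
                merged.Add(current);
            }
        }
        return merged;
    }

    // True when 'current' looks like the rest of a word that 'previous' started.
    private static bool IsContinuation(Frag previous, Frag current,
                                       double maxGap, double lineTolerance)
    {
        if (previous.Text.Length == 0 || current.Text.Length == 0) return false;

        bool sameLine = Math.Abs(previous.LLY - current.LLY) <= lineTolerance;
        bool touching = Math.Abs(current.LLX - previous.URX) <= maxGap;
        bool midWord = char.IsLetterOrDigit(previous.Text[previous.Text.Length - 1])
                       && char.IsLetterOrDigit(current.Text[0]);
        return sameLine && touching && midWord;
    }
}
```

Because merging only concatenates adjacent texts, character offsets counted across the merged list stay the same as offsets counted across the original fragments. With the real library, the same test could be applied to each TextFragment's Text and Rectangle before building the word index.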

Regards,
Ramya

@Ramya_kalicharan

Can you please share your sample PDF document with us as well? We will test the scenario in our environment and address it accordingly.

Hi Asad,

Please find the PDF below.

PDF_SinglePage.pdf (17.8 KB)

I look forward to your reply.
Regards,
Ramya.B

@Ramya_kalicharan

It looks like you have reported a similar issue before, and it has already been logged in our issue management system under ticket ID PDFNET-58772. We have linked the ticket to this forum thread, so you will receive a notification as soon as it is resolved. Please be patient and allow us some time.

We are sorry for the inconvenience.