Aspose TextFragmentAbsorber splits words across fragments

Hi all, I am working on a desktop application where I extract the text of a PDF, check the bounding boxes, and then populate each word into a word index.

However, I am having an issue with TextFragmentAbsorber: it splits a word during fragmenting, so we see two entries in the word index for one word. For example, “Plaintiff GIL R. BOWER provides the following written responses, including objections, to the” is fragmented into two. The first fragment is “Plaintiff GIL R. BOWER provides the following writ” and the second fragment is “ten responses, including objections, to the”.

So “writ” and “ten” appear as two separate entries in the word index, which is not desired.

Below is the first code snippet:

public bool Aspose_GetBoundedSegment(int segmentIndex, int pageID, double left, double top, double right, double bottom, out int nSegCharStart, out int nSegCharsNum)
{
    nSegCharStart = 0;
    nSegCharsNum = 0;
 
    // Check if the page exists in the dictionary
    if (!pageDictionary.TryGetValue(pageID, out var pageTuple))
    {
        Console.WriteLine("Page not found for the given page ID.");
        return false;
    }
    Aspose.Pdf.Page page = pageTuple.Item2;
    // Check if the TextFragmentAbsorber exists in the dictionary

    if (!textAbsorberDictionary.TryGetValue(pageID, out var textAbsorber))
    {
        Console.WriteLine("TextFragmentAbsorber not found for the given page ID.");
        return false;
    }
    TextFragmentCollection textFragments = textAbsorber.TextFragments;
    int segmentCount = 0;
    int charIndex = 0;
 
    // Iterate through the text fragments to find the bounded segment
    foreach (TextFragment fragment in textFragments)
    {
        foreach (TextSegment segment in fragment.Segments)
        {
            Aspose.Pdf.Rectangle segmentRect = segment.Rectangle;
            // Check if the segment is within the specified rectangle
            if (segmentRect.LLX >= left && segmentRect.URY <= top && segmentRect.URX <= right && segmentRect.LLY >= bottom)
            {
                if (segmentCount == segmentIndex)
                {
                    nSegCharStart = charIndex;
                    nSegCharsNum = segment.Text.Length;
                    return true;
                }
                segmentCount++;
            }
            charIndex += segment.Text.Length;
        }
    }
    // Ensure out parameters are assigned before returning false
    nSegCharStart = -1;
    nSegCharsNum = 0;
    return false;
}

Below is the second code snippet:

public bool Aspose_GetBoundedSegment1(int segmentIndex, int pageID, double left, double top, double right, double bottom, out int nSegCharStart, out int nSegCharsNum)
{
     nSegCharStart = 0;
     nSegCharsNum = 0;
 
     // Check if the page exists in the dictionary
     if (!pageDictionary.ContainsKey(pageID))
     {
         return false;
     }
     // Get the page from the dictionary
     var page = pageDictionary[pageID].Item2;
     if (page == null)
     {
         return false;
     }
     // Create TextAbsorber object to extract text
     TextAbsorber textAbsorber = new TextAbsorber();
     // Accept the absorber for the current page
     page.Accept(textAbsorber);
     // Get the extracted text
     string extractedText = textAbsorber.Text;
     // Split the text into lines
     string[] lines = extractedText.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
     // Iterate through the lines to find the bounded segment
     int currentCharIndex = 0;
     foreach (string line in lines)
     {
         if (string.IsNullOrEmpty(line))
         {
             currentCharIndex += 1; // Account for the newline character
             continue;
         }
 
         // Search the page for this line's text (no TextSearchOptions are passed, so the phrase is matched literally)
         TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(line);
         page.Accept(textFragmentAbsorber);
         TextFragmentCollection textFragments = textFragmentAbsorber.TextFragments;
         foreach (TextFragment textFragment in textFragments)
         {
             Aspose.Pdf.Rectangle rect = textFragment.Rectangle;
             if (rect.LLX >= left && rect.LLY >= bottom && rect.URX <= right && rect.URY <= top)
             {
                 if (segmentIndex == 0)
                 {
                     nSegCharStart = currentCharIndex;
                     nSegCharsNum = line.Length;
                     return true;
                 }
                 segmentIndex--;
             }
         }
         currentCharIndex += line.Length + 1; // +1 for the newline character
     }
     return false;
}

Neither worked as expected: the word split still happens. I am looking for a way to extract text without words being split.
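
One possible caller-side workaround, sketched here under assumptions and not a confirmed Aspose fix: rejoin fragments that sit on the same visual line before tokenizing into words. The `Frag` class and `FragmentMerger.MergeLines` below are hypothetical helpers, not part of Aspose.PDF. `Frag` would be filled from `TextFragment.Text` and `TextFragment.Rectangle` (LLX, LLY, URX, URY), and the `gapRatio` threshold of 0.15 is a guess that would need tuning per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical holder for the values read from TextFragment.Text and TextFragment.Rectangle.
public sealed class Frag
{
    public string Text;
    public double LLX, LLY, URX, URY;
    public Frag(string text, double llx, double lly, double urx, double ury)
    { Text = text; LLX = llx; LLY = lly; URX = urx; URY = ury; }
}

public static class FragmentMerger
{
    // Rebuilds one string per visual line. Fragments separated by a gap narrower than
    // gapRatio * fragment height are concatenated directly (same word); wider gaps get a space.
    public static List<string> MergeLines(IEnumerable<Frag> frags, double gapRatio = 0.15)
    {
        // Group into lines, top to bottom (PDF y grows upward): a fragment joins the current
        // line when its bottom edge is within half its height of the line's first fragment.
        var lines = new List<List<Frag>>();
        foreach (Frag f in frags.OrderByDescending(x => x.LLY))
        {
            List<Frag> current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].LLY - f.LLY) < (f.URY - f.LLY) / 2)
                current.Add(f);
            else
                lines.Add(new List<Frag> { f });
        }

        var result = new List<string>();
        foreach (List<Frag> line in lines)
        {
            var sb = new StringBuilder();
            Frag prev = null;
            foreach (Frag f in line.OrderBy(x => x.LLX))
            {
                // A wide horizontal gap means a word boundary; a narrow one means the same word.
                bool wideGap = prev != null && f.LLX - prev.URX > gapRatio * (f.URY - f.LLY);
                // Avoid doubling a space that the fragment text already carries.
                bool hasSpace = (sb.Length > 0 && char.IsWhiteSpace(sb[sb.Length - 1]))
                                || (f.Text.Length > 0 && char.IsWhiteSpace(f.Text[0]));
                if (wideGap && !hasSpace) sb.Append(' ');
                sb.Append(f.Text);
                prev = f;
            }
            result.Add(sb.ToString());
        }
        return result;
    }
}
```

For example, passing the two fragments quoted above with made-up rectangles (same bottom edge, 12 units high, 0.4 units apart horizontally) returns a single line in which "writ" and "ten" are rejoined as "written". Real documents may need a different threshold, and this sketch does not handle rotated text or words hyphenated across lines.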

image.png (22.2 KB)

image.png (33.8 KB)

Screenshots of the wordIndex XML are attached for reference.

Any help would be greatly appreciated.

Thanks,
Ramya.B

@Ramya_kalicharan
Please attach the document in which this happens.

Hi Sergei,

Please find the document attached.

PDF_SinglePage.pdf (17.8 KB)

Regards,
Ramya.B

@Ramya_kalicharan
When we copied the provided code into the development environment to reproduce the problem, it produced many errors, and for many of them it is not obvious how the code should look.
Please provide a code fragment that shows exactly the part where TextFragmentAbsorber does not work correctly.
2.png (46.7 KB)

Hi Sergei,

Sorry for the confusion. Those dictionaries are used to store the page handles.
I have shared the complete file below.
Program.7z (14.0 KB)

Regards,
Ramya.B

@Ramya_kalicharan

We could not download and extract the file that you shared: our system flags it as containing a virus and does not let us download it. Instead of the entire program, please share a minimal code sample or the specific word that you are searching for with the API and that comes back split. We will test the scenario in our environment and address it accordingly.

Hi Asad,

Sorry to hear there was a problem with the download. I will add the code here:
public class Program
{
static void Main(string[] args)
{
DotNetClass dotNetobj = new DotNetClass();

    // Example file path and page index
    string filePath = "C:\\PDF files\\PDF_SinglePage.pdf";
    //int pageIndex = 0;

    ASPOSE_DOCUMENT documentHandle = Aspose_LoadDocument(filePath, NULL);

	int pageCount = Aspose_GetPageCount(documentHandle);
	//Console.WriteLine($"Page Count: {pageCount}");
	for (int i = 0; i < pageCount; i++)
	{
		int localPageID = Aspose_LoadPage(documentHandle, i);

		//// Example usage of the loaded page
		double pageWidth = Aspose_GetPageWidth(localPageID);
		double pageHeight = Aspose_GetPageHeight(localPageID);

		// Load the text page using the page handle
		bool resultLoadpage = AsposeText_LoadPage(localPageID);
		if (!resultLoadpage)
		{
			RLLOG_ERROR("Failed to load text page");
			return false;
		}
		double left, right, bottom, top;
		left = right = bottom = top = 0;
		int iRect = 1;
		bool resultRect = Aspose_GetRectangle(localPageID, iRect, &left, &right, &bottom, &top);

		if (resultRect)
		{
			int nSegCharStart, nSegCharsNum;
			int nPageSegsNum = Aspose_CountBoundedSegments(localPageID, left, right, bottom, top);
			for (int s = 0; s < nPageSegsNum; ++s)
			{
				bool resultSeg = Aspose_GetBoundedSegment(s, localPageID, left, top, right, bottom, &nSegCharStart, &nSegCharsNum);
				if (resultSeg)
				{
					CString czWord;
					int nWordStart = 0;
					for (int j = 0; j < nSegCharsNum; j++)
					{
						unsigned int unicode = Aspose_GetUnicode(localPageID, nSegCharStart + j);
						unsigned char c = 0;
						if (unicode <= 255)
						{
							c = static_cast<char>(unicode);
						}
						else
						{
							c = ' ';
						}
						if (!Aspose_IsTextGenerated(c) && (c <= 128 && (isalpha(c) || isdigit(c))))
						{
							if (czWord.IsEmpty())
							{
								nWordStart = nSegCharStart + j; // index of this word's first character, not of the segment
							}
							czWord += c;
						}
						else
						{

							//RLLOG_DEBUG("Word: " << czWord);
							if (!czWord.IsEmpty())
							{
								AsposeAddWord(pOCRData, czWord, nWordStart, localPageID);
								outFile << "Aspose Word :" << (LPCTSTR)czWord << std::endl;
								czWord.Empty();
							}
							//AddWord(pOCRData, czWord, nWordStart, localPageID);
							
						}
					}
					RLLOG_DEBUG("Aspose Word: " << czWord);
					//AddWord(pOCRData, czWord, nWordStart, localPageID);
					AsposeAddWord(pOCRData, czWord, nWordStart, localPageID);
				}
			}
		}
	}
}

}

public class AsposePDFApi
{

private static readonly Lazy<AsposePDFApi> instance = new Lazy<AsposePDFApi>(() => new AsposePDFApi());

static int pageNumb = 1;

public Dictionary<int, (MemoryStream, Aspose.Pdf.Page)> pageDictionary = new Dictionary<int, (MemoryStream, Aspose.Pdf.Page)>();
public Dictionary<IntPtr, Aspose.Pdf.Page> pagePointerDictionary = new Dictionary<IntPtr, Aspose.Pdf.Page>();

public Dictionary<int, TextFragmentAbsorber> textAbsorberDictionary = new Dictionary<int, TextFragmentAbsorber>();
private Dictionary<IntPtr, Document> documentDictionary = new Dictionary<IntPtr, Document>();

// Private constructor to prevent instantiation
private AsposePDFApi()
{
    pageDictionary = new Dictionary<int, (MemoryStream, Aspose.Pdf.Page)>();
    textAbsorberDictionary = new Dictionary<int, TextFragmentAbsorber>();
}

// Public static method to get the single instance of the class
public static AsposePDFApi Instance
{
    get
    {
        return instance.Value;
    }
}

// Method: Aspose_LoadDocument Load a PDF document from a file.
public IntPtr Aspose_LoadDocument(string filePath, string password)
{
    // Load the document from file
    byte[] fileData = System.IO.File.ReadAllBytes(filePath);
    GCHandle handle = GCHandle.Alloc(fileData, GCHandleType.Pinned);
    return (IntPtr)handle;
}

// Method: Aspose_GetPageCount Get the total number of pages in the loaded PDF document.
public int Aspose_GetPageCount(IntPtr document)
{
    try
    {// Return the number of pages in the document
        GCHandle handle = (GCHandle)document;
        byte[] fileData = (byte[])handle.Target;
        using (var stream = new System.IO.MemoryStream(fileData))
        {
            Document pdfDocument = new Document(stream);
            return pdfDocument.Pages.Count;
        }
    }
    catch (Exception ex)
    {
        //MessageBox.Show($"An error occurred while loading the page: {ex.Message}");
        return -1;
    }

}

// Method: Aspose_LoadPage Load a page from the PDF document and return its page ID.
public int Aspose_LoadPage(IntPtr document, int pageIndex)
{
    GCHandle handle = (GCHandle)document;
    byte[] fileData = (byte[])handle.Target;
    var stream = new MemoryStream(fileData);

    Document pdfDocument = new Document(stream);
    Aspose.Pdf.Page pdfPage;
    try
    {
        pdfPage = pdfDocument.Pages[pageIndex + 1]; // Aspose.PDF pages are 1-based
        int pageId = pdfPage.GetHashCode();

        // Check for hash code collision
        if (pageDictionary.ContainsKey(pageId))
        {
            MessageBox.Show($"Hash code collision detected for page ID.", "Information", MessageBoxButtons.OK, MessageBoxIcon.Information);

            throw new InvalidOperationException("Hash code collision detected for page ID.");
        }

        pageDictionary[pageId] = (stream, pdfPage);
        return pageId;
    }
    catch (ArgumentOutOfRangeException ex)
    {
        MessageBox.Show($"Error: Page index {pageIndex} is out of range. {ex.Message}");
        return -1;
    }
    catch (Exception ex)
    {
        MessageBox.Show($"An error occurred while loading the page: {ex.Message}");
        return -1;
    }

}

// Method: AsposeText_LoadPage Extract the text fragments of a loaded page and cache the absorber.
public bool AsposeText_LoadPage(int pageID)
{
    try
    {

        if (pageDictionary.TryGetValue(pageID, out var pageTuple))
        {
            //MessageBox.Show($"PageID : {pageID}", "Information", //MessageBoxButtons.OK, //MessageBoxIcon.Information);
            MemoryStream stream = pageTuple.Item1;
            Aspose.Pdf.Page pagehandle = pageTuple.Item2;

            // Ensure the MemoryStream is not disposed
            if (stream == null || !stream.CanRead)
            {
                Console.WriteLine("MemoryStream is not available or has been disposed.");
                throw new InvalidOperationException("MemoryStream is not available or has been disposed.");
            }

            // Reinitialize the Document object using the MemoryStream
            stream.Seek(0, SeekOrigin.Begin); // Reset the stream position
            Document pdfDocument = new Document(stream);

            // Retrieve the Page from the reinitialized Document
            pagehandle = pdfDocument.Pages[pagehandle.Number];

            // Ensure the page object is not null
            if (pagehandle == null || pagehandle.PageInfo == null)
            {
                throw new InvalidOperationException("Page object is not fully initialized or is null.");
            }


            // Create a TextFragmentAbsorber to find text within the page
            TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber();

            // Accept the absorber to extract text
            pagehandle.Accept(textAbsorber);
            textAbsorberDictionary[pageID] = textAbsorber;
            //Console.WriteLine("PageID");
            return true;
        }
        else
        {
            throw new KeyNotFoundException("Page not found.");
        }
    }
    catch (Exception ex)
    {
        //MessageBox.Show($"Error in Aspose_LoadTextPageFromPage:  {ex.Message}", "Information", //MessageBoxButtons.OK, //MessageBoxIcon.Information);
        // Log or handle the exception as needed
        return false; // Return false if an exception occurs
    }
}

// Function: Aspose_GetPageWidth Get the width of a specific page (exported function).
public double Aspose_GetPageWidth(int pageHandle)
{
    if (pageDictionary.TryGetValue(pageHandle, out var pageTuple))
    {
        return pageTuple.Item2.Rect.Width;
    }
    throw new KeyNotFoundException("Page not found.");
}

// Function: Aspose_GetPageHeight Get the height of a specific page (exported function).
public double Aspose_GetPageHeight(int pageHandle)
{
    if (pageDictionary.TryGetValue(pageHandle, out var pageTuple))
    {
        return pageTuple.Item2.Rect.Height;
    }
    throw new KeyNotFoundException("Page not found.");
}

public bool Aspose_GetRectangle(int pageID, int iRect, out double left, out double right, out double bottom, out double top)
{
    left = right = bottom = top = 0;
    try
    {

        if (pageDictionary.TryGetValue(pageID, out var pageTuple))
        {
            Aspose.Pdf.Page page = pageTuple.Item2;

            Aspose.Pdf.Rectangle rect = null;
            switch (iRect)
            {
                case 0:
                    rect = page.Rect;
                    break;
                case 1:
                    rect = page.CropBox;
                    break;
                case 2:
                    rect = page.MediaBox;
                    break;
                case 3:
                    rect = page.CropBox;
                    break;
                case 4:
                    rect = page.TrimBox;
                    break;
                case 5:
                    rect = page.ArtBox;
                    break;
                case 6:
                    rect = page.BleedBox;
                    break;
                default:
                    left = right = bottom = top = 0;
                    return false;
            }

            if (rect != null)
            {
                left = rect.LLX;
                right = rect.URX;
                bottom = rect.LLY;
                top = rect.URY;
                return true;
            }
        }
        else
        {
            throw new KeyNotFoundException("Page not found for the given page ID.");
        }
    }
    catch (Exception ex)
    {
        // Log or handle the exception as needed
        Console.WriteLine($"Error in Aspose_GetRectangle: {ex.Message}");
    }
    return false;

}

public int Aspose_CountBoundedSegments(int pageID, double left, double right, double bottom, double top)
{
    // Retrieve the Page object from the dictionary using PageID
    try
    {
        if (pageDictionary.TryGetValue(pageID, out var pageTuple))
        {
            Aspose.Pdf.Page page = pageTuple.Item2;

            // Retrieve the TextFragmentAbsorber from the dictionary using PageID
            if (textAbsorberDictionary.TryGetValue(pageID, out var textAbsorber))
            {
                TextFragmentCollection textFragments = textAbsorber.TextFragments;

                int segmentCount = 0;

                // Iterate through the text fragments to count the bounded segments
                foreach (TextFragment fragment in textFragments)
                {
                    foreach (TextSegment segment in fragment.Segments)
                    {
                        Aspose.Pdf.Rectangle segmentRect = segment.Rectangle;
                        if (segmentRect.LLX >= left && segmentRect.URY <= top && segmentRect.URX <= right && segmentRect.LLY >= bottom)
                        {
                            segmentCount++;
                        }
                    }
                }

                return segmentCount;

            }
            else
            {
                throw new KeyNotFoundException("TextFragmentAbsorber not found for the given page ID.");
            }
        }
        else
        {
            throw new KeyNotFoundException("Page not found for the given page ID.");
        }
    }
    catch (Exception ex)
    {
        // Log or handle the exception as needed
        Console.WriteLine($"Error in CountBoundedSegments: {ex.Message}");
        return 0;
    }
}

public bool Aspose_GetBoundedSegment(int segmentIndex, int pageID, double left, double top, double right, double bottom, out int nSegCharStart, out int nSegCharsNum)
{
    nSegCharStart = 0;
    nSegCharsNum = 0;

    // Check if the page exists in the dictionary
    if (!pageDictionary.TryGetValue(pageID, out var pageTuple))
    {
        Console.WriteLine("Page not found for the given page ID.");
        return false;
    }

    Aspose.Pdf.Page page = pageTuple.Item2;

    // Check if the TextFragmentAbsorber exists in the dictionary
    if (!textAbsorberDictionary.TryGetValue(pageID, out var textAbsorber))
    {
        Console.WriteLine("TextFragmentAbsorber not found for the given page ID.");
        return false;
    }

    TextFragmentCollection textFragments = textAbsorber.TextFragments;
    int segmentCount = 0;
    int charIndex = 0;

    // Iterate through the text fragments to find the bounded segment
    foreach (TextFragment fragment in textFragments)
    {
        foreach (TextSegment segment in fragment.Segments)
        {
            Aspose.Pdf.Rectangle segmentRect = segment.Rectangle;

            // Check if the segment is within the specified rectangle
            if (segmentRect.LLX >= left && segmentRect.URY <= top && segmentRect.URX <= right && segmentRect.LLY >= bottom)
            {
                if (segmentCount == segmentIndex)
                {
                    nSegCharStart = charIndex;
                    nSegCharsNum = segment.Text.Length;
                    return true;
                }
                segmentCount++;
            }
            charIndex += segment.Text.Length;
        }
    }

    // Ensure out parameters are assigned before returning false
    nSegCharStart = -1;
    nSegCharsNum = 0;
    return false;
}

public uint Aspose_GetUnicode(int pageID, int index)
{
    // Reuse the absorber cached by AsposeText_LoadPage instead of re-parsing the page
    // for every character; this also keeps indices consistent with Aspose_GetBoundedSegment.
    if (textAbsorberDictionary.TryGetValue(pageID, out var textFragmentAbsorber))
    {

        // Get the collection of text fragments
        TextFragmentCollection textFragments = textFragmentAbsorber.TextFragments;
        int currentIndex = 0;

        // Iterate through the text fragments and their segments
        foreach (TextFragment fragment in textFragments)
        {
            foreach (TextSegment segment in fragment.Segments)
            {
                // Check if the current segment contains the character at the specified index
                if (currentIndex + segment.Text.Length > index)
                {
                    return segment.Text[index - currentIndex];
                }
                currentIndex += segment.Text.Length;
            }
        }
    }

    // Return 0 if the index is out of range or character is not found
    return 0;
}


public bool Aspose_IsTextGenerated(char character)
{
    return char.IsWhiteSpace(character) || character == '\n' || character == '\r';
}

}

}

I hope this code snippet is sufficient.

Regards,
Ramya.B

@Ramya_kalicharan

We apologize for the trouble you have had getting this issue conveyed and addressed. It looks like the main cause is that the API is not able to extract some words as a single line and is inserting a line break that splits the words. Please check the minimal code sample below:

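// dataDir is the folder that contains the input PDF
Document doc = new Document(dataDir + "input.pdf");
// The second argument (true) makes the absorber treat the phrase as a regular expression
TextFragmentAbsorber absorber = new TextFragmentAbsorber(@"Some Word", new TextSearchOptions(true));
doc.Pages[1].Accept(absorber); // pages are 1-based
if (absorber.TextFragments.Count > 0)
{
    foreach (TextFragment tf in absorber.TextFragments)
    {
        Console.WriteLine(tf.Text);
    }
}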

Please let us know which words in your PDF the API is not extracting as expected.

Asad,

I tried the code snippet you shared, but it only works for the specific word passed to the TextFragmentAbsorber constructor. What I want is for all the words of the PDF to be extracted so that I can populate them into a word index or appendix, as in the screenshot below.
image.png (960 Bytes)

image.jpg (102.2 KB)

The Visual Studio screenshot shows the fragments of all the PDF data returned by TextFragmentAbsorber, where the text has been split by the TextFragment API. I have highlighted the affected text.
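
To make the goal concrete, here is a minimal sketch of the word-index step, independent of Aspose. `WordIndexer.Tokenize` is a hypothetical helper; it assumes the text of a line has already been assembled into one string, and it uses the same "ASCII letters and digits only" rule as the character loop in the program posted earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordIndexer
{
    // Returns every run of ASCII letters or digits in `text` together with the
    // character offset at which the run starts.
    public static List<(string Word, int Start)> Tokenize(string text)
    {
        var words = new List<(string Word, int Start)>();
        foreach (Match m in Regex.Matches(text, "[A-Za-z0-9]+"))
            words.Add((m.Value, m.Index));
        return words;
    }
}
```

Run over a whole line, a word such as "written" yields one entry. Run separately over the two fragments, it yields "writ" and "ten", which is the behaviour shown in the screenshot.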

I hope this makes my issue clear.

Regards,
Ramya.B

@Ramya_kalicharan

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58772

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.