Extract text from a PDF as a set of rows

groupdocs · August 7, 2015, 11:04am

Hi there,

I'm using Aspose.Pdf for .NET and I want to extract text from a particular page of a PDF document. I don't need plain text (System.String), which can be obtained using "TextAbsorber" and "TextDevice"; I need text represented by "TextFragment" instances, with additional information such as font, location, etc. I know how to do this using "Aspose.Pdf.Text.TextFragmentAbsorber", and my source code example is below. The problem is that when "TextFragmentAbsorber" returns a "TextFragmentCollection", each item in the collection is a single word, several words, or even a single space character. Across different documents it is impossible to predict how much text will end up in a single "TextFragment" instance.

So, my question is: is it possible to extract text from a PDF using Aspose.Pdf so that the text is grouped by rows? Something like the "System.IO.File.ReadAllLines" method, returning a collection of all lines (rows) on a page, where each line is represented by the collection of all "TextFragment" instances inside that line.

This source code shows how to obtain a set of "TextFragment" instances. It uses the "candy.pdf" file, which is attached. As you can see, with this approach it is impossible to determine which text fragments belong to the first line of the document, which to the second, and so on. If you iterate through the loop, you'll see that most of these text fragments are a single space character.

string full_path = folder_name + "candy.pdf";
Aspose.Pdf.Document doc = new Aspose.Pdf.Document(full_path);
Aspose.Pdf.Page firstPage = doc.Pages[1];
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
firstPage.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;

foreach (TextFragment oneTextFragment in collection)
{
    string text = oneTextFragment.Text; // not a row
}

Thanks and waiting for your response.

With best regards,
Denis Gvardionov

codewarior · August 10, 2015, 8:28am

Hi Denis,

Thanks for contacting support.

I have tested the scenario with the shared candy.pdf, using the same code snippet as shared above. As per my observations, some TextFragments appear as a single character or a blank space, or contain just a few words. For the sake of correction, I have logged it in our issue tracking system as PDFNEWNET-39163. We will investigate this issue in detail and will keep you updated on the status of a correction. We apologize for the inconvenience.

However, during my testing I also observed that when selecting the PDF file contents, the text is not selected as a single entity: it is not a single fragment but a combination of blank characters, single words, or chunks of a few words. Please take a look at the attached image file. Also, when the same code snippet is used on another PDF file, each line is extracted as a separate TextFragment.

tilal.ahmad · August 10, 2015, 9:07am

Hi Denis,

In addition to the above reply, you may group the words of a row based on the LLY coordinate (the lower-left Y) of each TextFragment's Rectangle. Please check the sample code snippet below; you may refine or improve it as per your needs. Hopefully it will help you accomplish the task.

string full_path = myDir + "candy.pdf";

System.Text.StringBuilder builder = new System.Text.StringBuilder();
double LLY_position = 0;

Aspose.Pdf.Document doc = new Aspose.Pdf.Document(full_path);
Aspose.Pdf.Page firstPage = doc.Pages[1];
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();

firstPage.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;

foreach (TextFragment oneTextFragment in collection)
{
    // A different LLY (lower-left Y of the fragment rectangle) starts a new row
    if (oneTextFragment.Rectangle.LLY != LLY_position)
    {
        // Print the finished row; skip the empty buffer before the first fragment
        if (builder.Length > 0)
            Console.WriteLine("line text: {0}", builder.ToString());
        builder.Clear();
        LLY_position = oneTextFragment.Rectangle.LLY;
    }
    builder.Append(oneTextFragment.Text);
}

// Print the last row, which the loop itself never flushes
if (builder.Length > 0)
    Console.WriteLine("line text: {0}", builder.ToString());
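
As a possible refinement of the snippet above (a minimal sketch, not an Aspose API): exact comparison of LLY values can split one visual row when baselines differ by a fraction of a point, and appending text to a StringBuilder loses the TextFragment instances themselves. The helper below groups arbitrary items into rows using a tolerance and keeps the items. `RowGrouper`, `GroupIntoRows` and the tolerance of 1.0 are hypothetical names and values chosen for illustration, and the demo uses plain tuples instead of real fragments so that it runs without Aspose.PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowGrouper
{
    // Groups items into rows. An item joins the current row when its Y value is
    // within `tolerance` of the Y of the first item placed in that row.
    // Rows come back top to bottom (PDF Y grows upward), items left to right.
    public static List<List<T>> GroupIntoRows<T>(
        IEnumerable<T> items, Func<T, double> getY, Func<T, double> getX, double tolerance)
    {
        var rows = new List<List<T>>();
        double rowY = 0;

        foreach (var item in items.OrderByDescending(getY))
        {
            double y = getY(item);
            if (rows.Count == 0 || Math.Abs(rowY - y) > tolerance)
            {
                rows.Add(new List<T>()); // start a new row
                rowY = y;
            }
            rows[rows.Count - 1].Add(item);
        }

        // Order each row by its X coordinate
        return rows.Select(row => row.OrderBy(getX).ToList()).ToList();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Stand-ins for text fragments: text plus lower-left X/Y (made-up values)
        var fragments = new[]
        {
            (Text: "world", X: 60.0, Y: 700.0),
            (Text: "Hello ", X: 10.0, Y: 700.4),
            (Text: "Second line", X: 10.0, Y: 680.0),
        };

        var rows = RowGrouper.GroupIntoRows(fragments, f => f.Y, f => f.X, 1.0);

        foreach (var row in rows)
            Console.WriteLine(string.Concat(row.Select(f => f.Text)));
        // prints:
        // Hello world
        // Second line
    }
}
```

With Aspose.PDF the call would presumably look like `RowGrouper.GroupIntoRows(absorber.TextFragments.Cast<TextFragment>(), f => f.Rectangle.LLY, f => f.Rectangle.LLX, 1.0)`, which returns each row as a list of TextFragment objects, so font and position information stays available. The right tolerance depends on the document.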

Please feel free to contact us for any further assistance.

Best Regards,

asad.ali · December 27, 2018, 8:16pm

@groupdocs

Thanks for your patience.

We have investigated the issue and found that it is not a bug. The point is that TextFragmentAbsorber with no parameters extracts the physical text segments as fragments. (See 'candy (1) segments.png' for an example of the text segments present in the document.)

candy (1) segments.png (322.2 KB)

To instruct TextFragmentAbsorber to return text lines as fragments, it is necessary to pass it a corresponding regular expression.

Please consider the following code:

Aspose.Pdf.Document doc = new Aspose.Pdf.Document(myDir + "candy (1).pdf");
Aspose.Pdf.Page firstPage = doc.Pages[1];

// true: treat the search phrase as a regular expression
TextSearchOptions options = new TextSearchOptions(true);

// "(?m).*$" matches up to the end of each line (multiline mode)
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber("(?m).*$", options);
firstPage.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;

foreach (TextFragment oneTextFragment in collection)
{
    string text = oneTextFragment.Text; // one text line per fragment
    Console.WriteLine(String.Format("Extracted Text = '{0}'", text));
}

Console output: 39163_console_out.png (50.2 KB)

Please use the above code snippet with Aspose.PDF for .NET 18.12, and feel free to let us know if you need any further assistance.