Extract text from a PDF as a set of rows

Hi there,

I’m using Aspose.Pdf for .NET and I want to extract text from a particular page of a PDF document. I don’t need plain text (System.String), which can be obtained using “TextAbsorber” and “TextDevice”. I need text represented by “TextFragment” instances, with additional information such as font, location, etc. I know how to do this using “Aspose.Pdf.Text.TextFragmentAbsorber”; my source code example is below. The problem is that when “TextFragmentAbsorber” returns a “TextFragmentCollection”, each item in the collection may be a single word, several words, or even a single space character. Across different documents, it is impossible to predict how much text will end up in a single “TextFragment” instance.

So, my question is: is it possible to extract text from a PDF using Aspose.Pdf so that the text is grouped by rows? Something like the “System.IO.File.ReadAllLines” method, which would return a collection of all lines (rows) on a page, with each line represented by a collection of all the “TextFragment” instances in that line.

The source code below shows how to obtain a set of “TextFragment” instances. It uses the attached “candy.pdf” file. As you can see, with this approach it is impossible to tell which text fragments belong to the first line of the document, which to the second, and so on. If you iterate through the loop, you’ll see that most of these text fragments are a single space character.

// folder_name is the directory that contains the attached candy.pdf
string full_path = folder_name + "candy.pdf";
Aspose.Pdf.Document doc = new Aspose.Pdf.Document(full_path);
Aspose.Pdf.Page firstPage = doc.Pages[1]; // page numbering starts at 1
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
firstPage.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;

foreach (Aspose.Pdf.Text.TextFragment oneTextFragment in collection)
{
    string text = oneTextFragment.Text; // a word, a few words or a single space, not a row
}

Thanks and waiting for your response.

With best regards,
Denis Gvardionov

Hi Denis,


Thanks for contacting support.

I have tested the scenario with the shared candy.pdf and the code snippet above and, as per my observations, some TextFragments contain a single character, a blank space, or just a few words. For the sake of correction, I have logged it in our issue tracking system as PDFNEWNET-39163. We will investigate this issue in detail and keep you updated on the status of a correction. We apologize for the inconvenience.


However, during my testing I also observed that when selecting the PDF file contents, the text is not selected as a single entity: it appears as a combination of blank characters, single words, and chunks of a few words rather than as a single fragment. Please take a look at the attached image file. By the way, when the same code snippet is used with another PDF file, each line is extracted as a separate TextFragment.

Hi Denis,

In addition to the above reply, you may group the words of a row on the basis of the LLY coordinate of the TextFragment rectangle. Please check the sample code snippet below; you may refine or improve it as per your needs. Hopefully it will help you accomplish the task.

string full_path = myDir + "candy.pdf";
// accumulates the text of the row currently being read
System.Text.StringBuilder builder = new System.Text.StringBuilder();
// LLY (lower-left Y) of the row currently being read
double LLY_position = 0;

Aspose.Pdf.Document doc = new Aspose.Pdf.Document(full_path);
Aspose.Pdf.Page firstPage = doc.Pages[1];
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
firstPage.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;

foreach (Aspose.Pdf.Text.TextFragment oneTextFragment in collection)
{
    // a different LLY means a new row has started
    if (oneTextFragment.Rectangle.LLY != LLY_position)
    {
        // print the finished row (there is nothing to print before the first fragment)
        if (builder.Length > 0)
            Console.WriteLine("line text: {0}", builder.ToString());
        builder.Clear();
        LLY_position = oneTextFragment.Rectangle.LLY;
    }
    builder.Append(oneTextFragment.Text);
}

// print the last row, which the loop itself never flushes
if (builder.Length > 0)
    Console.WriteLine("line text: {0}", builder.ToString());
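
If the fragments do not arrive in reading order, or if baselines on the same row differ slightly, comparing LLY values for exact equality may split a row. The following is only a rough sketch of the same grouping idea with sorting and a tolerance. It does not use Aspose.Pdf: the Fragment class is a hypothetical stand-in for the Text, Rectangle.LLX and Rectangle.LLY values read from each TextFragment, and the 1.0 point tolerance is an assumption that you would tune for your documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the values read from a TextFragment:
// Text, Rectangle.LLX and Rectangle.LLY.
public sealed class Fragment
{
    public string Text { get; }
    public double Llx { get; }
    public double Lly { get; }

    public Fragment(string text, double llx, double lly)
    {
        Text = text;
        Llx = llx;
        Lly = lly;
    }
}

public static class RowGrouper
{
    // Groups fragments into rows, ordered top to bottom, then left to right.
    // A fragment joins the current row when its LLY is within `tolerance`
    // (an assumed value) of the first fragment placed in that row.
    public static List<List<Fragment>> GroupIntoRows(
        IEnumerable<Fragment> fragments, double tolerance = 1.0)
    {
        var rows = new List<List<Fragment>>();

        // PDF y coordinates grow upwards, so a larger LLY is higher on the page.
        foreach (var fragment in fragments.OrderByDescending(x => x.Lly))
        {
            if (rows.Count > 0 &&
                Math.Abs(rows[rows.Count - 1][0].Lly - fragment.Lly) <= tolerance)
            {
                rows[rows.Count - 1].Add(fragment);
            }
            else
            {
                rows.Add(new List<Fragment> { fragment });
            }
        }

        // Order the fragments of each row from left to right.
        return rows.Select(row => row.OrderBy(x => x.Llx).ToList()).ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var fragments = new[]
        {
            new Fragment("world", 60, 700.2),
            new Fragment("Second", 10, 680),
            new Fragment("Hello", 10, 700),
            new Fragment(" ", 50, 700),
            new Fragment(" row", 50, 680),
        };

        foreach (var row in RowGrouper.GroupIntoRows(fragments))
            Console.WriteLine("line text: {0}", string.Concat(row.Select(x => x.Text)));
    }
}
```

With real documents you would fill the list from absorber.TextFragments, and each inner list then holds the fragments of one row, which is close to the “ReadAllLines”-style result asked for above.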

Please feel free to contact us for any further assistance.

Best Regards,

@groupdocs

Thanks for your patience.

We have investigated the issue and found that there is no bug. The point is that a TextFragmentAbsorber created with no parameters returns the physical text segments as fragments. (See ‘candy (1) segments.png’ for an example of the text segments present in the document.)

candy (1) segments.png (322.2 KB)

In order to instruct TextFragmentAbsorber to return text lines as fragments, it is necessary to pass it a corresponding regular expression.

Please consider the following code:

Aspose.Pdf.Document doc = new Aspose.Pdf.Document(myDir + "candy (1).pdf");
Aspose.Pdf.Page firstPage = doc.Pages[1];

// true = treat the search phrase as a regular expression
Aspose.Pdf.Text.TextSearchOptions options = new Aspose.Pdf.Text.TextSearchOptions(true);

// "(?m).*$" in multiline mode matches each line of text up to its end
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber("(?m).*$", options);
firstPage.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;

foreach (Aspose.Pdf.Text.TextFragment oneTextFragment in collection)
{
    string text = oneTextFragment.Text; // one row of text
    Console.WriteLine(String.Format("Extracted Text = '{0}'", text));
}

Console output: 39163_console_out.png (50.2 KB)

Please use the above code snippet with Aspose.PDF for .NET 18.12. In case you need any further assistance, please feel free to let us know.
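
For reference, the behaviour of the pattern "(?m).*$" can be tried on a plain string with System.Text.RegularExpressions, independently of Aspose.PDF. This is only an illustrative sketch: the sample text and the LinePatternDemo class are made up for the example, and how TextFragmentAbsorber builds the page text it applies the pattern to is not shown here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinePatternDemo
{
    // Returns the non-empty matches of "(?m).*$" in the given text.
    // (?m) makes $ match at the end of every line, and .* does not cross '\n',
    // so each non-empty match is one whole line. The pattern also produces
    // empty matches at line ends, which are filtered out here.
    public static string[] MatchLines(string text)
    {
        return Regex.Matches(text, "(?m).*$")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(v => v.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        // Made-up sample text with '\n' line breaks.
        foreach (var line in MatchLines("first row\nsecond row\nthird row"))
            Console.WriteLine("Extracted Text = '{0}'", line);
    }
}
```

Each non-empty match corresponds to one line of the sample text, which is why the absorber returns one fragment per row when it is given this pattern.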