Hi there,
I’m using Aspose.Pdf for .NET and I want to extract text from a particular page of a PDF document. I don’t need plain text (System.String), which can be obtained using “TextAbsorber” and “TextDevice”. I need text represented by “TextFragment” instances, with additional information such as font, location, etc. I know how to do this using “Aspose.Pdf.Text.TextFragmentAbsorber”; my source code example is below. The problem is that when “TextFragmentAbsorber” returns a “TextFragmentCollection”, each item in the collection may be a single word, several words, or even a single space character. Across different documents, it is impossible to predict how much text will end up in a single “TextFragment” instance.
So, my question is: is it possible to extract text from a PDF using Aspose.Pdf so that the text is grouped by rows? I mean something like the “System.IO.File.ReadAllLines” method: a collection of all lines (rows) on a page, where each line is represented by the collection of all “TextFragment” instances in that line.
The source code below shows how to obtain a set of “TextFragment” instances. It uses the “candy.pdf” file, which is attached. As you can see, with this approach it is impossible to determine which text fragments belong to the first line of the document, which belong to the second, and so on. If you step through the loop, you’ll see that most of these text fragments are a single space character.
using Aspose.Pdf.Text; // needed for the unqualified TextFragment type below

// folder_name is the directory that contains the attached file
string full_path = folder_name + "candy.pdf";
Aspose.Pdf.Document doc = new Aspose.Pdf.Document(full_path);
Aspose.Pdf.Page firstPage = doc.Pages[1]; // page numbering is 1-based
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
firstPage.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;
foreach (TextFragment oneTextFragment in collection)
{
    // A word, several words, or just a space; not a row
    string text = oneTextFragment.Text;
}
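To make the question more concrete, here is a rough sketch of the kind of result I am after. It is only my own workaround idea, not something I have found in the Aspose.Pdf API: it groups items whose vertical positions are close to each other into one row and then sorts each row from left to right. The `RowGrouper` class, the `GroupIntoRows` method, and the `tolerance` value of 2 points are all hypothetical names and numbers I made up for illustration, and the approach assumes simple horizontal, unrotated, single-column text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of Aspose.Pdf): groups positioned items into rows.
public static class RowGrouper
{
    // getX / getY return the position of an item in page coordinates.
    // tolerance is an assumed threshold for treating two Y values as the same row.
    public static List<List<T>> GroupIntoRows<T>(
        IEnumerable<T> items,
        Func<T, double> getX,
        Func<T, double> getY,
        double tolerance = 2.0)
    {
        var rows = new List<List<T>>();
        double rowY = 0; // Y of the first item placed in the row being filled

        // PDF Y coordinates grow upward, so descending Y visits the top of the page first.
        foreach (var item in items.OrderByDescending(getY))
        {
            double y = getY(item);
            if (rows.Count == 0 || Math.Abs(rowY - y) > tolerance)
            {
                rows.Add(new List<T>()); // start a new row
                rowY = y;
            }
            rows[rows.Count - 1].Add(item);
        }

        // Within each row, order the items from left to right.
        return rows.Select(row => row.OrderBy(getX).ToList()).ToList();
    }
}
```

With the absorber from my snippet, I would expect to call it roughly as `RowGrouper.GroupIntoRows(absorber.TextFragments.Cast<TextFragment>(), f => f.Position.XIndent, f => f.Position.YIndent)`, assuming `Position.XIndent` and `Position.YIndent` are the right coordinates to compare. I would much prefer a built-in way to do this, because a fixed tolerance will probably break on documents with mixed font sizes, superscripts, or multiple columns.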
Thanks, I look forward to your response.
With best regards,
Denis Gvardionov