Hi,
We are using the Aspose.PDF library to parse text from PDFs and then matching against the text contained in the text fragment collections. Most of the time, the text fragments are full lines of text, or at least most of a line. But on some of the PDFs we are trying to use, the text fragments contain only a letter or two. Is there a way for the text fragment absorber to recognize lines instead of small word chunks? Below is the code we use to collect text fragments, and attached is one of the PDFs that behaves this way. To illustrate: the title in this particular PDF is “WEST VIRGINIA DIVISION OF BANKING TANGIBLE NET BENEFIT WORKSHEET”. Getting the first 15 text fragments produces “WEST VIRGINIA DIVISION OF” or something close to that. I would expect the first two lines to come back as roughly two text fragments.
public TextFragment[] CollectTextFragments(Page page)
{
    var textFragmentAbsorber = new TextFragmentAbsorber
    {
        TextSearchOptions =
        {
            Rectangle = new Rectangle(0, (page.Rect.Height / 2), page.Rect.Width, page.Rect.Height),
            LimitToPageBounds = true
        },
    };
    page.Accept(textFragmentAbsorber);
    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    var textFragColl = new TextFragment[textFragmentCollection.Count];
    textFragmentCollection.CopyTo(textFragColl, 0);
    return textFragColl;
}
Thanks so much for your help!
Phil
Hi Phil,
Document pdfDocument = new Document(@"c:\pdftest\West+Virginia.pdf");
Page page = pdfDocument.Pages[1];
var textFragmentAbsorber = new TextFragmentAbsorber
{
    TextSearchOptions =
    {
        Rectangle = new Aspose.Pdf.Rectangle(0, (page.Rect.Height / 2), page.Rect.Width, page.Rect.Height),
        LimitToPageBounds = true
    },
};
page.Accept(textFragmentAbsorber);
var textFragmentCollection = textFragmentAbsorber.TextFragments;
// loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
}
Thanks for your patience.
We have investigated the issue and found that it is not a bug. Please note that a TextFragment in Aspose.PDF has no inherent meaning of 'word', 'line', etc. What it represents in text searching scenarios depends on the search request. A TextFragmentAbsorber created with no parameters absorbs the physical text segments (the text-showing operators in the page contents) as fragments. (See: Acrobat_text_segments.png)
Please use regular expressions such as '\S+' for absorbing words and '.+' for absorbing text lines as text fragments.
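To make the difference between the two patterns concrete, here is a minimal sketch that uses only the standard .NET Regex class and does not involve Aspose.PDF at all. The class name RegexGranularityDemo, the Extract helper, and the position of the line break in the sample title are assumptions made for illustration; how the absorber applies a pattern to the actual page text may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of Aspose.PDF.
public static class RegexGranularityDemo
{
    // Returns every match of `pattern` in `text`, in order of appearance.
    public static List<string> Extract(string text, string pattern)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(text, pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Assumed sample: the title split over two lines with '\n'.
        string text = "WEST VIRGINIA DIVISION OF BANKING\nTANGIBLE NET BENEFIT WORKSHEET";

        // \S+ matches runs of non-whitespace characters: one match per word.
        foreach (string word in Extract(text, @"\S+"))
        {
            Console.WriteLine("Word : {0}", word);
        }

        // By default '.' does not match '\n', so .+ yields one match per line.
        foreach (string line in Extract(text, @".+"))
        {
            Console.WriteLine("Line : {0}", line);
        }
    }
}
```

Applied to the absorber, the same idea would mean passing @".+" rather than @"\S+" to the TextFragmentAbsorber constructor when whole lines are wanted.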
Please consider the following code snippet:
// myDir is the folder that contains the PDF
Document pdfDocument = new Document(myDir + @"West+Virginia.pdf");
Page page = pdfDocument.Pages[1];
// \S+ absorbs each word as a separate text fragment
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"\S+");
// passing true enables regular expression search
TextSearchOptions searchOptions = new TextSearchOptions(true)
{
    Rectangle = new Aspose.Pdf.Rectangle(0, (page.Rect.Height / 2), page.Rect.Width, page.Rect.Height),
    LimitToPageBounds = true
};
textFragmentAbsorber.TextSearchOptions = searchOptions;
page.Accept(textFragmentAbsorber);
var textFragmentCollection = textFragmentAbsorber.TextFragments;
// loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
}
Please use the suggested code snippet with Aspose.PDF for .NET 19.1. If you need any further assistance, please feel free to let us know.