TextFragmentAbsorber only grabbing word partials

photchkiss · November 24, 2015, 8:08am

Hi,

We are using the Aspose.PDF library to parse text from PDFs and then match against the text in the resulting text fragment collections. Most of the time, the text fragments are full lines of text, or at least most of a line. But for some of the PDFs we are working with, each text fragment contains only a letter or two. Is there a way for the text fragment absorber to recognize lines instead of small word chunks? Below is the code we use to collect text fragments, and attached is one of the PDFs that behaves this way.

To illustrate: the title in this particular PDF is “WEST VIRGINIA DIVISION OF BANKING TANGIBLE NET BENEFIT WORKSHEET”. Taking the first 15 text fragments produces “WEST VIRGINIA DIVISION OF” or something close to that. I would expect the first two lines to come back as roughly two text fragments.

public TextFragment[] CollectTextFragments(Page page)
{
    var textFragmentAbsorber = new TextFragmentAbsorber
    {
        TextSearchOptions =
        {
            Rectangle = new Rectangle(0, (page.Rect.Height / 2), page.Rect.Width, page.Rect.Height),
            LimitToPageBounds = true
        },
    };
    page.Accept(textFragmentAbsorber);

    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    var textFragColl = new TextFragment[textFragmentCollection.Count];

    textFragmentCollection.CopyTo(textFragColl, 0);

    return textFragColl;
}

Thanks so much for your help!

Phil

codewarior · November 25, 2015, 3:09am

Hi Phil,

I have tested the scenario and observed that TextFragments are being extracted as single or multiple characters instead of complete words. I have logged this problem as PDFNEWNET-39751 in our issue tracking system. We will look further into the details and keep you posted on the status of a fix. Please be patient and allow us a little time. We are sorry for the inconvenience.

[C#]

Document pdfDocument = new Document(@"c:\pdftest\West+Virginia.pdf");
Page page = pdfDocument.Pages[1];
var textFragmentAbsorber = new TextFragmentAbsorber
{
    TextSearchOptions =
    {
        Rectangle = new Aspose.Pdf.Rectangle(0, (page.Rect.Height / 2), page.Rect.Width, page.Rect.Height),
        LimitToPageBounds = true
    },
};
page.Accept(textFragmentAbsorber);
var textFragmentCollection = textFragmentAbsorber.TextFragments;
// loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
}

asad.ali · January 7, 2019, 8:48pm

@photchkiss

Thanks for your patience.

We have investigated the issue and found that it is not a bug. Please note that a TextFragment in Aspose.PDF has no inherent meaning of ‘word’, ‘line’, etc.; what it represents in a text search depends on the search request. A TextFragmentAbsorber with no parameters absorbs physical text segments (the text-showing operators in the page contents) as fragments. (See: Acrobat_text_segments.png)

Please use regular expressions such as '\S+' for absorbing words and '.+' for absorbing text lines as text fragments.
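To see what these two patterns select on their own, here is a minimal sketch that uses only the standard .NET System.Text.RegularExpressions classes, with no Aspose.PDF involved. The sample string, the line break in the middle of the title, and the PatternDemo helper are assumptions made for illustration; how Aspose.PDF applies a pattern to the text of a real page may differ in detail.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Returns the text of every match of 'pattern' found in 'input'.
    public static string[] Matches(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Hypothetical stand-in for two lines of page text.
        string text = "WEST VIRGINIA DIVISION OF BANKING\nTANGIBLE NET BENEFIT WORKSHEET";

        // \S+ : one match per run of non-whitespace characters (a word).
        string[] words = Matches(text, @"\S+");

        // .+ : one match per line, because '.' does not match '\n' by default.
        string[] lines = Matches(text, @".+");

        Console.WriteLine("Words: {0}", words.Length); // prints "Words: 9"
        Console.WriteLine("Lines: {0}", lines.Length); // prints "Lines: 2"
    }
}
```

With this sample input, '\S+' yields nine word-sized matches and '.+' yields two line-sized matches, which is the granularity difference described above.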

Please consider the following code snippet:

Document pdfDocument = new Document(myDir + @"West+Virginia.pdf");
Page page = pdfDocument.Pages[1];
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"\S+");
TextSearchOptions searchOptions = new TextSearchOptions(true)
{
    Rectangle = new Aspose.Pdf.Rectangle(0, (page.Rect.Height / 2), page.Rect.Width, page.Rect.Height),
    LimitToPageBounds = true
};
textFragmentAbsorber.TextSearchOptions = searchOptions;
page.Accept(textFragmentAbsorber);
var textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
}

Please use the suggested code snippet with Aspose.PDF for .NET 19.1, and if you need any further assistance, please feel free to let us know.
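Since the original question asked for line-level fragments, the same approach could be folded back into the CollectTextFragments method from the first post. The following is only a sketch under assumptions: the FragmentCollector class name and the added pattern parameter are hypothetical, and it uses only the Aspose.PDF calls that already appear in this thread. It has not been verified against the attached PDF.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class FragmentCollector
{
    // Collects fragments from the upper half of the page using a regular expression.
    // Pass @".+" for line-level fragments or @"\S+" for word-level fragments.
    public static TextFragment[] CollectTextFragments(Page page, string pattern)
    {
        var absorber = new TextFragmentAbsorber(pattern)
        {
            // 'true' asks the absorber to treat the pattern as a regular expression.
            TextSearchOptions = new TextSearchOptions(true)
            {
                Rectangle = new Aspose.Pdf.Rectangle(0, page.Rect.Height / 2, page.Rect.Width, page.Rect.Height),
                LimitToPageBounds = true
            }
        };
        page.Accept(absorber);

        // Copy the collection into an array, as in the original method.
        var fragments = new TextFragment[absorber.TextFragments.Count];
        absorber.TextFragments.CopyTo(fragments, 0);
        return fragments;
    }
}
```

A call such as FragmentCollector.CollectTextFragments(pdfDocument.Pages[1], @".+") would then be expected to return one fragment per text line in the search rectangle.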