TextFragmentAbsorber

dansshin · July 20, 2015, 5:22pm

Hi,

I would like to use our support contract to get some assistance with an issue I posted on your forum. Could you please look at the link below and let me know how this could be addressed so that all pages are searched when doing a regex search?

tilal.ahmad · July 21, 2015, 2:31am

Hi there,

We are sorry for the delayed response. Please follow the issue in the original thread, "TextFragmentAbsorber using Regular Expression not spanning multiple pages". We have replicated the issue and logged it for further investigation. We will keep you updated on the progress of its resolution.

Best Regards,

dansshin · August 20, 2015, 10:42am

Is there any update on this bug?

tilal.ahmad · August 21, 2015, 2:19am

Hi there,

Thanks for your inquiry. I am afraid the issue you reported is not yet resolved, as our product team is busy resolving other issues that were reported earlier and are ahead of it in the queue. We will notify you as soon as we make significant progress towards its resolution.

Thanks for your patience and cooperation.

Best Regards,

tilal.ahmad · January 12, 2016, 9:34am

Hi there,

Thanks for your patience. We have investigated the issue and found no bugs in TextFragmentAbsorber.

Please take the following facts into account:

1. TextFragmentAbsorber processes the document page by page. A TextFragment is an object that belongs to a specific page, so it cannot contain segments from different pages. There are serious reasons for this architecture; it is related to limitations of the PDF structure.
2. TextFragmentAbsorber is mainly designed for accessing small parts (fragments) of text, such as words and phrases, and for editing their properties. We do not recommend using TextFragmentAbsorber to extract large amounts of text; TextAbsorber is preferable for that. The reason is that the PDF structure allows segments to appear in arbitrary order in the document, and this order may not match the order in which a human reads the text.
3. TextAbsorber is designed to extract text from the document as a whole, and it can extract text in reading order using the TextFormattingMode.Pure option.

So, if you are interested in extracting large parts of text and have no need to edit it, the best choice is TextAbsorber rather than TextFragmentAbsorber.
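To illustrate why a page-by-page search cannot find a match that starts on one page and ends on the next, here is a minimal sketch that uses only plain .NET strings. It does not call Aspose.Pdf and is not how TextFragmentAbsorber is implemented; the class name, method names and the two "page" strings are hypothetical and invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageBoundaryDemo
{
    // Returns true if the pattern matches inside any single page on its own.
    public static bool MatchesPerPage(string[] pages, string pattern)
    {
        foreach (string page in pages)
        {
            if (Regex.IsMatch(page, pattern))
            {
                return true;
            }
        }
        return false;
    }

    // Returns true if the pattern matches the text of all pages joined together.
    public static bool MatchesWholeText(string[] pages, string pattern)
    {
        string wholeText = string.Join("\n", pages);
        return Regex.IsMatch(wholeText, pattern);
    }

    public static void Main()
    {
        // Hypothetical page contents: the "Directions" section starts on
        // the first page and its closing "Tip" appears on the second page.
        string[] pages =
        {
            "Directions\nHeat oven to 350°F. Combine flour and",
            "sugar in a bowl.\nTip: serve warm."
        };
        string pattern = @"Directions[\s\S]*Tip";

        Console.WriteLine(MatchesPerPage(pages, pattern));   // False
        Console.WriteLine(MatchesWholeText(pages, pattern)); // True
    }
}
```

Neither page contains both "Directions" and "Tip", so the per-page search finds nothing, while the search over the joined text succeeds. This is the same reason for extracting the whole document text first and applying the regex afterwards.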

Please consider the following code snippet. It is based on TextAbsorber and does the work you need. This approach allows you to extract large parts of text using regex patterns, even across page boundaries.

// Assumes "using System;" and "using System.Text.RegularExpressions;" plus the
// Aspose.Pdf namespaces that define Document, TextAbsorber and TextExtractionOptions.
// myDir is the folder that contains the input file.
Document pdfDocument = new Document(myDir + "HappyClown.pdf");

// Extract the text of the whole document in reading order (Pure mode)
TextExtractionOptions options = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
TextAbsorber absorber = new TextAbsorber(options);
pdfDocument.Pages.Accept(absorber);
string extractedText = absorber.Text;
//File.WriteAllText(myDir + "AllExtractedText_pure.txt", absorber.Text);

// For each pattern below, print the first match found in the extracted text
Regex rgxRecipeTitle = new Regex("[0-9a-zA-Z ]* [|] Land O'Lakes");
MatchCollection matchesRecipeTitle = rgxRecipeTitle.Matches(extractedText);
if (matchesRecipeTitle.Count > 0)
{
    Console.WriteLine(matchesRecipeTitle[0].Value);
}

Regex rgxRecipeNumber = new Regex("Recipe #\\d{5}([a-zA-Z])?");
MatchCollection matchesRecipeNumber = rgxRecipeNumber.Matches(extractedText);
if (matchesRecipeNumber.Count > 0)
{
    Console.WriteLine(matchesRecipeNumber[0].Value);
}

// This pattern was modified to allow the ' (U+2019) symbol and to stop at the word "Tip"
Regex rgxDirections = new Regex("Directions(\r\n|\r|\n)[0-9a-zA-Z(\r\n|\r|\n)°@#$%&*+\\-_(),+':;?.,!\\[\\]\\s\\/ è]*Tip");
MatchCollection matchesDirections = rgxDirections.Matches(extractedText);
if (matchesDirections.Count > 0)
{
    Console.WriteLine(matchesDirections[0].Value);
}

Regex rgxNutritionFacts = new Regex("Nutrition Facts [(]([0-9a-zA-Z ]*)?[)]");
MatchCollection matchesNutritionFacts = rgxNutritionFacts.Matches(extractedText);
if (matchesNutritionFacts.Count > 0)
{
    Console.WriteLine(matchesNutritionFacts[0].Value);
}

Regex rgxCalories = new Regex("Calories: [0-9]*");
MatchCollection matchesCalories = rgxCalories.Matches(extractedText);
if (matchesCalories.Count > 0)
{
    Console.WriteLine(matchesCalories[0].Value);
}

Regex rgxCholesterol = new Regex("Cholesterol: [0-9]*mg");
MatchCollection matchesCholesterol = rgxCholesterol.Matches(extractedText);
if (matchesCholesterol.Count > 0)
{
    Console.WriteLine(matchesCholesterol[0].Value);
}

Regex rgxCarbohydrates = new Regex("Carbohydrates: [0-9]*g");
MatchCollection matchesCarbohydrates = rgxCarbohydrates.Matches(extractedText);
if (matchesCarbohydrates.Count > 0)
{
    Console.WriteLine(matchesCarbohydrates[0].Value);
}

Regex rgxProtein = new Regex("Protein: [0-9]*g");
MatchCollection matchesProtein = rgxProtein.Matches(extractedText);
if (matchesProtein.Count > 0)
{
    Console.WriteLine(matchesProtein[0].Value);
}

Regex rgxFat = new Regex("Fat: [0-9]*g");
MatchCollection matchesFat = rgxFat.Matches(extractedText);
if (matchesFat.Count > 0)
{
    Console.WriteLine(matchesFat[0].Value);
}

Regex rgxSodium = new Regex("Sodium: [0-9]*mg");
MatchCollection matchesSodium = rgxSodium.Matches(extractedText);
if (matchesSodium.Count > 0)
{
    Console.WriteLine(matchesSodium[0].Value);
}

Regex rgxDietaryFiber = new Regex("Dietary Fiber: [0-9]*g");
MatchCollection matchesDietaryFiber = rgxDietaryFiber.Matches(extractedText);
if (matchesDietaryFiber.Count > 0)
{
    Console.WriteLine(matchesDietaryFiber[0].Value);
}
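
The snippet above repeats the same "find the first match and print it" step for every pattern. If that repetition is unwanted, it could be folded into a small helper, sketched below under stated assumptions: FirstMatchDemo and FirstMatchOrNull are hypothetical names, and the sample string only stands in for absorber.Text (it is invented, not taken from HappyClown.pdf). The sketch uses only System.Text.RegularExpressions, so it runs without Aspose.Pdf.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstMatchDemo
{
    // Returns the first match of the pattern in the text, or null when there is none.
    public static string FirstMatchOrNull(string text, string pattern)
    {
        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        // Invented sample standing in for the text returned by absorber.Text.
        string extractedText = "Recipe #12345a\nCalories: 250\nFat: 12g\nSodium: 300mg";

        // A few of the patterns from the snippet above, applied in a loop.
        string[] patterns =
        {
            "Recipe #\\d{5}([a-zA-Z])?",
            "Calories: [0-9]*",
            "Fat: [0-9]*g",
            "Sodium: [0-9]*mg",
            "Protein: [0-9]*g" // no match in the sample, so nothing is printed for it
        };

        foreach (string pattern in patterns)
        {
            string value = FirstMatchOrNull(extractedText, pattern);
            if (value != null)
            {
                Console.WriteLine(value);
            }
        }
    }
}
```

With a real document, extractedText would be assigned from absorber.Text exactly as in the snippet above, and the remaining patterns could be added to the array.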

AllExtractedText_pure.txt - all text of the document extracted by TextAbsorber with TextFormattingMode.Pure
OutExtractedText_pure.txt - parts of text (pure mode) filtered using regex patterns
AllExtractedText_raw.txt - all text of the document extracted by TextAbsorber with TextFormattingMode.Raw
OutExtractedText_raw.txt - parts of text (raw mode) filtered using regex patterns

Please feel free to contact us for any further assistance.

Best Regards,