ok…
I used the TextAbsorber and the results are less than satisfactory.
Please find attached the PDF I tried this with: wicklow 4 thorntons calendar.pdf (1.3 MB)
With the TextAbsorber, the retrieved text contains things such as:
2018Let your CfrieOnLdsL kEnCowT IONabout theDA greTEatS service
Looking at this, it seems to have interleaved the visible text “2018 COLLECTION DATES” with a hidden piece of text, “Let your friends know about the great service”.
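As a quick sanity check of that guess, here is a small Python sketch. It has nothing to do with the Aspose API. The helper name is_interleaving is made up for illustration. It ignores whitespace and only tests whether the extracted string could be produced by merging the two strings character by character while keeping each one in order:

```python
def is_interleaving(merged: str, a: str, b: str) -> bool:
    """Return True if `merged` can be built by interleaving `a` and `b`
    with the characters of each kept in order. Whitespace is ignored."""
    def strip(s: str) -> str:
        return "".join(s.split())

    m, a, b = strip(merged), strip(a), strip(b)
    if len(m) != len(a) + len(b):
        return False

    # dp[i][j] is True when a[:i] and b[:j] together can produce m[:i + j]
    dp = [[False] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = True
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i > 0 and dp[i - 1][j] and a[i - 1] == m[i + j - 1]:
                dp[i][j] = True
            if j > 0 and dp[i][j - 1] and b[j - 1] == m[i + j - 1]:
                dp[i][j] = True
    return dp[len(a)][len(b)]


extracted = "2018Let your CfrieOnLdsL kEnCowT IONabout theDA greTEatS service"
visible = "2018 COLLECTION DATES"
hidden = "Let your friends know about the great service"
print(is_interleaving(extracted, visible, hidden))
```

For the sample line above this comes out as True, which is consistent with the two text runs being merged into one rather than characters being lost or corrupted.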
I tried setting ExtractionOptions.FormattingMode to Pure and TextSearchOptions.IgnoreShadowText to true, but I still get the same kind of output throughout the extracted text.
When I first convert the PDF to an Aspose.Words stream and then extract the text from that, I get the “2018 COLLECTION DATES” text correctly, but “Let your friends know about the great service” is missing from the output, even though you can search for and find it in the original PDF.
Funnily enough, we are trying to move away from an old piece of software, the DTSearch File Converter, and on this particular occasion it actually works better than either method outlined above. It extracts both “2018 COLLECTION DATES” and “Let your friends know about the great service” as two separate lines of text. But we really want to get rid of that software, as it won’t allow us to move forward with our own software.