SaveFormat Text?

rozeboosje · November 28, 2019, 5:59pm

There is no Aspose.Pdf.SaveFormat.Text. In order to extract the text out of a PDF I have to first use Aspose.PDF to save with Aspose.Pdf.SaveFormat.DocX and then use Aspose.Words to save with Aspose.Words.SaveFormat.Text …

While this works, it’s not exactly efficient. Any plans to add “Text” to Aspose.Pdf.SaveFormat?

asad.ali · November 28, 2019, 7:12pm

@rozeboosje

You can use the TextAbsorber class to extract all text from a PDF and save it as a .txt file. Please check the following code snippet:

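// Load the source PDF (dataDir is the folder that holds it)
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(dataDir + "test.pdf");
// Collect the text from every page of the document
Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
// Write the extracted text to a .txt file
System.IO.File.WriteAllText(dataDir + "testPDF.txt", textAbsorber.Text);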

rozeboosje · November 29, 2019, 12:22pm

ok…

I used the TextAbsorber and the results are less than satisfactory.

Find attached the PDF I tried this with: wicklow 4 thorntons calendar.pdf (1.3 MB)

With the TextAbsorber, the retrieved text contains things such as:

2018Let your CfrieOnLdsL kEnCowT IONabout theDA greTEatS service

Looking at this, it seems that it has mashed up the visible text “2018 COLLECTION DATES” with a hidden piece of text, “Let your friends know about the great service”.

I tried setting TextExtractionOptions.TextFormattingMode to Pure and TextSearchOptions.IgnoreShadowText to true, but I still get the same kind of output throughout the extracted text.

When I first convert to an Aspose.Words stream and then extract the text from there, I get the “2018 COLLECTION DATES” text all right, but “Let your friends know about the great service” is not included in the output, even though you can search for and find it in the original PDF.

Funnily enough, we are trying to move away from an old piece of software, the DTSearch File Converter, and on this particular occasion it actually works better than either method outlined above. It extracts both “2018 COLLECTION DATES” and “Let your friends know about the great service” as two separate lines of text. But we really want to get rid of that software, as it won’t allow us to move forward with our software.

rozeboosje · November 29, 2019, 12:26pm

Ahhhhhhhh never mind… “Pure” is exactly what I don’t want. When I change it to “Raw” it works perfectly.
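
For anyone landing here later, a minimal sketch of what the Raw-mode setup might look like. It assumes the formatting mode is passed through a TextExtractionOptions object given to the TextAbsorber constructor; the file names are placeholders and not the attachment from this thread, so treat it as an illustration and not as the exact code used above.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractRawText
{
    static void Main()
    {
        // Placeholder input path (an assumption); substitute your own PDF.
        Document pdfDocument = new Document("test.pdf");

        // Request Raw formatting instead of Pure, so the absorber does not
        // try to reproduce the page layout in the extracted text.
        TextExtractionOptions options =
            new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
        TextAbsorber textAbsorber = new TextAbsorber(options);

        // Visit every page and collect its text.
        pdfDocument.Pages.Accept(textAbsorber);

        // Placeholder output path (an assumption).
        File.WriteAllText("testPDF.txt", textAbsorber.Text);
    }
}
```

Whether Raw or Pure gives the better result probably depends on the document: in this thread, Pure interleaved overlapping visible and hidden text, while Raw kept the two strings separate.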

asad.ali · November 29, 2019, 7:09pm

@rozeboosje

It is good to know that your issue has been resolved. Please keep using our API and in case you face any issue, please feel free to let us know.