Hi!
I ran into the following problem.
I have two ways to check whether a PDF file contains any text. For a 307 KB file, the check takes between 13 and 55 seconds.
The first way is from the documentation (https://docs.aspose.com/pdf/net/find-whether-pdf-file-contains-images-or-text-only/):
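// pdfStream is the input PDF, opened elsewhere.
var ms = new MemoryStream();
var extractor = new PdfExtractor();
extractor.BindPdf(pdfStream);
extractor.ExtractText();
extractor.GetText(ms);
// Treat any extracted bytes as evidence that the file contains text.
bool containsText = ms.Length >= 1;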
This check takes 40 to 55 seconds.
The second way is one I wrote myself:
Document pdfDocument = new Document(pdfStream);
var textSearchOptions = new TextSearchOptions(false)
{
    IgnoreResourceFontErrors = true,
    IgnoreShadowText = true,
    IsRegularExpressionUsed = false,
    LogTextExtractionErrors = false,
    SearchForTextRelatedGraphics = false,
    UseFontEngineEncoding = false
};
var textExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving);
var textAbsorber = new TextAbsorber(textExtractionOptions, textSearchOptions);
foreach (Page page in pdfDocument.Pages)
{
    page.Accept(textAbsorber);
    if (!string.IsNullOrWhiteSpace(textAbsorber.Text))
    {
        return true;
    }
}
return false;
This check takes 13 to 17 seconds.
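For anyone who wants to reproduce timings like these, a Stopwatch wrapper along the following lines should be enough. This is only a minimal sketch: `CheckTimer` and `Measure` are names made up for illustration, and the placeholder lambda stands in for a call to either check from this post.

```csharp
using System;
using System.Diagnostics;

static class CheckTimer
{
    // Hypothetical helper: runs a check once and prints how long it took.
    public static bool Measure(string label, Func<bool> check)
    {
        var stopwatch = Stopwatch.StartNew();
        bool result = check();
        stopwatch.Stop();
        Console.WriteLine($"{label}: result={result}, elapsed={stopwatch.Elapsed.TotalSeconds:F1} s");
        return result;
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder check; swap in a call to either text check from this post.
        CheckTimer.Measure("placeholder", () => true);
    }
}
```

If the same seekable stream is reused for both checks, its position probably needs to be reset to 0 between runs so that each check reads the file from the beginning.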
How can I reduce the time this check takes? Even the best result of 13 seconds is too long.
I have attached the file used in the check to this topic.
TestPdfFile.pdf (306.1 KB)