Free Support Forum - aspose.com

Slow check for the existence of text in PDF file

Hi!

I have run into the following problem.
I have two ways to check whether a PDF file contains any text. For a 307 KB file, the check takes from 14 to 55 seconds.

The first way is from the documentation (https://docs.aspose.com/pdf/net/find-whether-pdf-file-contains-images-or-text-only/):

        var ms = new MemoryStream();
        var extractor = new PdfExtractor();

        extractor.BindPdf(pdfStream);
        extractor.ExtractText();
        extractor.GetText(ms);

        bool containsText = ms.Length >= 1;

With this check, it takes 40 to 55 seconds.

The second way I wrote myself:

        Document pdfDocument = new Document(pdfStream);

        var textSearchOptions = new TextSearchOptions(false)
        {
            IgnoreResourceFontErrors = true,
            IgnoreShadowText = true,
            IsRegularExpressionUsed = false,
            LogTextExtractionErrors = false,
            SearchForTextRelatedGraphics = false,
            UseFontEngineEncoding = false
        };
        var textExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving);

        var textAbsorber = new TextAbsorber(textExtractionOptions, textSearchOptions);
        
        foreach (Page page in pdfDocument.Pages)
        {
            page.Accept(textAbsorber);

            if (!string.IsNullOrWhiteSpace(textAbsorber.Text))
            {
                return true;
            }
        }

        return false;

With this check, it takes 13 to 17 seconds.

Please tell me how I can reduce the time for this check. Even the best result of 13 seconds is too long.
I have attached the file used in the check to this topic.
TestPdfFile.pdf (306.1 KB)

@goshafb4

Please note that the performance of the API is measured on the basis of subsequent runs. The API loads the necessary resources into memory on the first run, which causes some delay in execution. Once the resources are loaded into memory, performance improves on the second and subsequent runs.

We tested the case with your file and noticed that the API took 13 seconds on the first run of the program. On subsequent runs, it took 3 seconds for the second approach and 5 seconds for the first approach. Please test the API as per our feedback and let us know if you still have any concerns.
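One way to reproduce the first-run versus subsequent-run comparison described above is to time several consecutive calls within the same process. The following is a minimal sketch, not Aspose code: `WarmUpTiming`, `TimeRuns`, and the stand-in `check` delegate are hypothetical names, and the delegate would need to be replaced with a call to one of the checks from the original post, opening a fresh PDF stream for each run.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class WarmUpTiming
{
    // Calls `check` `runs` times in a row and returns the elapsed
    // milliseconds of each call, in order.
    public static long[] TimeRuns(Func<bool> check, int runs)
    {
        var timings = new long[runs];
        for (int i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            check();
            stopwatch.Stop();
            timings[i] = stopwatch.ElapsedMilliseconds;
        }
        return timings;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for the real check (assumption): replace the body with
        // one of the text checks from the post, using a new stream per run.
        Func<bool> check = () =>
        {
            Thread.Sleep(10);
            return true;
        };

        long[] timings = WarmUpTiming.TimeRuns(check, 3);
        for (int i = 0; i < timings.Length; i++)
        {
            Console.WriteLine($"Run {i + 1}: {timings[i]} ms");
        }
    }
}
```

If the first timing is much larger than the rest, the extra cost is a one-time initialization in that process rather than a cost of the check itself.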

We are developing a web application. This check is part of the PDF file validation on the client side. The point is that our end user uploads the file only once, so they will not benefit from the performance gain on subsequent uploads.
Or did you mean that the resources are loaded into memory once and then shared by all requests from different users?
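If the one-time cost does turn out to be per process (the reply above suggests this but does not confirm it for a web application), one option would be to trigger it once at application startup, so that no end user's upload is the first run. A minimal sketch under that assumption follows; `WarmUp` is a hypothetical helper, and the warm-up action shown is a placeholder for running the text check once against a small sample PDF.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Runs a warm-up action at most once per process, on a background thread.
public sealed class WarmUp
{
    private readonly Lazy<Task> _task;

    public WarmUp(Action warmUpAction)
    {
        // Lazy<T> is thread-safe by default, so the action is scheduled
        // only once even if Start() is called from several requests.
        _task = new Lazy<Task>(() => Task.Run(warmUpAction));
    }

    // Starts the warm-up on first call; later calls return the same task.
    public Task Start() => _task.Value;
}

public static class WarmUpDemo
{
    public static void Main()
    {
        int runCount = 0;

        // Placeholder action (assumption): in the real application this
        // would run the PDF text check once on a small bundled file.
        var warmUp = new WarmUp(() => Interlocked.Increment(ref runCount));

        warmUp.Start();        // e.g. called during application startup
        warmUp.Start().Wait(); // a later caller waits for the same task

        Console.WriteLine($"Warm-up ran {runCount} time(s)"); // prints "Warm-up ran 1 time(s)"
    }
}
```

Whether this helps depends on how long the hosting process lives; if the process is recycled often, the first-run cost comes back after each restart.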

@goshafb4

We need to investigate further, taking into account the environment in which you are using the API. We have logged an investigation ticket, PDFNET-50402, in our issue tracking system for the whole case. We will look into its details and keep you posted on the status of the ticket's resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.