Getting text from PDF using TextAbsorber

Evgeniy991 · February 27, 2024, 6:20am

Hi!

We are trying to extract text from a PDF file and found that 4.5-5 GB of RAM is allocated for a 15 MB file. Please help us find the cause of this behavior and reduce RAM consumption.

We use the following code snippet:

    public static string GetContent(byte[] content)
    {
        using (MemoryStream input = new MemoryStream(content))
        using (Document document = new Document(input))
        {
            TextAbsorber textAbsorber = new TextAbsorber();
            document.Pages.Accept(textAbsorber);
            return textAbsorber.Text;
        }
    }

test.pdf (14.7 MB)

andriy.andrukhovski · February 27, 2024, 5:20pm

Hi, @Evgeniy991!

Could you please try text extraction with the following options:

    var extractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving);
    var textAbsorber = new TextAbsorber(extractionOptions);
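
For reference, here is a minimal sketch of how these options might slot into the `GetContent` method from the original post. The `PdfTextExtractor` class name is only a placeholder for illustration, and the `using` directives assume the standard Aspose.PDF for .NET namespaces. The layout of the extracted text may differ from the default mode, so it is worth comparing the output on a sample file.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical wrapper class; only the method body reflects the thread.
public static class PdfTextExtractor
{
    public static string GetContent(byte[] content)
    {
        using (MemoryStream input = new MemoryStream(content))
        using (Document document = new Document(input))
        {
            // Request the memory-saving formatting mode instead of the default.
            var extractionOptions = new TextExtractionOptions(
                TextExtractionOptions.TextFormattingMode.MemorySaving);

            // Pass the options to the absorber before visiting the pages.
            var textAbsorber = new TextAbsorber(extractionOptions);
            document.Pages.Accept(textAbsorber);

            return textAbsorber.Text;
        }
    }
}
```

Only the absorber construction changes; the streams and the `Document` are still disposed by the `using` statements as before.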

Evgeniy991 · February 28, 2024, 6:34am

Thank you very much! This is a working solution.