Converting 1-bit PNGs to PDFs slows down under sustained load and scales poorly

I’m trying to create a system for bulk-converting 1-bit PNGs to single-page PDFs. I used this page as a reference, and I’m passing ImageFilterType.CCITTFax to the page.Resources.Images.Add() method. (This is the only way I found that produces reasonably sized PDFs; setting IsBlackWhite = true was not enough.)

I created a test that loads 60 pages into memory, converts one page (to warm up the logic), then measures the conversion of those 60 pages 8 times, for a total of 480 conversions per run. (These numbers are from a VM with 8 dedicated CPU cores and hyperthreading disabled.)

1 thread: 3.7 pages/sec, 129.8 sec CPU time, 129.8 sec wall time
2 threads: 3.4 pages/sec/thread, 140.9 sec CPU time, 70.2 sec wall time
3 threads: 2.8 pages/sec/thread, 171.4 sec CPU time, 57.4 sec wall time
4 threads: 2.1 pages/sec/thread, 226.2 sec CPU time, 56.9 sec wall time
5 threads: 1.7 pages/sec/thread, 275.1 sec CPU time, 55.3 sec wall time
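
To make the scaling easier to read, the wall times above can be converted into aggregate throughput and speedup relative to the single-thread run. This is only a minimal sketch: the variable names are illustrative, and the figures are copied from the results above (480 conversions per run).

```csharp
using System;

// Wall-clock times in seconds for 1..5 threads, copied from the results above.
const int conversions = 480;
double[] wallSeconds = { 129.8, 70.2, 57.4, 56.9, 55.3 };

for (int i = 0; i < wallSeconds.Length; i++)
{
    int threads = i + 1;
    double aggregate = conversions / wallSeconds[i];   // pages/sec across all threads
    double speedup = wallSeconds[0] / wallSeconds[i];  // relative to the 1-thread run
    double efficiency = speedup / threads;             // 1.0 would be perfect scaling

    Console.WriteLine(
        $"{threads} threads: {aggregate:0.0} pages/sec total, " +
        $"{speedup:0.00}x speedup, {efficiency * 100:0}% efficiency");
}
```

On these numbers the speedup levels off at roughly 2.3x from 3 threads onward, even though 8 cores are available.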

Observations:

  • The rapidly increasing CPU time (the sum of per-conversion elapsed times, as recorded by BenchmarkItem) shows that there is extremely high overhead when adding more threads.
  • While the test is running, the process CPU usage starts out somewhat in proportion to the number of threads, but quickly goes down to about 25% max (except for 1 thread, which naturally stays at 12%).
  • Scaling beyond 3 threads is minimal. Given the previous observation, I suspect that over a longer run even 3 threads would show no scaling benefit over 2.
  • I ruled out my test logic and virtual environment as a source of error by replacing the PDF conversion with a loop using n = Math.Log(n+1) to burn CPU, and it scales and sustains CPU usage as expected.
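
For reference, a control workload of the kind described in the last observation might look like the sketch below. It is not the original test code: BurnCpu, the thread count, and the iteration counts are all assumed, illustrative values. It also reads Process.TotalProcessorTime, so the actual CPU time consumed by the process can be compared with wall time.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for one conversion: pure CPU work with no shared state.
static double BurnCpu(int iterations)
{
    double n = 1;
    for (int i = 0; i < iterations; i++)
        n = Math.Log(n + 1);
    return n;
}

const int threads = 4;             // assumed value; vary it between runs
const int itemsPerThread = 10;     // assumed value
const int iterations = 2_000_000;  // assumed value; tune so one item takes measurable time

double[] results = new double[threads]; // keeps the computed values observable

var process = Process.GetCurrentProcess();
TimeSpan cpuBefore = process.TotalProcessorTime;
var sw = Stopwatch.StartNew();

// One long-running task per worker, each processing its own items.
Task[] workers = Enumerable.Range(0, threads)
    .Select(worker => Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < itemsPerThread; i++)
            results[worker] += BurnCpu(iterations);
    }, TaskCreationOptions.LongRunning))
    .ToArray();
Task.WaitAll(workers);

sw.Stop();
process.Refresh(); // discard any cached process information before reading again
TimeSpan cpuUsed = process.TotalProcessorTime - cpuBefore;

Console.WriteLine($"Wall time: {sw.Elapsed.TotalSeconds:0.000} s");
Console.WriteLine($"Process CPU time: {cpuUsed.TotalSeconds:0.000} s");
Console.WriteLine($"Average busy cores: {cpuUsed.TotalSeconds / sw.Elapsed.TotalSeconds:0.00}");
```

If the average number of busy cores stays close to the thread count, the workers are really running in parallel. A value well below the thread count would suggest the threads spend much of their time waiting rather than computing.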

Here’s the code I’m using. I don’t have a complete working sample project for you because the images contain protected information. The PNGs are 1-bit, approximately 200 DPI (about 2500x3400 pixels), 25-124 KB each, and 4 MB in total.

using System.Diagnostics;
using System.Threading.Tasks.Dataflow;
using Pdf = Aspose.Pdf;

new Pdf.License().SetLicense("...");

BenchmarkItem conversion = new() { Name = "Convert PNG to PDF" };

var pngsBytes = Repeat(OpenPngs(60).ToArray(), 60 * 8);

var imageFilterType = Pdf.ImageFilterType.CCITTFax;

//warm up logic
{
    using MemoryStream imageStream = new(pngsBytes.First());
    using var pdfStream = StreamToPdf_Operators(imageStream, imageFilterType);
}

const int threads = 5;

var sw = Stopwatch.StartNew();

//using ActionBlock instead of Parallel.ForEach/Async() because ActionBlock starts all threads immediately, Parallel.ForEach* slowly spins up threads
ActionBlock<byte[]> actionBlock = new((pngBytes) =>
{
    using MemoryStream imageStream = new(pngBytes);
    conversion.Run(() =>
    {
        using var pdfStream = StreamToPdf_Operators(imageStream, imageFilterType);
    });
},
new ExecutionDataflowBlockOptions()
{
    BoundedCapacity = threads,
    MaxDegreeOfParallelism = threads,
    EnsureOrdered = false
});

foreach (var pngBytes in pngsBytes)
{
    await actionBlock.SendAsync(pngBytes);
}

actionBlock.Complete();
await actionBlock.Completion;

sw.Stop();

conversion.Print();

Console.WriteLine($"Wall clock time: {sw.Elapsed.TotalSeconds:0.000} seconds");

return;

static IEnumerable<byte[]> OpenPngs(int numPages)
{
    var fileNames = Directory.GetFiles(@"Assets\multipage png", "*.png");
    for (var i = 0; i < numPages; i++)
    {
        var pageNum = i % 60;
        var fileName = fileNames[pageNum];
        var bytes = File.ReadAllBytes(fileName);
        yield return bytes;
    }
}

static IEnumerable<T> Repeat<T>(IEnumerable<T> source, int count)
{
    int i = 0;
    while (true)
    {
        foreach (var item in source)
        {
            if (i >= count)
                yield break;

            yield return item;
            i++;
        }
    }
}

static Stream StreamToPdf_Operators(Stream imageStream, Pdf.ImageFilterType imageFilterType)
{
    //reference: https://docs.aspose.com/pdf/net/add-image-to-existing-pdf-file/

    using var document = new Pdf.Document();

    using var page = document.Pages.Add();

    // Add image to the page's resource collection
    page.Resources.Images.Add(imageStream, imageFilterType);
    Pdf.XImage ximage = page.Resources.Images[1];

    // Set page size
    (double widthPoints, double heightPoints) = GuessSizeInPoints(ximage.Width, ximage.Height);
    page.SetPageSize(widthPoints, heightPoints);

    // Set page margin
    page.PageInfo.Margin = new Pdf.MarginInfo(0, 0, 0, 0);

    // Using GSave operator: this operator saves current graphics state
    page.Contents.Add(new Pdf.Operators.GSave());

    // Create a Matrix that scales the image to fill the page
    Pdf.Matrix matrix = new(new double[] { widthPoints, 0, 0, heightPoints, 0, 0 });

    // Using ConcatenateMatrix (concatenate matrix) operator: defines how image must be placed
    page.Contents.Add(new Pdf.Operators.ConcatenateMatrix(matrix));

    // Using Do operator: this operator draws image
    page.Contents.Add(new Pdf.Operators.Do(ximage.Name));

    // Using GRestore operator: this operator restores graphics state
    page.Contents.Add(new Pdf.Operators.GRestore());


    MemoryStream outStream = new();

    try
    {
        document.Save(outStream);

        //imageStream.Position = 0;
        outStream.Position = 0;

        return outStream;
    }
    catch
    {
        imageStream.Position = 0;

        outStream.Dispose();
        throw;
    }
}

static (double, double) GuessSizeInPoints(int widthPixels, int heightPixels)
{
    //assume that the smaller dimension is 8.5"
    double scale = 8.5 * 72 / Math.Min(widthPixels, heightPixels);
    return (widthPixels * scale, heightPixels * scale);
}

class BenchmarkItem
{
    private readonly object guard = new();

    public string Name { get; set; } = "";

    public int Count { get; private set; }
    public double TimeTaken { get; private set; }

    public void Run(Action action)
    {
        var sw = Stopwatch.StartNew();

        action();

        var runTime = sw.Elapsed.TotalSeconds;

        lock (this.guard)
        {
            this.Count++;
            this.TimeTaken += runTime;
        }

        //Console.WriteLine($"{this.Name} took {runTime:0.000} seconds");
    }

    public async Task RunAsync(Func<Task> action)
    {
        var sw = Stopwatch.StartNew();

        await action();

        var runTime = sw.Elapsed.TotalSeconds;

        lock (this.guard)
        {
            this.Count++;
            this.TimeTaken += runTime;
        }
    }

    public void Print()
    {
        Console.WriteLine($"=== {this.Name} ===");
        Console.WriteLine($"Total CPU time taken: {this.TimeTaken:0.000} seconds");
        Console.WriteLine($"Count: {this.Count}");
        Console.WriteLine($"Throughput: {this.Count / this.TimeTaken:0.000} per second");
    }
}

@CorvelAspose

Please try the code snippet below in your program and let us know whether you notice any improvement or nothing changes:

// Assumes "using Aspose.Pdf;" and a string dataDir that points to the working folder
Document doc = new Document();
// Add a page to the document's pages collection
Page page = doc.Pages.Add();
// Load the source image file into memory
byte[] tmpBytes = File.ReadAllBytes(dataDir + "input.tif");

MemoryStream mystream = new MemoryStream(tmpBytes);
// Instantiate a Bitmap object from the loaded image stream (used here only to read the dimensions)
System.Drawing.Bitmap b = new System.Drawing.Bitmap(mystream);

// Set margins to zero so the image fills the page
page.PageInfo.Margin.Bottom = 0;
page.PageInfo.Margin.Top = 0;
page.PageInfo.Margin.Left = 0;
page.PageInfo.Margin.Right = 0;

page.CropBox = new Aspose.Pdf.Rectangle(0, 0, b.Width, b.Height);
// Create an image object
Aspose.Pdf.Image image1 = new Aspose.Pdf.Image();
// Add the image to the page's paragraphs collection
page.Paragraphs.Add(image1);
// Set the image stream
image1.ImageStream = mystream;
// Save the resultant PDF file
doc.Save(dataDir + "ImageToPDF_out.pdf");
// Close the MemoryStream object
mystream.Close();

First, a correction from a previous statement I made: setting Aspose.Pdf.Image image = ...; image.IsBlackWhite = true; DOES produce good file sizes on Aspose.Pdf 22.11. (Maybe a previous version of the library was not handling this well.)

I went even simpler than your suggestion and used the following as the core conversion logic. It is slightly slower, though, and still has the same thread-scaling issue.

static Stream StreamToPdf_BasicWithWrongSize(Stream imageStream)
{
    using var document = new Pdf.Document();
    var page = document.Pages.Add();
    page.SetPageSize(8.5 * 72, 11 * 72);
    page.PageInfo.Margin = new Pdf.MarginInfo(0, 0, 0, 0);

    var img = new Pdf.Image()
    {
        ImageStream = imageStream,
        IsBlackWhite = true, //also tested with this line commented out; performance is the same
        FixWidth = 8.5 * 72,
        FixHeight = 11 * 72
    };

    page.Paragraphs.Add(img);

    MemoryStream outStream = new();

    try
    {
        document.Save(outStream);
        outStream.Position = 0;
        return outStream;
    }
    catch
    {
        outStream.Dispose();
        throw;
    }
}

1 thread: 3.6 pages/sec, 134.6 sec CPU time, 134.6 sec wall time
2 threads: 3.1 pages/sec/thread, 152.1 sec CPU time, 76.2 sec wall time
3 threads: 2.5 pages/sec/thread, 190.5 sec CPU time, 63.8 sec wall time
4 threads: 2.0 pages/sec/thread, 246.0 sec CPU time, 61.8 sec wall time
5 threads: 1.6 pages/sec/thread, 307.0 sec CPU time, 61.7 sec wall time

@CorvelAspose

Can you please share some sample image files for our reference as well? We will test the scenario in our environment and address it accordingly.

I only have one sample image that does not contain protected information, but it should be enough to test with anyway. test.png (100.1 KB)

@CorvelAspose

We are checking it and will get back to you shortly.

Has any progress been made on this issue?

@CorvelAspose

We are afraid that the investigation of this case has not been completed yet. The ticket for this case, PDFNET-54041, has been logged in our issue management system, and as soon as we make some progress towards its resolution, we will inform you in this forum thread. Please give us some time.

We apologize for the inconvenience.