PDF.Save() Query

gareth064 · June 8, 2020, 12:40pm

Hi

I am using the .NET PDF library to stitch a variable number of JPGs into a single PDF and save that PDF into a single directory.

What I have noticed is that when running this against a data set which will produce 93k PDFs, the number of PDFs created every 10 minutes drops steadily as time goes on.

At the start of the process it produces roughly 900-1000 PDFs every 10 minutes. Once there are around 40k PDFs in the target directory, that falls to only 150-250 PDFs every 10 minutes.

The PDF sizes do not grow as time goes on; they just vary depending on the number of JPGs being stitched together, which is anything from 1 to 50. So nothing huge.

CPU usage and RAM usage are pretty stable throughout, and the disks we are writing to are really fast datacenter-grade SSDs with a very high IOPS ceiling that is nowhere near being maxed out at any time.

It's all running on one thread, so as it loops to create the PDFs it does so one at a time.

So the only other angle that I can think of is that the PDF saving process slows down due to the number of files in the save target directory.

Can anyone tell me if the Save method does a directory scan of the directory first before saving to it?
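One way to test this hypothesis independently of the PDF library is to write plain files into a directory and watch whether the time per batch rises as the directory fills up. The following is only a minimal sketch: the class name `DirectoryWriteBenchmark`, the file naming scheme, and the parameters are hypothetical and not part of the original application.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class DirectoryWriteBenchmark
{
    // Writes fileCount files of fileSize bytes into dir and returns the
    // elapsed milliseconds spent writing each batch of batchSize files.
    public static long[] Run(string dir, int fileCount, int batchSize, int fileSize)
    {
        Directory.CreateDirectory(dir);
        byte[] payload = new byte[fileSize];
        var batchTimes = new long[(fileCount + batchSize - 1) / batchSize];
        var sw = new Stopwatch();

        for (int i = 0; i < fileCount; i++)
        {
            // Only the write itself is timed.
            sw.Start();
            File.WriteAllBytes(Path.Combine(dir, $"bench_{i:D6}.bin"), payload);
            sw.Stop();

            // Record and reset at the end of each batch (and at the very end).
            if ((i + 1) % batchSize == 0 || i == fileCount - 1)
            {
                batchTimes[i / batchSize] = sw.ElapsedMilliseconds;
                sw.Reset();
            }
        }
        return batchTimes;
    }
}
```

If the batch times stay flat up to tens of thousands of files, the directory size is unlikely to be the cause. If they climb in the same way as the PDF throughput falls, the slowdown probably sits in the file system or in something watching it rather than in the library.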

Here is my code which does the stitching for a single PDF. It is called within a loop higher up passing in each set of JPGs one at a time.

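using System;
using System.Diagnostics;
using System.Drawing;
using System.IO;
using System.Threading;
using Aspose.Pdf;

public class ConvertNewDocumentToPDF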
{
    private readonly NewDocument newDocument;
    private readonly Logger log;


    public ConvertNewDocumentToPDF(ConvertThreadInfo info)
    {
        this.newDocument = info.Doc;
        this.log = info.Log;

    }

    public NewDocument Convert()
    {
        Stopwatch timer = new Stopwatch();
        Document pdf = new Document();

        foreach (NewDocumentImage image in newDocument.Images)
        {
            if (!IsValidGDIPlusImage(image.SourcePath))
            {
                log.LogWrite($"Corrupt JPG Found & replaced: { image.SourcePath }");
                image.SourcePath = $@"{ Environment.CurrentDirectory }\CorruptImage.jpg";
                newDocument.HadCorruptJpgs = Enums.Yes;
            }


            Aspose.Pdf.Image img = new Aspose.Pdf.Image
            {
                File = image.SourcePath
            };

            var page = pdf.Pages.Add();
            page.PageInfo.Margin.Bottom = 0;
            page.PageInfo.Margin.Top = 0;
            page.PageInfo.Margin.Left = 0;
            page.PageInfo.Margin.Right = 0;

            page.Paragraphs.Add(img);

        }
        
        string saveDir = newDocument.ExportPath;

        try
        {
            timer.Start();

            pdf.Save(saveDir);

            timer.Stop();
            if (timer.ElapsedMilliseconds > 2000)
            {
                TimeSpan t = TimeSpan.FromMilliseconds(timer.ElapsedMilliseconds);
                string timeToComplete = string.Format("{0:D2}h:{1:D2}m:{2:D2}s:{3:D3}ms",
                                        t.Hours,
                                        t.Minutes,
                                        t.Seconds,
                                        t.Milliseconds);
                FileInfo fi = new FileInfo(saveDir);
                Process currentProcess = Process.GetCurrentProcess();

                System.Diagnostics.ProcessThreadCollection myThreads = currentProcess.Threads;
                log.LogWrite($"Document {newDocument.DocNo} took {timeToComplete} to write {fi.Length} to disk. Number of CPUs: {Environment.ProcessorCount}. Using {myThreads.Count} threads. Thread: {Thread.CurrentThread.ManagedThreadId}");
            }

            newDocument.PageCountNew = pdf.Pages.Count;

            return newDocument;
        }
        catch (Exception e)
        {
            timer.Stop();
            log.LogWrite(e.Message);
            throw; // rethrow without resetting the stack trace
        }

    }

    private bool IsValidGDIPlusImage(string filename)
    {
        try
        {
            using (var bmp = new Bitmap(filename))
            {
            }
            return true;
        }
        catch (Exception)
        {
            return false;
        }
    }



}

And here is a screenshot of some diagnostics which shows that, on average, the most expensive call in the code is the Save method.

image.png (57.2 KB)

asad.ali · June 8, 2020, 8:41pm

@gareth064

Thanks for your inquiry.

No, the API does not scan the target directory before or while saving the PDF document. It simply creates a file stream and writes to it.
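To see whether the time goes into generating the PDF bytes or into writing them to disk, the two steps can be timed separately by saving to memory first. The sketch below is illustrative only: `SaveTimer` and `Measure` are hypothetical names, and the helper takes a generic callback so that it does not depend on any particular library.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class SaveTimer
{
    // Runs writeTo against an in-memory buffer, then copies the buffer to path.
    // Returns the milliseconds spent generating the bytes and the milliseconds
    // spent writing them to disk.
    public static (long GenerateMs, long WriteMs) Measure(Action<Stream> writeTo, string path)
    {
        using (var buffer = new MemoryStream())
        {
            var sw = Stopwatch.StartNew();
            writeTo(buffer);                      // e.g. serialise the document
            long generateMs = sw.ElapsedMilliseconds;

            sw.Restart();
            using (var file = new FileStream(path, FileMode.Create, FileAccess.Write))
            {
                buffer.Position = 0;
                buffer.CopyTo(file);              // pure disk write
            }
            long writeMs = sw.ElapsedMilliseconds;

            return (generateMs, writeMs);
        }
    }
}
```

With the code from the question this might be called as `SaveTimer.Measure(s => pdf.Save(s), newDocument.ExportPath)`, assuming the stream overload of `Save` is available in the library version in use and leaves the stream open. A rising `GenerateMs` would point at document generation, while a rising `WriteMs` would point at the disk or file system.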

As we understand it, you are generating each PDF document by adding 1-50 images to it. One PDF document is generated per iteration, and as more PDFs are saved, each subsequent PDF takes longer to save.

If our understanding is correct, please share a sample console application containing the code that reads image files from a directory and adds them to a PDF. We will test the scenario with our own sample files and share our feedback with you.