Is there a solution for the Parallel.ForEach issue when processing a PDF?

simon.fairey · March 25, 2018, 5:31pm

Hi

I know that something like this won’t be any faster than a single-threaded version:

        Parallel.ForEach(doc.Pages.OfType<Aspose.Pdf.Page>(), (page, state, counter) =>
        {
            using (MemoryStream s = new MemoryStream())
            {
                device.Process(page, s);
                s.Seek(0, SeekOrigin.Begin);
                using (Stream file = File.Create($"{outputFolder}\\{counter}_parallel.png"))
                {
                    s.CopyTo(file);
                }
            }
        });

Is there any way to get doc.Pages to populate itself before the pages are processed? I believe it’s only when device.Process runs that the data is actually read from the page/doc stream. If we could get the pages to pre-fill with data from the stream, we could then process them in parallel. Or is there maybe some way for the whole stream to be read into memory?

Thanks

Si

asad.ali · March 25, 2018, 9:54pm

@simon.fairey

Thanks for contacting support.

Would you please share some more details of the scenario, such as whether the API is taking more time or using more CPU than you expect when processing the PDF pages? As I understand it, you want to convert all pages of a PDF document into PNG images by calling the device.Process() method only once. If that is the case, would you please share a complete code snippet along with a sample PDF document? We will test the scenario in our environment and address it accordingly.

simon.fairey · March 25, 2018, 10:05pm

Hi

What I want is for it to use more than one CPU core. Because the Aspose code only allows one thread at a time to access the source file, running in parallel is pointless: the threads all queue up behind each other.

So on an 8-core machine, whether I run a normal loop or a parallel one, overall CPU usage never goes above about 18%.

If I save the PDF as, say, 200 single-page PDFs and then process each of those files in parallel, it runs at 100% CPU and takes a few seconds.
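As a rough sketch of that split-then-render workaround (not the exact code I ran): the names `SplitThenRender`, `inputPath`, `tempFolder` and `outputFolder` are hypothetical, the 150 dpi resolution is an arbitrary choice, and it assumes each task can safely own its own `Document` and `PngDevice` because nothing is shared between threads. The split step is still single-threaded, so its cost has to be counted against any gain from the parallel step.

```csharp
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

static class SplitThenRender
{
    // All parameter names are illustrative, not taken from the original code.
    public static void Run(string inputPath, string tempFolder, string outputFolder)
    {
        Directory.CreateDirectory(tempFolder);
        Directory.CreateDirectory(outputFolder);

        int pageCount;

        // Step 1 (single thread): write every page to its own one-page PDF.
        using (var source = new Document(inputPath))
        {
            pageCount = source.Pages.Count;
            for (int i = 1; i <= pageCount; i++) // Pages is 1-based
            {
                using (var single = new Document())
                {
                    single.Pages.Add(source.Pages[i]);
                    single.Save(Path.Combine(tempFolder, $"{i}.pdf"));
                }
            }
        }

        // Step 2 (parallel): each task opens its own file, so no two
        // threads read from the same source document.
        Parallel.ForEach(Enumerable.Range(1, pageCount), i =>
        {
            using (var single = new Document(Path.Combine(tempFolder, $"{i}.pdf")))
            using (var file = File.Create(Path.Combine(outputFolder, $"{i}_split.png")))
            {
                // One device per task; 150 dpi is an assumed value.
                var device = new PngDevice(new Resolution(150));
                device.Process(single.Pages[1], file);
            }
        });
    }
}
```

Called as `SplitThenRender.Run("input.pdf", "tmp", "out")`, this writes one PNG per page into the output folder.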

So the issue is how to process a single PDF in parallel and not be restricted to a single thread when reading the PDF.

For example, can we copy the pages to new objects that contain all the data and don’t need to read from the PDF while being processed, so that when you process them in parallel they can all run at once?

This (single thread):

        var index = 1;
        foreach (Aspose.Pdf.Page page in doc.Pages)
        {
            using (MemoryStream s = new MemoryStream())
            {
                device.Process(page, s);
                s.Seek(0, SeekOrigin.Begin);
                using (Stream file = File.Create($"{outputFolder}\\{index}_single.png"))
                {
                    s.CopyTo(file);
                }
                index++;
            }
        }

and this (multi-thread):

        Parallel.ForEach(doc.Pages.OfType<Aspose.Pdf.Page>(), (page, state, counter) =>
        {
            using (MemoryStream s = new MemoryStream())
            {
                device.Process(page, s);
                s.Seek(0, SeekOrigin.Begin);
                using (Stream file = File.Create($"{outputFolder}\\{counter}_parallel.png"))
                {
                    s.CopyTo(file);
                }
            }
        });

both take the same amount of time regardless of page count.

asad.ali · March 26, 2018, 9:12am

@simon.fairey

Thanks for sharing more details of the scenario.

We have created an investigation ticket, PDFNET-44413, in our issue tracking system for your requirement. We will investigate its feasibility further and keep you posted on the progress. Please be patient and spare us a little time.

We are sorry for the inconvenience.