Converting large PDF to HTML doesn't complete

AndrewN · November 6, 2018, 9:50am

Hi,

I am using the following code to convert PDFs to HTML. The process works extremely well most of the time; however, for large PDFs (around 30 MB) it never completes. I left the code running for 16 hours before aborting it, by which point the process was using over 4.5 GB of memory.

    /// <summary>
    /// Converts supplied byte array to HTML
    /// </summary>
    private byte[] ConvertPDFToHTML(byte[] fileBytes)
    {
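        // byteArray is a class-level byte[] field (declaration not shown); it is filled in by StrategyOfSavingHtml
        byteArray = null;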
        try
        {
            Document doc = new Document(new MemoryStream(fileBytes));

            HtmlSaveOptions saveOptions = new HtmlSaveOptions();
            saveOptions.FixedLayout = true;
            saveOptions.SplitIntoPages = false;
            saveOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UsePixelUnitsInCssLetterSpacingForIE;
            saveOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
            saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
            saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);
            doc.Save("dummy", saveOptions);
            return byteArray;
        }
        catch (Exception e)
        {
            Common.WriteErrorLog("PDFConverter", "ProcessControl.ConvertPDFToHTML(byte[]) failed with error " + e.ToString());
            return null;
        }
    }

    /// <summary>
    /// Used by AsposePDF saveOptions
    /// </summary>
    private void StrategyOfSavingHtml(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
    {
        // extract byte array of HTML document
        System.IO.BinaryReader reader = new BinaryReader(htmlSavingInfo.ContentStream);
        byteArray = reader.ReadBytes((int)htmlSavingInfo.ContentStream.Length);
    }

Can you please advise what may be going wrong?

asad.ali · November 6, 2018, 5:35pm

@AndrewN

Thanks for contacting support.

Could you please share your sample PDF document with us? If the file is large, please upload it to a public file-sharing service, e.g. Dropbox or Google Drive, and share the link with us. We will test the scenario in our environment and address it accordingly.

AndrewN · November 7, 2018, 9:52am

I have uploaded a sample file to Dropbox: MarketingOrderStats.pdf

asad.ali · November 7, 2018, 5:28pm

@AndrewN

Thanks for contacting support.

We were able to observe that, with your shared PDF document, the process took a long time and memory usage climbed up to 4 GB. We have therefore logged a performance issue as PDFNET-45642 in our issue tracking system. We will look further into the details of the issue and keep you posted on the status of its correction. Please be patient and allow us some time.

We are sorry for the inconvenience.
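
Until the logged issue is resolved, one general way to keep a caller from waiting 16 hours on a conversion like the one above is to bound the wait with a timeout. The following is only a minimal sketch and not an Aspose recommendation: `ConversionGuard` and `RunWithTimeout` are hypothetical names introduced here, and the sketch relies only on standard .NET `Task` APIs. It does not stop the conversion itself. The worker keeps running and keeps its memory after the timeout, so reclaiming the memory would require running the conversion in a separate process that can be killed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of Aspose.PDF): bounds how long the caller waits.
public static class ConversionGuard
{
    // Runs `convert` on a thread-pool thread and waits at most `timeout` for it.
    // Returns the result, or null if the conversion timed out or threw.
    public static byte[] RunWithTimeout(Func<byte[]> convert, TimeSpan timeout)
    {
        Task<byte[]> task = Task.Run(convert);
        try
        {
            // Wait returns true only if the task completed within the timeout.
            if (task.Wait(timeout))
            {
                return task.Result;
            }
        }
        catch (AggregateException)
        {
            // The conversion delegate threw; treat it the same as "no result".
            return null;
        }

        // Timed out: the worker is NOT aborted and continues in the background.
        return null;
    }
}
```

With the code from the first post, a call might look like `byte[] html = ConversionGuard.RunWithTimeout(() => ConvertPDFToHTML(fileBytes), TimeSpan.FromMinutes(30));`, where the 30 minute limit is an arbitrary example value. Because `ConvertPDFToHTML` stores its result in a shared field, overlapping calls on the same instance should be avoided.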