PDF JPEG image compression issue when exporting to HTML

Hi,

We’re converting a PDF document to HTML so that it can be presented to the user in that format. However, when we compress the JPEG images, they don’t become much smaller.

For example, if we don’t compress the image, the web request downloads about 15 MB.
Compressing the same image with quality 50, the web request downloads about 13 MB.

Is there a way to overcome this?

Below you can see the code used to compress all images in the document:

int idx = 1;
foreach (Aspose.Pdf.XImage pageImage in asposePage.Resources.Images)
{
    using (MemoryStream imageStream = new MemoryStream())
    {
        pageImage.Save(imageStream, System.Drawing.Imaging.ImageFormat.Jpeg);
        asposePage.Resources.Images.Replace(idx, imageStream, 10);
        idx = idx + 1;
    }
}

Thanks in advance,

Antonio Basilio

@abasilio

Thanks for contacting support.

Please try to optimize the PDF file size by following the instructions given in the “Optimize PDF Document” article in the API documentation, and then convert it to HTML. In case you still experience any issue, please share your sample input PDF file with us so that we can test the scenario in our environment and address it accordingly.
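For reference, a minimal sketch along the lines of that article looks like the following. It assumes the Aspose.Pdf for .NET assembly is referenced; `dataDir` and the file names are placeholders.

```csharp
// Sketch only: requires Aspose.Pdf for .NET; "input.pdf"/"optimized.pdf" are placeholder names.
using Aspose.Pdf;

Document pdfDocument = new Document(dataDir + "input.pdf");

// Re-compress images and drop unused objects/streams before converting to HTML.
pdfDocument.OptimizeResources(new Document.OptimizationOptions()
{
    CompressImages = true,
    ImageQuality = 50, // JPEG quality, 0-100; lower means smaller but blurrier
    RemoveUnusedObjects = true,
    RemoveUnusedStreams = true
});

pdfDocument.Save(dataDir + "optimized.pdf");
```

Optimizing before the HTML conversion matters because the converter embeds the page's image resources as-is; shrinking them in the PDF shrinks the resulting HTML as well.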

Hi Asad,

Thank you for your prompt reply. I’ve tried your solution and, even though it compresses more than the solution we were using, the output is still too large.

The PDF file I’m sending you is originally 11 MB, with 8 pages containing images. When processed, the request size for the first page is 7.3 MB and for the second page it’s 7 MB. After compressing, those sizes shrink to 4.8 MB and 4.5 MB. But the main question is: how can the first and second pages together be bigger than the entire original document?

To download the document we used for testing:
http://bit.ly/2hkxjAC

Thanks in advance,

Antonio

@abasilio

Thanks for sharing the sample PDF document.

Would you please share some more details about how you are verifying the page size of the PDF document, along with the complete code snippet you are using at your end to compress and convert the PDF document into HTML? This way we can test the scenario in our environment and address it accordingly.

Asad,

We’re splitting the PDF into pages and then saving each page as HTML. Then we check each HTML page size using Fiddler; basically we check the response size each time we open a page.

Below are the code methods used for this:

private Aspose.Pdf.Document _document;

// Here we set the Aspose document. I’ve included the change that I tried
// following your previous suggestion (commented).
public override void SetDocument(Stream file)
{
    file.Position = 0;
    Stream = new SmallBlockMemoryStream();
    file.CopyTo(Stream);
    Stream.Position = 0;
    _document = CreateDocument(Stream);
    _document.AllowReusePageContent = true;

    //_document.OptimizeResources(new Document.OptimizationOptions()
    //{
    //    LinkDuplcateStreams = true,
    //    RemoveUnusedObjects = true,
    //    RemoveUnusedStreams = true,
    //    CompressImages = true,
    //    ImageQuality = 50
    //});
}

// This method generates the HTML for each page. The commented code was another attempt to compress images.
// When opening the first page there are 2 requests; the first one includes the thumbnails and the second
// one doesn’t. We’re considering the second one's size in our tests.
private DocumentPreviewPage GetPagePreview(Page asposePage)
{
    //int idx = 1;
    //foreach (Aspose.Pdf.XImage pageImage in asposePage.Resources.Images)
    //{
    //    using (MemoryStream imageStream = new MemoryStream())
    //    {
    //        pageImage.Save(imageStream, System.Drawing.Imaging.ImageFormat.Jpeg);
    //        asposePage.Resources.Images.Replace(idx, imageStream, 50);
    //        idx = idx + 1;
    //    }
    //}

    var output = new MemoryStream();
    asposePage.SendTo(new PngDevice(), output);

    output.Position = 0;
    var image = Aspose.Imaging.Image.Load(output);
    byte[] thumbnailBytes;
    var options = new Aspose.Imaging.ImageOptions.PngOptions { ColorType = PngColorType.Truecolor };

    using (var thumbnail = new MemoryStream())
    {
        if (image.Width < image.Height)
            image.ResizeHeightProportionally(300, ResizeType.LanczosResample);
        else
            image.ResizeWidthProportionally(300, ResizeType.LanczosResample);

        image.Save(thumbnail, options);
        thumbnailBytes = thumbnail.ToArray();
    }

    var htmlBytes = GetPageHtml(asposePage);

    var subpath = string.Empty;
    if (FileManager.DefaultFileType == StorageType.PhysicalFile)
        subpath = "previews";

    var file = FileManager.Store(output.ToArray(), FileManager.DefaultFileType, Guid.NewGuid().ToString("N"),
                    ".png", subpath, "");
    return new DocumentPreviewPage
    {
        PreviewContentLink = file.FileHash,
        PageNumber = asposePage.Number,
        Width = image.Width,
        Height = image.Height,
        Thumbnail = thumbnailBytes,
        Html = htmlBytes
    };
}

Thanks in advance,
Let me know if you have any other questions.

@abasilio

Thanks for sharing the code snippet.

By looking at your code, it seems that you are extracting images from the PDF, reducing their size with Aspose.Imaging, generating single-page PDFs and converting them into HTML. I tried to execute your code snippet but could not run it successfully, as there were some undefined/missing objects and methods in it.

However, you can also convert PDF into HTML by using the HtmlSaveOptions.SplitIntoPages option provided by the Aspose.Pdf API. Please check the following code snippet, where I have used only Aspose.Pdf for .NET to convert PDF pages into HTML after optimizing the document.

Document pdf = new Document(dataDir + "teste-1.pdf");
pdf.OptimizeResources(new Document.OptimizationOptions()
{
    RemoveUnusedObjects = true,
    RemoveUnusedStreams = true,
    AllowReusePageContent = true,
    ImageQuality = 50,
    CompressImages = true, // it helps to reduce size but it makes the quality a bit lower
    ResizeImages = true,
    LinkDuplcateStreams = true
});
MemoryStream ms = new MemoryStream();
pdf.Save(ms);
ms.Seek(0, SeekOrigin.Begin);
pdf = new Document(ms);

HtmlSaveOptions saveoptions = new HtmlSaveOptions();
saveoptions.SplitIntoPages = true;
pdf.Save(dataDir + "teste-1_out.html", saveoptions);

For your reference, I have attached first page HTML as well.

Page1_Html.zip (1.3 MB)

Please use the suggested approach to convert your PDF into HTML. In case it does not meet your actual requirements, or my assumptions are not correct, please share a sample console application which demonstrates the complete functionality you are performing. We will look into the scenario again and share our feedback accordingly.

Hi Asad,

Thank you for your reply.

I’ll explain how my code works:

  • Each PDF is used to generate 2 types of objects for each page: a thumbnail and an HTML page.
  • The thumbnail is what the user clicks to open the relevant HTML page.
  • The HTML page is generated directly from the PDF page, not from the thumbnail.
  • At the end of this process we store every page in a database, not in the file system.

Having said that, how can I get all the document pages as streams instead of HTML files?

Thanks,
Antonio

@abasilio

Thanks for adding more details to the scenario.

Please check the following complete code snippet, which saves PDF pages into streams instead of HTML files.

private static void OptimizePDF(string dataDir)
{
    Document pdf = new Document(dataDir + "teste-1.pdf");
    pdf.OptimizeResources(new Document.OptimizationOptions()
    {
        RemoveUnusedObjects = true,
        RemoveUnusedStreams = true,
        AllowReusePageContent = true,
        ImageQuality = 50,
        CompressImages = true, // it helps to reduce size but it makes the quality a bit lower
        ResizeImages = true,
        LinkDuplcateStreams = true
    });
    MemoryStream ms = new MemoryStream();
    pdf.Save(ms);
    ms.Seek(0, SeekOrigin.Begin);
    pdf = new Document(ms);

    foreach (Page page in pdf.Pages)
    {
        Document doc = new Document();
        doc.Pages.Add(page);
        HtmlSaveOptions saveoptions = new HtmlSaveOptions();

        saveoptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        saveoptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;
        saveoptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
        saveoptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;

        saveoptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);
        doc.Save(dataDir + "somenonexistingfile.html", saveoptions);
    }
}

public static void StrategyOfSavingHtml(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    // Read the generated HTML content into a byte array
    BinaryReader reader = new BinaryReader(htmlSavingInfo.ContentStream);
    byte[] htmlAsByte = reader.ReadBytes((int)htmlSavingInfo.ContentStream.Length);
    Console.WriteLine("Html page processed with handler. Length of page's text in bytes is " + htmlAsByte.Length);

    // Here you can put code that will save the page's HTML to some storage, e.g. a database
    MemoryStream targetStream = new MemoryStream();
    targetStream.Write(htmlAsByte, 0, htmlAsByte.Length);

    // Just to check that targetStream contains the HTML pages
    using (FileStream fs = new FileStream(@"D:\Recent Working\" + Guid.NewGuid().ToString() + ".html", FileMode.Create))
    {
        targetStream.WriteTo(fs);
    }
}

In case of any further assistance, please feel free to contact us.