Get Images during PDF to HTML Conversion

rchilli · February 20, 2023, 10:35am

Hi,

When converting a PDF to HTML, the images in the PDF are saved as separate files at the output location.

com.aspose.pdf.Document doc = new com.aspose.pdf.Document(f.getAbsolutePath());
//step 1
doc.save(saveLocation+f.getName()+".html", SaveFormat.Html);

Is there a way to include the images as base64 inside the HTML document instead?

carlos.molina · February 20, 2023, 12:31pm

@rchilli,

This code will allow you to embed the images:

public void Logic(Document doc) throws Exception
{
    var saveOptions = new HtmlSaveOptions();        
            
    saveOptions.setPartsEmbeddingMode(HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml);
    saveOptions.setLettersPositioningMethod(LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss);
    saveOptions.setRasterImagesSavingMode(HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground); 

    doc.save(PartialPath + "_output.html", saveOptions);
}
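
As background on what embedding means in general: an HTML file can carry an image inline as a base64 data URI instead of linking to a separate file. The sketch below uses only the Java standard library, not Aspose. The class and method names (DataUriSketch, toDataUri, toImgTag) are made up for illustration, and it is an assumption that the embedded output of the converter looks similar.

```java
import java.util.Base64;

class DataUriSketch {
    // Builds a data URI of the form "data:<mime>;base64,<payload>" from raw bytes.
    static String toDataUri(String mimeType, byte[] content) {
        String encoded = Base64.getEncoder().encodeToString(content);
        return "data:" + mimeType + ";base64," + encoded;
    }

    // Wraps the data URI in an <img> tag so the HTML needs no external image file.
    static String toImgTag(String mimeType, byte[] content) {
        return "<img src=\"" + toDataUri(mimeType, content) + "\"/>";
    }

    public static void main(String[] args) {
        // Dummy bytes stand in for real image data here.
        byte[] fakeImage = {1, 2, 3};
        System.out.println(toImgTag("image/png", fakeImage));
    }
}
```

Base64 makes the payload roughly a third larger than the raw bytes, so a single self-contained HTML file is usually bigger than the HTML plus separate image files.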

rchilli · February 20, 2023, 1:09pm

I am working with Aspose.PDF for Java, could you please share the code for that?

carlos.molina · February 20, 2023, 1:30pm

@rchilli,

The code is Java.

rchilli · February 20, 2023, 1:58pm

The var keyword was introduced in Java 10, and I am using Java 8.

carlos.molina · February 20, 2023, 2:00pm

@rchilli,

Then replace var with the type on the right-hand side of the assignment, HtmlSaveOptions.

rchilli · February 20, 2023, 2:30pm

Thanks. I was using Aspose.PDF 21, and this HTML option was not available there.
After updating Aspose.PDF to version 23, it works fine.

rchilli · March 1, 2023, 8:03am

Hi,

For some PDFs the conversion takes a long time. Can you please check and provide a solution?
Rohit.zip (228.3 KB)

rchilli · March 1, 2023, 1:51pm

Can you please check?

carlos.molina · March 1, 2023, 2:03pm

@rchilli,

I am.

It takes longer than other documents because of the large number of items that have to be converted.

Conversion time is not fixed. The more complex your document is, the longer it takes to convert.

I ran it in my local environment and my code took:

The Aspose.PDF API is on-premise software, which means all processing runs locally on the machine that runs the program. The faster the machine, the quicker the execution will be.
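
Since the timing depends on both the document and the machine, it can help to measure it where the code actually runs. Below is a minimal sketch using only the Java standard library. ConversionTimer and timeMillis are hypothetical names, and the Runnable is a placeholder for whatever wraps the real doc.save(...) call.

```java
import java.util.concurrent.TimeUnit;

class ConversionTimer {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder workload; in practice this would wrap the conversion call.
        long elapsed = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            System.out.println("Work done, sum = " + sum);
        });
        System.out.println("Elapsed: " + elapsed + " ms");
    }
}
```

Timing each input file separately makes it easier to see which documents are the slow ones before changing any save options.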

rchilli · March 2, 2023, 8:48am

Will the time be reduced if we remove the images while converting to HTML?
If yes, could you please share the code for that, or for removing the images from the PDF?

I tried the following, but it still saves the images at the output location:
try (com.aspose.pdf.Document doc = new com.aspose.pdf.Document(fileData))
{
    // Build a unique output path from the current timestamp and the original file name
    String fileNamePath = GlobalConstants.htmlFolder + File.separator + new Date().getTime() + "_" + resumeInfo.getFileName() + ".html";
    doc.save(fileNamePath, SaveFormat.Html);
}
catch (Exception ex)
{
    ex.printStackTrace();
}

carlos.molina · March 2, 2023, 12:50pm

@rchilli,

It all depends on what the objective is. If it is to display the document to clients, I would not suggest removing images, since they would see a different document from the original. So I really think you should not consider this option.
Code sample:

private void LogicAlt2()
{
    var docWithImages = new Document($"{PartialPath}_input.pdf");

    foreach (var page in docWithImages.Pages)
    {
        // The collection shrinks after every delete, so keep removing
        // the first image until none are left on this page.
        while (page.Resources.Images.Count > 0)
        {
            page.Resources.Images.Delete(1);
        }
    }

    var saveOptions = new HtmlSaveOptions();

    saveOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    saveOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;

    docWithImages.Save($"{PartialPath}WithoutImage_output.html", saveOptions);
}
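
A note on removing every item from a collection by index: a counting loop whose bound is the live size stops early, because the size shrinks while the counter grows. The sketch below shows the effect with a plain java.util.List (0-based, unlike the 1-based index passed to Delete in the sample). The class and method names are hypothetical and it does not use Aspose at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveAllByIndexSketch {
    // Pitfall: the size shrinks while the counter grows, so the loop stops about halfway.
    static <T> void removeWithCountingLoop(List<T> items) {
        for (int i = 0; i < items.size(); i++) {
            items.remove(0);
        }
    }

    // Safer: keep deleting the first element until nothing is left.
    static <T> void removeUntilEmpty(List<T> items) {
        while (!items.isEmpty()) {
            items.remove(0);
        }
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("img1", "img2", "img3", "img4"));
        removeWithCountingLoop(a);
        System.out.println("Left after counting loop: " + a.size()); // prints 2

        List<String> b = new ArrayList<>(Arrays.asList("img1", "img2", "img3", "img4"));
        removeUntilEmpty(b);
        System.out.println("Left after while loop: " + b.size()); // prints 0
    }
}
```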

If it is for display only, one option is to render the pages as images and display an image in the browser instead. This process also takes time.

I do not know your machine specs, but here are a couple of examples for you to try.

Code Sample:

private void LogicAlt()
{
    var doc = new Document($"{PartialPath}_input.pdf");

    // The using block disposes the converter, so no explicit Dispose() call is needed
    using (PdfConverter converter = new PdfConverter())
    {
        // Set the resolution to 300 DPI
        converter.Resolution = new Resolution(300);

        // Convert the whole PDF file to an image
        converter.BindPdf(doc);
        converter.StartPage = 1;
        converter.EndPage = doc.Pages.Count;
        converter.DoConvert();

        // Save all pages as a single TIFF file
        converter.SaveAsTIFF($"{PartialPath}_output.tiff");
    }
}

Another code Sample using PNG:

private void LogicAlt3()
{
    var doc = new Document($"{PartialPath}_input.pdf");
    Document newDocWithImages = new Document();

    int resolution = 300;
    PngDevice png = new PngDevice(new Resolution(resolution));

    // Keep the streams open until the documents are saved
    var imageStreams = new List<MemoryStream>();

    foreach (Page page in doc.Pages)
    {
        // Render the source page to PNG in memory (no temporary file needed)
        var imageStream = new MemoryStream();
        png.Process(page, imageStream);
        imageStream.Position = 0;
        imageStreams.Add(imageStream);

        // Place the rendered image on a new blank page of the new document
        var newPage = newDocWithImages.Pages.Add();
        newPage.Paragraphs.Add(new Aspose.Pdf.Image { ImageStream = imageStream });
    }

    newDocWithImages.Save($"{PartialPath}_output.pdf");

    var saveOptions = new HtmlSaveOptions();

    saveOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    saveOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;

    newDocWithImages.Save($"{PartialPath}AsImage_output.html", saveOptions);

    foreach (var stream in imageStreams)
    {
        stream.Dispose();
    }
}