Convert PDF to HTML string

msingh02 · April 3, 2019, 6:36am

Hi Team,

I was using Aspose.Words to convert a Word document to an HTML string. Can I get a similar example for converting a PDF to an HTML string?

Below is the example for a Word document.
private static string WordDocumentToHtml(Stream fileStream)
{
    var document = new Aspose.Words.Document(fileStream);
    var options = new Aspose.Words.Saving.HtmlSaveOptions()
    {
        ExportImagesAsBase64 = true,
        UseHighQualityRendering = true
    };
    using (var output = new MemoryStream())
    {
        document.Save(output, options);
        var html = Encoding.UTF8.GetString(output.GetBuffer(), 0, (int)output.Length);
        return html;
    }
}
I need a method that returns the converted HTML as a string. We need this HTML string to display in the TinyMCE editor. One problem we face is that TinyMCE does not support the g and svg elements of HTML.

Thanks,
Mukesh Singh

Farhan.Raza · April 3, 2019, 2:41pm

@msingh02

Thank you for contacting support.

You may use the code snippet below to convert a PDF document to HTML with Base64-embedded images using the Aspose.PDF for .NET API.

private static string PDFDocumentToHTML(Stream fileStream)
{
    var document = new Aspose.Pdf.Document(fileStream);
    // Type names are fully qualified so the snippet does not depend on a "using Aspose.Pdf;" directive.
    var htmlOptions = new Aspose.Pdf.HtmlSaveOptions();
    htmlOptions.PartsEmbeddingMode = Aspose.Pdf.HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    htmlOptions.LettersPositioningMethod = Aspose.Pdf.HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    htmlOptions.RasterImagesSavingMode = Aspose.Pdf.HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    using (var output = new MemoryStream())
    {
        document.Save(output, htmlOptions);
        var html = Encoding.UTF8.GetString(output.GetBuffer(), 0, (int)output.Length);
        return html;
    }
}

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

msingh02 · April 4, 2019, 5:50am

Thanks, @Farhan.Raza,

I tried your example, but I am getting an exception at run time.

System.ApplicationException: 'Inconsistent saving options detected : 'CustomStrategyOfCssUrlCreation','CustomCssSavingStrategy','CustomResourceSavingStrategy' may not be null when requested saving to stream!'

Please help.

Thanks,
Mukesh Singh

Farhan.Raza · April 4, 2019, 12:28pm

@msingh02

Thank you for the details.

We have investigated the scenario and would like to update you that saving the converted HTML into a stream is possible only by using saveOptions.CustomHtmlSavingStrategy.

Change the saving call document.save(resultOutputStream, saveOptions); as follows:

    // document.save(resultOutputStream, saveOptions);

    final ByteArrayOutputStream _outputHtmlStream = resultOutputStream;
    saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy()
    {
        public void invoke(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
        {
            savingToStream(htmlSavingInfo, _outputHtmlStream);
        }
    };
    String outHtmlFile = System.getProperty("java.io.tmpdir"); // Use any directory that exists.
    document.save(outHtmlFile, saveOptions);

And add the savingToStream method:

private static void savingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo, ByteArrayOutputStream stream)
{
    byte[] resultHtmlAsBytes;
    try
    {
        resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.available()];
        htmlSavingInfo.ContentStream.read(resultHtmlAsBytes, 0, resultHtmlAsBytes.length);
        stream.write(resultHtmlAsBytes, 0, resultHtmlAsBytes.length);
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
}
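
One caveat on the helper above: it sizes its buffer from available() and issues a single read(), and the general java.io.InputStream contract does not guarantee that either covers the whole stream. A minimal sketch of a more defensive copy, together with decoding the collected bytes into a string, could look like the following. The names HtmlStreamCopy, copyAll and toHtmlString are hypothetical and not part of Aspose.PDF, and UTF-8 is an assumption carried over from the C# snippets earlier in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Aspose.PDF API.
final class HtmlStreamCopy {

    // Copies every remaining byte of `in` into `out`,
    // looping until read() reports end of stream (-1).
    static void copyAll(InputStream in, ByteArrayOutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Decodes the collected bytes as UTF-8.
    // Assumption: the generated HTML is UTF-8 encoded.
    static String toHtmlString(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Under these assumptions, the body of savingToStream could call copyAll(htmlSavingInfo.ContentStream, stream), and toHtmlString(_outputHtmlStream) could be called once document.save has returned.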

We hope this will resolve the problem you are currently facing. Please let us know if you need any further assistance.

msingh02 · April 5, 2019, 6:15am

@Farhan.Raza It looks like this program is for Java. I need a C# example.

Thanks
Mukesh Singh

Farhan.Raza · April 5, 2019, 1:05pm

@msingh02

Kindly visit "PDF to HTML - Save HTML, CSS, Image, and Font Resources in Stream Object" for the .NET version of saving into a Stream object.