Pdf to html stream

BooleServer · December 5, 2014, 5:12am

Hello, we need to convert PDF files to an HTML stream, so we need to return one stream for the HTML and one or more streams for the CSS, fonts, etc.
Is there a way to do this? We tried some of your examples but could not reach our goal.
We cannot use any physical path.
Thanks

tilal.ahmad · December 7, 2014, 11:22pm

Hi Valerio,

Thanks for your inquiry. You can meet your requirement with the existing code sample for saving PDF to HTML into stream objects, which uses custom resource processing strategies. The point is:

Supply a non-existing path for the output HTML file and implement custom processing of all resources (HTML markup, CSS, images, fonts).

Details can be found in the articles about usage of custom resource processing strategies.

You must ensure that your custom code returns the desired URIs of resources. For example, fonts are referenced in CSS, so for the resulting CSS to refer to the saved font, the saving method must return the URI that the converter will put into the CSS body.

Please check the following code snippet, which is meant to be put in a simple console application. Note that the output HTML is not written anywhere, since per your requirement we avoid saving any HTML data to disk; you are expected to add code that writes the content bytes to some other place.

SaveOptions CustomHtmlSavingStrategy and HtmlSaveOptions configuration

public static void SavingOfAllPageHtmlsApart()
{
    Document doc = new Document(@"C:\PDFTest\NimbusSRP.pdf");

    // Note that we put a non-existing path here: since we use
    // customized resource processing, it won't be used.
    // If you forget to implement some of the required saving strategies
    // (CustomHtmlSavingStrategy, CustomResourceSavingStrategy, CustomCssSavingStrategy),
    // then saving will throw a "Path not found" exception
    string outHtmlFile = @"T:\SomeNonExistingFolder\NimbusSRP.html";

    // Create HtmlSaveOption with custom saving strategies that
    // will do all the saving job in HTML to stream objects
    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.SplitIntoPages = true;

    saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);
    saveOptions.CustomResourceSavingStrategy = new HtmlSaveOptions.ResourceSavingStrategy(CustomSaveOfFontsAndImages);
    saveOptions.CustomStrategyOfCssUrlCreation = new HtmlSaveOptions.CssUrlMakingStrategy(CssUrlMakingStrategy);
    saveOptions.CustomCssSavingStrategy = new HtmlSaveOptions.CssSavingStrategy(CustomSavingOfCss);

    saveOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;

    doc.Save(outHtmlFile, saveOptions);

    Console.WriteLine("Done");
    Console.ReadLine();
}

private static void StrategyOfSavingHtml(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    // read the page's HTML content from the stream supplied by the converter
    System.IO.BinaryReader reader = new System.IO.BinaryReader(htmlSavingInfo.ContentStream);
    byte[] htmlAsByte = reader.ReadBytes((int)htmlSavingInfo.ContentStream.Length);
    Console.WriteLine("Html page processed with handler. Length of page's text in bytes is " + htmlAsByte.Length);

    // Here you can put code that will save the page's HTML to some storage, e.g. a database
    System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
    targetStream.Write(htmlAsByte, 0, htmlAsByte.Length);
}

private static string CssUrlMakingStrategy(HtmlSaveOptions.CssUrlRequestInfo requestInfo)
{
    string template = "style{0}.css";
    // one more example of template :
    //string template = "http://localhost:24661/document-viewer/GetResourceForHtmlHandler?documentPath=Deutschland201207Arbeit.pdf&resourcePath=style{0}.css&fileNameOnly=false";
    return template;
}

private static void CustomSavingOfCss(HtmlSaveOptions.CssSavingInfo resourceInfo)
{
    System.IO.BinaryReader reader = new System.IO.BinaryReader(resourceInfo.ContentStream);
    byte[] cssAsBytes = reader.ReadBytes((int)resourceInfo.ContentStream.Length);
    Console.WriteLine("Css page processed with handler. Length of CSS in bytes is " + cssAsBytes.Length);

    // Here you can put code that will save the CSS to some storage, e.g. a database
    System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
    targetStream.Write(cssAsBytes, 0, cssAsBytes.Length);
}

private static string CustomSaveOfFontsAndImages(Aspose.Pdf.SaveOptions.ResourceSavingInfo resourceSavingInfo)
{
    // read the resource content (font or image) supplied by the converter
    System.IO.BinaryReader reader = new System.IO.BinaryReader(resourceSavingInfo.ContentStream);
    byte[] resourceAsBytes = reader.ReadBytes((int)resourceSavingInfo.ContentStream.Length);

    if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Font)
    {
        Console.WriteLine("Font processed with handler. Length of content in bytes is " + resourceAsBytes.Length);
        // Here you can put code that will save the font to some storage, e.g. a database
        System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
        targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
    }
    else if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Image)
    {
        Console.WriteLine("Image processed with handler. Length of content in bytes is " + resourceAsBytes.Length);
        // Here you can put code that will save the image to some storage, e.g. a database
        System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
        targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
    }

    // We should return the URI by which the resource will be referenced in CSS (for fonts)
    // or HTML (for images). This is a very simplistic way - here we just return
    // the file name of the resource.
    // You can put here some URI that will include the ID of the resource in a database etc.
    // - this URI will be added into the result CSS or HTML to refer to the resource
    return resourceSavingInfo.SupposedFileName;
}
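
As a rough sketch of where the handlers above could put their bytes instead of a throwaway MemoryStream: the class below keeps every resource in memory under its file name and returns the URI to hand back to the converter. `InMemoryResourceStore`, its `Save` and `Get` methods and the `/resources/` URI prefix are hypothetical names chosen for illustration and are not part of Aspose.Pdf. Inside `CustomSaveOfFontsAndImages` you would call something like `store.Save(resourceSavingInfo.SupposedFileName, resourceSavingInfo.ContentStream)` and return its result.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of Aspose.Pdf: keeps converted resources in
// memory and hands back the URI under which each one can later be served.
public class InMemoryResourceStore
{
    private readonly Dictionary<string, byte[]> _items = new Dictionary<string, byte[]>();
    private readonly string _uriPrefix;

    public InMemoryResourceStore(string uriPrefix)
    {
        _uriPrefix = uriPrefix;
    }

    // Copies the rest of the stream (CopyTo loops until the source is exhausted)
    // and returns the URI that the CSS or HTML should use for this resource.
    public string Save(string fileName, Stream content)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            content.CopyTo(buffer);
            _items[fileName] = buffer.ToArray();
        }
        return _uriPrefix + Uri.EscapeDataString(fileName);
    }

    // Looks a stored resource up again by the file name it was saved under.
    public byte[] Get(string fileName)
    {
        return _items[fileName];
    }
}

public static class ResourceStoreDemo
{
    public static void Main()
    {
        InMemoryResourceStore store = new InMemoryResourceStore("/resources/");
        string uri = store.Save("font1.woff", new MemoryStream(new byte[] { 1, 2, 3 }));
        Console.WriteLine(uri);
        Console.WriteLine(store.Get("font1.woff").Length);
    }
}
```

Whatever serves the HTML later must answer requests for the returned URIs from the same store; otherwise the fonts and images referenced in the CSS and HTML will not resolve.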

Please feel free to contact us for any further assistance.

Best Regards

BooleServer · December 9, 2014, 10:58am

OK, we’ll try your example… but let me say that we are looking for a solution along the lines of what we have for MS Office files, i.e. saving PDF files as HTML with images, CSS and fonts in WOFF format embedded (as base64).
Will that be possible with PDF?
thank you.

codewarior · December 10, 2014, 5:32am

Hi Valerio,

As I understand your requirement above, you need to render a PDF file in HTML with all images, CSS and fonts embedded inside it. If that is the case, then I am pleased to share that Aspose.Pdf for .NET supports this feature. Please try following the details shared by Tilal.

You may also consider visiting the following link for more information: Convert PDF File into HTML Format

BooleServer · December 10, 2014, 9:44am

yes, you’re right: we’d like to obtain one HTML with all parts embedded (WOFF fonts, CSS, images…)
We’re trying to use your example and we have read the documentation, but we can’t reach our goal.

At the end of the work, your method SavingOfAllPageHtmlsApart() executes
doc.Save(outHtmlFile, saveOptions);

It doesn’t save any real file, so how can we save the final converted HTML file?

Our method is the same as your SavingOfAllPageHtmlsApart() (and we use your suggested methods for the saving strategies), but at the end we need to return a stream containing the final HTML with everything embedded (another object will then put it in an HTML viewer or something similar).
Can you help us?
Thanks a lot

tilal.ahmad · December 11, 2014, 7:58am

Hi Valerio,

Thanks for your feedback.

“we need to convert pdf files in html stream, so we need to return a stream for html and some streams (or one only) with css, fonts etc…”

As you stated above in the start of thread, if you want to get separate streams against HTML, CSS, Images and Fonts then please use above suggested code. In that code HTML stream is not being saved in SavingOfAllPageHtmlsApart() but in separate saving strategies (CustomHtmlSavingStrategy,CustomResourceSavingStrategy,CustomCssSavingStrategy).

If you want to get a single HTML stream, then please check the documentation link for that purpose. Hopefully it will help you get the desired results.

Best Regards,

BooleServer · December 11, 2014, 9:25am

We checked the documentation but no example matches our scenario: we need to convert a PDF into one single stream with everything embedded (as we do with Aspose.Words for DOCX files). Sorry for the previous misunderstanding…

The example in your documentation works but creates all accessory files (images, WOFF, CSS…) in a physical path, and we can’t do it like this; moreover, the generated HTML stream can’t “link” to physical images, WOFF and CSS files because of the path.
We need to obtain one stream containing the HTML code with everything embedded in it. We do this with very little code for Aspose.Words:
Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
Aspose.Words.Saving.CssStyleSheetType.Embedded;
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);

obtaining an html stream file right for our purposes (with fonts, css, images and so on embedded - see this kind of conversion result html code in attachment)

How can we get the same result for PDF files?

tilal.ahmad · December 12, 2014, 7:52am

Hi Valerio,

Thanks for your feedback. You can embed all resources into a single stream using the following code sample. It tunes conversion in such a way that all output is forced to be embedded into the result HTML without external files, and then the result HTML is written into some stream with the code of a custom strategy for saving HTML. Hopefully, it will serve the purpose.

public static void PDFtoHTMLStream()
{
    Document doc = new Document(@"F:\ExternalTestsData\36608.pdf");

    // tune conversion params
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.SplitIntoPages = false; // force write HTMLs of all pages into one output document

    newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);

    // we can use some non-existing path as result file name - all real saving will be done
    // in our custom method SavingToStream() (it follows this one)
    string outHtmlFile = @"Z:\SomeNonExistingFolder\SomeUnexistingFile.html";
    doc.Save(outHtmlFile, newOptions);
}

private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    // here you can use any writable stream; a file stream is taken just as an example
    string fileName = @"F:\ExternalTestsData\37544_stream_out.html";
    // 'using' flushes and closes the file; CopyTo reads the source until it is exhausted,
    // whereas a single Stream.Read call is not guaranteed to fill the whole buffer
    using (Stream outStream = File.OpenWrite(fileName))
    {
        htmlSavingInfo.ContentStream.CopyTo(outStream);
    }
}

Please feel free to contact us for any further assistance.

Best Regards,

BooleServer · December 15, 2014, 9:05am

This works great, thank you!

Only two things more:
1. With this structure we need to use a static stream variable to pass the stream from the caller object to Aspose and back: would it be possible to pass a ref stream as an argument to the function SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)?
2. Is it possible to somehow set the background image as a background in the HTML (now it is saved as an img tag and not as a background)?

tilal.ahmad · December 16, 2014, 7:45am

Hi Valerio,

Thanks for your inquiry. It is good to know that you have managed to accomplish PDF to HTML in a single stream.

Moreover, we have logged the following two enhancement requests in our issue tracking system for further investigation and resolution. We will keep you updated about their progress in this forum thread.

PDFNEWNET-37952: support of stream parameter in CustomeHtmlSavingStrategy.

PDFNEWNET-37953: set Background image as background html tag.

Best Regards,

BooleServer · December 19, 2014, 10:18am

For us the best way could be the same code as, for example, aspose.words

Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
Aspose.Words.Saving.CssStyleSheetType.Embedded;
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);

that saves output stream with entire html file with all embedded…

thank you

tilal.ahmad · December 22, 2014, 12:54am

Hi Valerio,

Thanks for your feedback. The Aspose.Pdf implementation of saving PDF to HTML in a stream is a generalized solution based on saving strategies. One can save the output HTML and resources to a single (embedded) stream or to separate streams, as needed. Are you unable to get the desired results with the Aspose.Pdf for .NET code shared above?

However we have shared your comments with our development team and will keep you updated about the resolution progress of reported issues.

Best Regards,

codewarior · December 22, 2014, 2:37am

Hi Valerio,

As a workaround, for the time being you may consider converting PDF files to MS DOC/DOCX format and try using Aspose.Words to accomplish your requirements (as stated in above code). For further details, please visit Convert PDF to DOC or DOCX format

aspose.notifier · January 14, 2015, 5:45am

The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

BooleServer · January 14, 2015, 7:53am

hi, we just downloaded new version 10.0.0 but we can’t see any new parameter in CustomHtmlSavingStrategy

we use this syntax:

OptionsH.CustomHtmlSavingStrategy = new Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);

maybe we have to change something… but how?

tilal.ahmad · January 14, 2015, 11:08am

Hi Valerio,

Thanks for your inquiry. We have investigated your requirement (PDFNEWNET-37952) and would like to update you that the signature of the SavingToStream method cannot be changed just to add an additional parameter, since a delegate for this method must be passed to newOptions.CustomHtmlSavingStrategy.

private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
}

However, your goal (saving several output documents to the same static output stream) can easily be achieved with the existing functionality of Aspose.Pdf. We only need to change the custom output-saving handler a little. Here is a code snippet that does that; hopefully it will help you accomplish the task.

// It can be any writable stream, file stream used only as example
static Stream _staticOutStream = File.OpenWrite(@"F:\ExternalTestsData\static_stream_out.html");

public static void PDFtoStaticHTMLStream_37952()
{
    Document doc = new Document(@"F:\ExternalTestsData\HelloWorld.pdf");

    // Tune conversion params for first saving
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.SplitIntoPages = false; // force write HTMLs of all pages into one output document
    newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);

    // We can use some non-existing path as result file name - all real saving will be done
    // in our custom method SavingToStaticStream() (it follows this one)
    string outHtmlFile = @"Z:\SomeNonExistingFolder\HelloWorld.html";
    doc.Save(outHtmlFile, newOptions);

    // 2) saving one more document in same stream (saving will really take place in SavingToStaticStream() method)
    Document doc_2 = new Document(@"F:\ExternalTestsData\Test1.pdf");

    // 2.1) tune conversion params
    HtmlSaveOptions newOptions2 = new HtmlSaveOptions();
    newOptions2.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions2.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions2.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions2.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions2.SplitIntoPages = false; // force write HTMLs of all pages into one output document
    newOptions2.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);

    // 2.2) start saving itself
    outHtmlFile = @"Z:\SomeNonExistingFolder\Test1.html";
    doc_2.Save(outHtmlFile, newOptions2);

    Console.ReadKey();
}

private static void SavingToStaticStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    Console.WriteLine("Starting saving to static stream of output HTML document '" + htmlSavingInfo.SupposedFileName + "' ...");
    byte[] resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.Length];
    htmlSavingInfo.ContentStream.Read(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);

    // Locking ensures that only one thread at a time writes
    // to the static stream and avoids interference between
    // different threads (if any) saving to the same output stream
    lock (_staticOutStream)
    {
        _staticOutStream.Write(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
    }
    Console.WriteLine("Output HTML document '" + htmlSavingInfo.SupposedFileName + "' has been successfully saved to static stream.");
}
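
If the static field itself is the objection, note that a C# delegate can also wrap a lambda or an instance method, so a handler can capture a stream supplied by the caller without any change to the delegate signature. The sketch below shows the idea with stand-in types so that it runs without Aspose.Pdf: `PageMarkupInfo`, `PageMarkupStrategy` and `FakeConverter` are hypothetical and only mimic the shape of `HtmlPageMarkupSavingInfo` and `HtmlPageMarkupSavingStrategy` used above. With the real library, the same kind of lambda would presumably be passed to the `HtmlPageMarkupSavingStrategy` constructor; that part is an assumption and is not tested here.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in for HtmlSaveOptions.HtmlPageMarkupSavingInfo (hypothetical).
public class PageMarkupInfo
{
    public Stream ContentStream;
    public string SupposedFileName;
}

// Stand-in for HtmlSaveOptions.HtmlPageMarkupSavingStrategy (hypothetical).
public delegate void PageMarkupStrategy(PageMarkupInfo info);

// Stand-in for the converter: it just calls the strategy with some markup.
public static class FakeConverter
{
    public static void Save(string html, PageMarkupStrategy strategy)
    {
        PageMarkupInfo info = new PageMarkupInfo();
        info.ContentStream = new MemoryStream(Encoding.UTF8.GetBytes(html));
        info.SupposedFileName = "out.html";
        strategy(info);
    }
}

public static class ClosureDemo
{
    // The caller passes its own stream; no static field is involved.
    public static void ConvertTo(Stream output, string html)
    {
        // The lambda captures 'output', so the delegate keeps its
        // one-parameter signature while still writing to the caller's stream.
        PageMarkupStrategy strategy = info => info.ContentStream.CopyTo(output);
        FakeConverter.Save(html, strategy);
    }

    public static void Main()
    {
        using (MemoryStream output = new MemoryStream())
        {
            ConvertTo(output, "<html><body>Hello</body></html>");
            Console.WriteLine(output.Length);
        }
    }
}
```

Because each call gets its own captured stream, concurrent conversions would not share state and the lock used with the static stream would not be needed in this variant.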

Please feel free to contact us for any further assistance.

Best Regards,

BooleServer · January 15, 2015, 5:55am

Yes, this is the method we use…

our two requests were:
PDFNEWNET-37952: support of stream parameter in CustomeHtmlSavingStrategy.
PDFNEWNET-37953: set Background image as background html tag.

when you wrote:
The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.

we thought it was solved by adding a stream parameter to Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy,
avoiding a static member…

so 37952 is solved with no changes and 37953 is still under development, right?

thank you

tilal.ahmad · January 15, 2015, 9:38am

Hi Valerio,

Thanks for your inquiry. Yes, you are right: for PDFNEWNET-37952 we have suggested a workaround and shared sample code. However, PDFNEWNET-37953 is still pending investigation; we have asked our development team to complete the investigation at their earliest convenience and share their findings. We will keep you updated about the issue resolution.

Thanks for your patience and cooperation.

Best Regards,

BooleServer · January 13, 2016, 10:06am

Hi, any news about PDFNEWNET-37953 ?

codewarior · January 14, 2016, 5:13am

Hi Valerio,

Thanks for your patience.

I am afraid the issue reported earlier is still not resolved, but I have asked the product team to share the possible timeline for its resolution. As soon as we have a definite update regarding its resolution, we will let you know.

Your patience and understanding are greatly appreciated in this regard.