PDF to HTML stream

Hello, we need to convert PDF files to an HTML stream: that is, we need to get back one stream for the HTML and one or more streams for the CSS, fonts, etc.
Is there a way to do this? We tried some of your examples but could not reach our goal.
We cannot use any physical path.
Thanks

Hi Valerio,


Thanks for your inquiry. You can achieve this with the existing code sample for saving PDF to HTML into stream objects by using custom resource-processing strategies. The point is to supply a non-existent path for the output HTML file and implement custom processing of all resources (HTML markup, CSS, images, fonts).

saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);

You must ensure that your custom code returns the desired URIs of the resources. For example, fonts are referenced in CSS, so for the resulting CSS to refer to the saved font, the saving method must return the URI that the converter will put into the CSS body.

Please check the following code snippet, which is meant to be placed in a simple console application. Note that the output HTML is not created anywhere, since per your requirement we must avoid saving any HTML data to disk; you are expected to add code that writes the content bytes somewhere else.

// Requires: using System; using System.IO; using Aspose.Pdf;
public static void SavingOfAllPageHtmlsApart()
{
    Document doc = new Document(@"C:\PDFTest\NimbusSRP.pdf");

    // Note that we pass a non-existent path here: since custom resource processing is used, nothing is written to it.
    // If you forget to implement one of the required saving strategies (CustomHtmlSavingStrategy,
    // CustomResourceSavingStrategy, CustomCssSavingStrategy), saving will throw a "Path not found" exception.
    string outHtmlFile = @"T:\SomeNonExistingFolder\NimbusSRP.html";

    // Create HtmlSaveOptions with custom saving strategies that do all the saving work.
    // With this approach you can also split the HTML into pages if you wish.
    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.SplitIntoPages = true;
    saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);
    saveOptions.CustomResourceSavingStrategy = new HtmlSaveOptions.ResourceSavingStrategy(CustomSaveOfFontsAndImages);
    saveOptions.CustomStrategyOfCssUrlCreation = new HtmlSaveOptions.CssUrlMakingStrategy(CssUrlMakingStrategy);
    saveOptions.CustomCssSavingStrategy = new HtmlSaveOptions.CssSavingStrategy(CustomSavingOfCss);
    saveOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;

    doc.Save(outHtmlFile, saveOptions);

    Console.WriteLine("Done");
    Console.ReadLine();
}

private static void StrategyOfSavingHtml(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    // Read the page's HTML from the content stream supplied by the converter
    BinaryReader reader = new BinaryReader(htmlSavingInfo.ContentStream);
    byte[] htmlAsByte = reader.ReadBytes((int)htmlSavingInfo.ContentStream.Length);
    Console.WriteLine("Html page processed with handler. Length of page's text in bytes is " + htmlAsByte.Length);

    // Here you can put code that saves the page's HTML to some storage, e.g. a database
    MemoryStream targetStream = new MemoryStream();
    targetStream.Write(htmlAsByte, 0, htmlAsByte.Length);
}

private static string CssUrlMakingStrategy(HtmlSaveOptions.CssUrlRequestInfo requestInfo)
{
    string template = "style{0}.css";

    // One more example of a template:
    //string template = "http://localhost:24661/document-viewer/GetResourceForHtmlHandler?documentPath=Deutschland201207Arbeit.pdf&resourcePath=style{0}.css&fileNameOnly=false";
    return template;
}

private static void CustomSavingOfCss(HtmlSaveOptions.CssSavingInfo resourceInfo)
{
    // Read the CSS from the content stream supplied by the converter
    BinaryReader reader = new BinaryReader(resourceInfo.ContentStream);
    byte[] cssAsBytes = reader.ReadBytes((int)resourceInfo.ContentStream.Length);
    Console.WriteLine("Css page processed with handler. Length of css in bytes is " + cssAsBytes.Length);

    // Here you can put code that saves the CSS to some storage, e.g. a database
    MemoryStream targetStream = new MemoryStream();
    targetStream.Write(cssAsBytes, 0, cssAsBytes.Length);
}

private static string CustomSaveOfFontsAndImages(Aspose.Pdf.SaveOptions.ResourceSavingInfo resourceSavingInfo)
{
    BinaryReader reader = new BinaryReader(resourceSavingInfo.ContentStream);
    byte[] resourceAsBytes = reader.ReadBytes((int)resourceSavingInfo.ContentStream.Length);

    if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Font)
    {
        Console.WriteLine("Font processed with handler. Length of content in bytes is " + resourceAsBytes.Length);

        // Here you can put code that saves the font to some storage, e.g. a database
        MemoryStream targetStream = new MemoryStream();
        targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
    }
    else if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Image)
    {
        Console.WriteLine("Image processed with handler. Length of content in bytes is " + resourceAsBytes.Length);

        // Here you can put code that saves the image to some storage, e.g. a database
        MemoryStream targetStream = new MemoryStream();
        targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
    }

    // We must return the URI by which the resource will be referenced in the CSS (for fonts)
    // or in the HTML (for images).
    // This is a very simplistic way: here we just return the file name of the resource.
    // You can instead return a URI that includes the ID of the resource in a database, etc.;
    // this URI will be added to the resulting CSS or HTML to refer to the resource.
    return resourceSavingInfo.SupposedFileName;
}
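
The comments in the last handler suggest returning a URI that identifies the resource in your own storage. A minimal sketch of that idea, written against plain .NET only: an in-memory store keeps each resource's bytes under a generated key and returns the URI to hand back from the saving strategy. The `InMemoryResourceStore` class, its methods, and the `resources/<n>/<name>` URI scheme are hypothetical names chosen for illustration; none of them is part of Aspose.Pdf.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not an Aspose.Pdf type): keeps resource bytes in memory
// and hands out a URI for each stored resource.
public sealed class InMemoryResourceStore
{
    private readonly Dictionary<string, byte[]> _items = new Dictionary<string, byte[]>();
    private readonly object _sync = new object();

    // Copies the whole stream into memory and returns the URI under which the
    // resource can be looked up later. The "resources/" prefix is an assumption.
    public string Save(Stream content, string suggestedFileName)
    {
        using (var buffer = new MemoryStream())
        {
            content.CopyTo(buffer);
            lock (_sync)
            {
                string uri = "resources/" + _items.Count + "/" + suggestedFileName;
                _items[uri] = buffer.ToArray();
                return uri;
            }
        }
    }

    // Returns the bytes stored under the given URI.
    public byte[] Load(string uri)
    {
        lock (_sync)
        {
            return _items[uri];
        }
    }
}

public static class ResourceStoreDemo
{
    public static void Main()
    {
        var store = new InMemoryResourceStore();
        string uri = store.Save(new MemoryStream(new byte[] { 1, 2, 3 }), "font1.woff");
        Console.WriteLine(uri + " -> " + store.Load(uri).Length + " bytes");
        // prints "resources/0/font1.woff -> 3 bytes"
    }
}
```

Inside a handler such as CustomSaveOfFontsAndImages, one would presumably call `store.Save(resourceSavingInfo.ContentStream, resourceSavingInfo.SupposedFileName)` and return the result. Whatever serves the HTML then has to resolve those URIs back to the stored bytes.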

Please feel free to contact us for any further assistance.


Best Regards,

OK, we’ll try your example. But let us explain: we are looking for a solution that works the same way as for MS Office files, i.e. saving PDF files as HTML with images, CSS and fonts in WOFF format embedded (as Base64).
Will that be possible with PDF?
Thank you.

Hi Valerio,


As I understand your requirement, you need to render a PDF file as HTML with all images, CSS and fonts embedded in it. If so, I am pleased to share that Aspose.Pdf for .NET supports this feature. Please try following the details shared by Tilal.

You may also consider visiting the following link for more information: Convert PDF File into HTML Format

Yes, you’re right: we’d like to obtain one HTML file with all parts embedded (WOFF fonts, CSS, images…).
We’re trying to use your example and we have read the documentation, but we can’t reach our goal.

At the end of its work, your method SavingOfAllPageHtmlsApart() executes
doc.Save(outHtmlFile, saveOptions);

It doesn’t save any real file, so how can we save the final converted HTML file?

Our method is the same as your SavingOfAllPageHtmlsApart() (and we use your suggested methods for the saving strategies), but at the end we need to return a stream containing the final HTML with everything embedded (another object will then put it in an HTML viewer or something similar).
Can you help us?
Thanks a lot

Hi Valerio,


Thanks for your feedback.

"we need to convert pdf files in html stream, so we need to return a stream for html and some streams (or one only) with css, fonts etc…"

As you stated at the start of the thread, if you want separate streams for HTML, CSS, images and fonts, then please use the code suggested above. In that code the HTML stream is not saved in SavingOfAllPageHtmlsApart() but in the separate saving strategies (CustomHtmlSavingStrategy, CustomResourceSavingStrategy, CustomCssSavingStrategy).

If you want a single HTML stream, then please check the documentation link for that purpose. Hopefully it will help you get the desired results.

Best Regards,

We checked the documentation but no example matches our scenario: we need to convert a PDF into one single stream with everything embedded (as we do with Aspose.Words for DOCX files). Sorry for the previous misunderstanding…

The example in your documentation works but creates all the accessory files (images, WOFF, CSS…) in a physical path, and we can’t do that; moreover, the generated HTML stream can’t “link” to physical images, WOFF and CSS files because of the path.
We need to obtain one stream containing the HTML code with everything embedded in it. We do this with very little code for Aspose.Words:
Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
Aspose.Words.Saving.CssStyleSheetType.Embedded;
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);


This gives us an HTML stream that suits our purposes (with fonts, CSS, images and so on embedded; see the HTML produced by this kind of conversion in the attachment).

How can we get the same result for PDF files?

Hi Valerio,


Thanks for your feedback. You can embed all resources into a single stream using the following code sample. It tunes the conversion so that all output is forced to be embedded into the resulting HTML without external files, and the resulting HTML is then written to a stream by the custom HTML saving strategy. Hopefully it will serve the purpose.


// Requires: using System.IO; using Aspose.Pdf;
public static void PDFtoHTMLStream()
{
    Document doc = new Document(@"F:\ExternalTestsData\36608.pdf");

    // Tune conversion parameters
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.SplitIntoPages = false; // force the HTML of all pages into one output document
    newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);

    // We can use a non-existent path as the result file name: all real saving is done
    // in our custom method SavingToStream() (it follows this one).
    string outHtmlFile = @"Z:\SomeNonExistingFolder\SomeUnexistingFile.html";
    doc.Save(outHtmlFile, newOptions);
}

private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    // Any writable stream can be used here; a file stream is taken just as an example.
    string fileName = @"F:\ExternalTestsData\37544_stream_out.html";

    // File.Create truncates an existing file, and "using" closes the stream so the data is flushed.
    using (Stream outStream = File.Create(fileName))
    {
        // CopyTo reads until the content stream is exhausted; a single Read call may return fewer bytes.
        htmlSavingInfo.ContentStream.CopyTo(outStream);
    }
}
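
For readers wondering what "embedded" means in practice: single-file HTML of this kind commonly relies on `data:` URIs, where a resource's bytes are Base64-encoded directly into the markup or CSS. The sketch below shows how such a URI is built with plain .NET. Whether EmbedAllIntoHtml produces exactly this form is an assumption, and `DataUri.Build` is a hypothetical helper, not an Aspose.Pdf call.

```csharp
using System;
using System.Text;

public static class DataUri
{
    // Builds "data:<mediaType>;base64,<payload>" (the data URL scheme, RFC 2397).
    public static string Build(string mediaType, byte[] content)
    {
        return "data:" + mediaType + ";base64," + Convert.ToBase64String(content);
    }
}

public static class DataUriDemo
{
    public static void Main()
    {
        byte[] css = Encoding.UTF8.GetBytes("p{}");
        Console.WriteLine(DataUri.Build("text/css", css));
        // prints "data:text/css;base64,cHt9"
    }
}
```

A WOFF font would be embedded the same way, with `font/woff` as the media type and the font file's bytes as the content.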

Please feel free to contact us for any further assistance.


Best Regards,

This works great, thank you!

Just two more things:
1. With this structure we need a static stream variable to pass the stream from the caller object to Aspose and back. Would it be possible to pass a ref stream as an argument to the function SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)?
2. Is it possible somehow to set the background image as a background in the HTML (currently it is saved as an img tag and not as a background)?

Hi Valerio,

Thanks for your inquiry. It is good to know that you have managed to convert PDF to HTML in a single stream.

Moreover, we have logged the following two enhancement issues in our issue tracking system for further investigation and resolution of your requirements. We will keep you updated on their progress in this forum thread.

PDFNEWNET-37952: support of stream parameter in CustomeHtmlSavingStrategy.
PDFNEWNET-37953: set Background image as background html tag.

Best Regards,

For us the best way would be the same kind of code as, for example, with Aspose.Words:

Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
Aspose.Words.Saving.CssStyleSheetType.Embedded;
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);


which saves the entire HTML file, with everything embedded, to the output stream…

thank you

Hi Valerio,


Thanks for your feedback. The Aspose.Pdf implementation of saving PDF to HTML in a stream is a generalized solution based on saving strategies. One can save the output HTML and resources to a single (embedded) stream or to separate streams as needed. Are you unable to get the desired results with the Aspose.Pdf for .NET code shared above?

However, we have shared your comments with our development team and will keep you updated on the progress of the reported issues.

Best Regards,

Hi Valerio,

As a workaround, for the time being you may consider converting the PDF files to MS DOC/DOCX format and then using Aspose.Words to accomplish your requirements (as in the code above). For further details, please visit Convert PDF to DOC or DOCX format

The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.


This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

Hi, we just downloaded the new version 10.0.0, but we can’t see any new parameter in CustomHtmlSavingStrategy.

We use this syntax:

OptionsH.CustomHtmlSavingStrategy = new Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);

Maybe we have to change something… but how?

Hi Valerio,


Thanks for your inquiry. We have investigated your requirement (PDFNEWNET-37952) and would like to update you that the signature of the SavingToStream method cannot be changed to add an additional parameter, since a delegate of this method must be passed to newOptions.CustomHtmlSavingStrategy.

private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
}


However, your goal (saving several output documents to the same static output stream) can easily be achieved with the existing functionality of Aspose.Pdf. We only need to change the custom output-saving handler a bit. Here is a code snippet that does that; hopefully it will help you accomplish the task.


// Requires: using System; using System.IO; using Aspose.Pdf;
// This can be any writable stream; a file stream is used only as an example.
static Stream _staticOutStream = File.OpenWrite(@"F:\ExternalTestsData\static_stream_out.html");

public static void PDFtoStaticHTMLStream_37952()
{
    // 1) Save the first document (the real saving takes place in SavingToStaticStream())
    Document doc = new Document(@"F:\ExternalTestsData\HelloWorld.pdf");

    // 1.1) Tune conversion parameters
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.SplitIntoPages = false; // force the HTML of all pages into one output document
    newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);

    // 1.2) We can use a non-existent path as the result file name: all real saving is done
    // in our custom method SavingToStaticStream() (it follows this one).
    string outHtmlFile = @"Z:\SomeNonExistingFolder\HelloWorld.html";
    doc.Save(outHtmlFile, newOptions);

    // 2) Save one more document to the same stream
    Document doc_2 = new Document(@"F:\ExternalTestsData\Test1.pdf");

    // 2.1) Tune conversion parameters
    HtmlSaveOptions newOptions2 = new HtmlSaveOptions();
    newOptions2.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions2.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions2.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions2.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions2.SplitIntoPages = false; // force the HTML of all pages into one output document
    newOptions2.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);

    // 2.2) Start the saving itself (using the second options object)
    outHtmlFile = @"Z:\SomeNonExistingFolder\Test1.html";
    doc_2.Save(outHtmlFile, newOptions2);

    Console.ReadKey();
}

private static void SavingToStaticStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    Console.WriteLine("Starting saving to static stream of output HTML document '" + htmlSavingInfo.SupposedFileName + "' ...");

    // Locking ensures that only one thread at a time writes to the static stream,
    // so different threads (if any) cannot interleave their output in the same stream.
    lock (_staticOutStream)
    {
        // CopyTo reads until the content stream is exhausted; a single Read call may return fewer bytes.
        htmlSavingInfo.ContentStream.CopyTo(_staticOutStream);
        _staticOutStream.Flush();
    }

    Console.WriteLine("Output HTML document '" + htmlSavingInfo.SupposedFileName + "' has been successfully saved to static stream.");
}
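
A static field is not the only way to get a stream into the handler. A C# delegate can be bound to a lambda or to an instance method, and either can carry a caller-supplied stream along while keeping the required one-argument signature. The sketch below demonstrates the pattern with stand-in types (`PageInfo`, `PageSavingStrategy`, `FakeConverter`) that only imitate the shape of HtmlPageMarkupSavingInfo and HtmlPageMarkupSavingStrategy. They are hypothetical and not the Aspose.Pdf API, so the example runs without the library.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in (hypothetical) for the info object passed to the saving handler.
public sealed class PageInfo
{
    public Stream ContentStream;
    public string SupposedFileName;
}

// Stand-in (hypothetical) for the one-argument saving delegate.
public delegate void PageSavingStrategy(PageInfo info);

public static class FakeConverter
{
    // Pretends to convert a document and hands the "HTML" to the strategy.
    public static void Save(string html, PageSavingStrategy strategy)
    {
        using (var content = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            strategy(new PageInfo { ContentStream = content, SupposedFileName = "out.html" });
        }
    }
}

public static class ClosureDemo
{
    // Returns a one-argument handler that writes into the caller's stream.
    // The lambda captures "target", so no static field is involved.
    public static PageSavingStrategy WriteTo(Stream target)
    {
        return info => info.ContentStream.CopyTo(target);
    }

    public static void Main()
    {
        using (var output = new MemoryStream())
        {
            FakeConverter.Save("<html></html>", WriteTo(output));
            Console.WriteLine(output.Length + " bytes captured");
            // prints "13 bytes captured"
        }
    }
}
```

With Aspose.Pdf the same idea would presumably look like `options.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(info => info.ContentStream.CopyTo(output));`, which has not been tested here. Giving each conversion its own target stream this way also removes the need for the lock.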

Please feel free to contact us for any further assistance.


Best Regards,



Yes, this is the method we use…

Our two requests were:
PDFNEWNET-37952: support of stream parameter in CustomeHtmlSavingStrategy.
PDFNEWNET-37953: set Background image as background html tag.

when you wrote:
The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.

we thought it had been solved by adding a stream parameter to Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy,
avoiding a static member…

So 37952 is solved with no changes and 37953 is still under development, right?

thank you

Hi Valerio,


Thanks for your inquiry. Yes, you are right: for PDFNEWNET-37952 we have suggested a workaround and shared sample code. However, PDFNEWNET-37953 is still pending investigation; we have asked our development team to complete the investigation at their earliest convenience and share their findings. We will keep you updated on the issue's resolution.

Thanks for your patience and cooperation.

Best Regards,

Hi, any news about PDFNEWNET-37953 ?

Hi Valerio,


Thanks for your patience.

I am afraid the issue reported earlier is still not resolved, but I have asked the product team to share a possible timeline for its resolution. As soon as we have a definite update regarding its resolution, we will let you know.

Your patience and understanding are greatly appreciated in this regard.