Hello, we need to convert PDF files to an HTML stream, so we need to return one stream for the HTML and one or more streams with the CSS, fonts, etc.
Is there a way to do this? We tried some of your examples but could not reach our goal.
We cannot use any physical path.
Thanks
Hi Valerio,
Thanks for your inquiry. You can achieve this with the existing code sample for saving PDF to HTML into stream objects, which uses custom resource processing strategies. The point is:
Supply a non-existing path for the output HTML file and implement custom processing of all resources (HTML markup, CSS, images, fonts).
Details can be found in the articles about usage of custom resource processing strategies.
You must ensure that your custom code returns the desired URIs of resources. For example, fonts are referenced in CSS, so for the resulting CSS to refer to a saved font, the saving method must return the URI that the converter will put into the CSS body to reference that font.
Please check the following code snippet, which is meant to be placed in a simple console application. Note that the output HTML is not created anywhere, since per your requirement we avoid saving any HTML data to disk; you are expected to add code that writes the content bytes somewhere else.
SaveOptions CustomHtmlSavingStrategy and HtmlSaveOptions configuration
public static void SavingOfAllPageHtmlsApart()
{
Document doc = new Document(@"C:\PDFTest\NimbusSRP.pdf");
// Note that we put a non-existing path here: since we use
// customized resource processing, it won't be used.
// If you forget to implement one of the required saving
// strategies (CustomHtmlSavingStrategy, CustomResourceSavingStrategy, CustomCssSavingStrategy),
// saving will throw a "Path not found" exception
string outHtmlFile = @"T:\SomeNonExistingFolder\NimbusSRP.html";
// Create HtmlSaveOption with custom saving strategies that
// will do all the saving job in HTML to stream objects
HtmlSaveOptions saveOptions = new HtmlSaveOptions();
saveOptions.SplitIntoPages = true;
saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);
saveOptions.CustomResourceSavingStrategy = new HtmlSaveOptions.ResourceSavingStrategy(CustomSaveOfFontsAndImages);
saveOptions.CustomStrategyOfCssUrlCreation = new HtmlSaveOptions.CssUrlMakingStrategy(CssUrlMakingStrategy);
saveOptions.CustomCssSavingStrategy = new HtmlSaveOptions.CssSavingStrategy(CustomSavingOfCss);
saveOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
doc.Save(outHtmlFile, saveOptions);
Console.WriteLine("Done");
Console.ReadLine();
}
private static void StrategyOfSavingHtml(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
// Read the generated page markup from the content stream
System.IO.BinaryReader reader = new System.IO.BinaryReader(htmlSavingInfo.ContentStream);
byte[] htmlAsByte = reader.ReadBytes((int)htmlSavingInfo.ContentStream.Length);
Console.WriteLine("Html page processed with handler. Length of page's text in bytes is " + htmlAsByte.Length.ToString());
// Here you can put code that saves the page's HTML to some storage, e.g. a database
System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
targetStream.Write(htmlAsByte, 0, htmlAsByte.Length);
}
private static string CssUrlMakingStrategy(HtmlSaveOptions.CssUrlRequestInfo requestInfo)
{
string template = "style{0}.css";
// one more example of template :
//string template = "http://localhost:24661/document-viewer/GetResourceForHtmlHandler?documentPath=Deutschland201207Arbeit.pdf&resourcePath=style{0}.css&fileNameOnly=false";
return template;
}
private static void CustomSavingOfCss(HtmlSaveOptions.CssSavingInfo resourceInfo)
{
// Read the generated CSS from the content stream
System.IO.BinaryReader reader = new System.IO.BinaryReader(resourceInfo.ContentStream);
byte[] cssAsBytes = reader.ReadBytes((int)resourceInfo.ContentStream.Length);
Console.WriteLine("Css page processed with handler. Length of CSS in bytes is " + cssAsBytes.Length.ToString());
// Here you can put code that saves the CSS to some storage, e.g. a database
System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
targetStream.Write(cssAsBytes, 0, cssAsBytes.Length);
}
private static string CustomSaveOfFontsAndImages(Aspose.Pdf.SaveOptions.ResourceSavingInfo resourceSavingInfo)
{
// Read the resource (font or image) from the content stream
System.IO.BinaryReader reader = new System.IO.BinaryReader(resourceSavingInfo.ContentStream);
byte[] resourceAsBytes = reader.ReadBytes((int)resourceSavingInfo.ContentStream.Length);
if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Font)
{
Console.WriteLine("Font processed with handler. Length of content in bytes is " + resourceAsBytes.Length.ToString());
// Here you can put code that saves the font to some storage, e.g. a database
System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
}
else if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Image)
{
Console.WriteLine("Image processed with handler. Length of content in bytes is " + resourceAsBytes.Length.ToString());
// Here you can put code that saves the image to some storage, e.g. a database
System.IO.MemoryStream targetStream = new System.IO.MemoryStream();
targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
}
// We must return the URI by which the resource will be referenced in CSS (for fonts)
// or in HTML (for images). This is a very simplistic way: here we just return
// the file name of the resource.
// You can return some URI that includes the ID of the resource in a database etc.;
// this URI will be added into the resulting CSS or HTML to refer to the resource
return resourceSavingInfo.SupposedFileName;
}
Please feel free to contact us for any further assistance.
Best Regards
OK, we’ll try your example. But let me add that we are looking for a solution that works the same way as for MS Office files: saving PDF files as HTML with images, CSS and fonts in WOFF format embedded (as base64).
Will that be possible with PDF?
thank you.
Hi Valerio,
Yes, you’re right: we’d like to obtain one HTML with all parts embedded (WOFF fonts, CSS, images…).
We’re trying to use your example and we have read the documentation, but we can’t reach our goal.
At the end of its work, your method SavingOfAllPageHtmlsApart() executes
doc.Save(outHtmlFile, saveOptions);
It doesn’t save any real file, so how can we save the final converted HTML file?
Our method is the same as your SavingOfAllPageHtmlsApart() (and we use your suggested methods for the saving strategies), but at the end we need to return a stream containing the final HTML with everything embedded (another object should then put it in an HTML viewer or something similar).
Can you help us?
Thanks a lot
Hi Valerio,
We checked the documentation but none of the examples matches our scenario: we need to convert a PDF into one single stream with everything embedded (as we do with Aspose.Words for DOCX files). Sorry for the previous misunderstanding…
The example in your documentation works but creates all accessory files (images, WOFF, CSS…) in a physical path, and we can’t do it like this; moreover, the generated HTML stream can’t “link” to physical images, WOFF and CSS files because of the path.
We need to obtain one stream containing the HTML code with everything embedded in it. We do this with very little code for Aspose.Words:
Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
// Aspose.Words.Saving.CssStyleSheetType.Embedded; (a bare enum value is not a valid statement; ExportEmbeddedCss below already embeds the CSS)
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);
This gives us an HTML stream that is right for our purposes (with fonts, CSS, images and so on embedded; see the resulting HTML code of this kind of conversion in the attachment).
How can we get the same result for PDF files?
Hi Valerio,
Thanks for your feedback. You can embed all resources into a single stream using the following code sample. It tunes conversion in such a way that all output is forced to be embedded into the result HTML without external files, and then the result HTML is written into some stream with the code of a custom strategy for saving HTML. Hopefully, it will serve the purpose.
public static void PDFtoHTMLStream()
{
Document doc = new Document(@"F:\ExternalTestsData\36608.pdf");
// tune conversion params
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.SplitIntoPages = false; // force write HTMLs of all pages into one output document
newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);
// We can use some non-existing path as the result file name: all real saving is done
// in our custom method SavingToStream() (it follows this one)
string outHtmlFile = @"Z:\SomeNonExistingFolder\SomeUnexistingFile.html";
doc.Save(outHtmlFile, newOptions);
}
private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
// Here you can use any writable stream; a file stream is taken just as an example
string fileName = @"F:\ExternalTestsData\37544_stream_out.html";
// File.Create truncates an existing file, and the using block makes sure
// the stream is flushed and closed once the HTML has been written
using (Stream outStream = File.Create(fileName))
{
htmlSavingInfo.ContentStream.CopyTo(outStream);
}
}
Please feel free to contact us for any further assistance.
Best Regards,
This works great, thank you!
Only two things more:
1. With this structure we need to use a static stream variable to pass the stream from the caller object to Aspose and back. Would it be possible to pass a ref stream as an argument to the function SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)?
2. Is there some way to set the background image as a background in the HTML (currently it is saved as an img tag and not as a background)?
For us the best way would be the same code as, for example, in Aspose.Words:
Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
// Aspose.Words.Saving.CssStyleSheetType.Embedded; (a bare enum value is not a valid statement; ExportEmbeddedCss below already embeds the CSS)
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);
that saves output stream with entire html file with all embedded…
thank you
Hi Valerio,
As a workaround, for the time being you may consider converting the PDF files to MS DOC/DOCX format and then using Aspose.Words to accomplish your requirements (as in the code above). For further details, please see the article "Convert PDF to DOC or DOCX format".
The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.
Hi, we just downloaded the new version 10.0.0 but we can’t see any new parameter in CustomHtmlSavingStrategy.
We use this syntax:
OptionsH.CustomHtmlSavingStrategy = new Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);
maybe we have to change something… but how?
Hi Valerio,
Thanks for your inquiry. We have investigated your requirement (PDFNEWNET-37952) and would like to update you that the signature of the SavingToStream method cannot be changed just to add an additional parameter, since a delegate of this method must be passed to newOptions.CustomHtmlSavingStrategy.
private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
}
However, your goal (saving several output documents to the same static output stream) can easily be achieved with the existing functionality of Aspose.Pdf. We only need to change the custom handler for output saving a bit. Here is a code snippet that does that; hopefully it will help you to accomplish the task.
// It can be any writable stream, file stream used only as example
static Stream _staticOutStream = File.OpenWrite(@"F:\ExternalTestsData\static_stream_out.html");
public static void PDFtoStaticHTMLStream_37952()
{
Document doc = new Document(@"F:\ExternalTestsData\HelloWorld.pdf");
// Tune conversion params for first saving
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.SplitIntoPages = false; // force write HTMLs of all pages into one output document
newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);
// We can use some non-existing path as the result file name: all real saving is done
// in our custom method SavingToStaticStream() (it follows this one)
string outHtmlFile = @"Z:\SomeNonExistingFolder\HelloWorld.html";
doc.Save(outHtmlFile, newOptions);
// 2) saving one more document in same stream (saving will really take place in SavingToStaticStream() method)
Document doc_2 = new Document(@"F:\ExternalTestsData\Test1.pdf");
// 2.1) tune conversion params
HtmlSaveOptions newOptions2 = new HtmlSaveOptions();
newOptions2.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions2.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions2.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions2.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions2.SplitIntoPages = false; // force write HTMLs of all pages into one output document
newOptions2.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);
// 2.2) start saving itself
outHtmlFile = @"Z:\SomeNonExistingFolder\Test1.html";
doc_2.Save(outHtmlFile, newOptions2);
Console.ReadKey();
}
private static void SavingToStaticStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
Console.WriteLine("Starting saving to static stream of output HTML document '" + htmlSavingInfo.SupposedFileName + "' ...");
byte[] resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.Length];
htmlSavingInfo.ContentStream.Read(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
// Locking ensures that only one thread at a time writes
// to the static stream and avoids interference between
// different threads (if any) saving to the same
// output stream
lock (_staticOutStream)
{
_staticOutStream.Write(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
}
Console.WriteLine("Output HTML document '" + htmlSavingInfo.SupposedFileName + "' has been successfully saved to static stream.");
}
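If a static field is undesirable, one possible alternative is to let the handler capture a local stream in a closure. The delegate signature stays unchanged; the lambda simply refers to a variable of the enclosing method. The following is only a sketch under assumptions: the class name PdfToHtmlConverter, the method name ConvertToHtmlStream and the dummy output path are hypothetical, and it has not been verified against a particular Aspose.Pdf version.

```csharp
using System.IO;
using Aspose.Pdf;

public static class PdfToHtmlConverter // hypothetical helper class
{
    // Converts an already loaded PDF document into one self-contained
    // HTML document and returns it as an in-memory stream.
    public static MemoryStream ConvertToHtmlStream(Document doc)
    {
        // Local stream owned by this call, so no static member is needed
        MemoryStream result = new MemoryStream();

        HtmlSaveOptions options = new HtmlSaveOptions();
        options.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        options.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
        options.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
        options.SplitIntoPages = false; // write all pages into one HTML document

        // The lambda matches the delegate signature but captures 'result',
        // so the converted markup is copied into this call's own stream.
        options.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(
            info => info.ContentStream.CopyTo(result));

        // Assumed dummy path: as in the samples above, it is not expected
        // to be used because the custom strategy does the real saving.
        doc.Save(@"Z:\SomeNonExistingFolder\output.html", options);

        result.Position = 0; // rewind so the caller can read from the start
        return result;
    }
}
```

Because every call creates its own MemoryStream, concurrent conversions do not share state and no lock is required in this variant.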
Please feel free to contact us for any further assistance.
Best Regards,
Yes, this is the method we use…
Our two requests were:
PDFNEWNET-37952: support of a stream parameter in CustomHtmlSavingStrategy.
PDFNEWNET-37953: set the background image as a background in the HTML.
When you wrote:
The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.
we thought it had been solved by adding a stream parameter to Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy,
avoiding a static member…
So 37952 is resolved with no changes and 37953 is still under development, right?
thank you
Hi Valerio,
Hi, any news about PDFNEWNET-37953 ?
Hi Valerio,