Hello, we need to convert PDF files to an HTML stream, so we need to return one stream for the HTML and one or more streams for the CSS, fonts, etc.
Is there a way to do this? We tried some of your examples but could not reach our goal.
We cannot use any physical path.
Thanks
Hi Valerio,
saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);
You must ensure that your custom code returns the desired URIs of the resources. For example, fonts are referenced in the CSS, so for the resulting CSS to refer to a saved font, the saving method must return a URI, which the converter then puts into the CSS body to reference that font.
public static void SavingOfAllPageHtmlsApart()
{
Document doc = new Document(@"C:\PDFTest\NimbusSRP.pdf");
// Note that we pass a non-existing path here: since we use custom resource processing, it won't be used.
// If you forget to implement one of the required saving strategies (CustomHtmlSavingStrategy, CustomResourceSavingStrategy, CustomCssSavingStrategy), saving will throw a "Path not found" exception
string outHtmlFile = @"T:\SomeNonExistingFolder\NimbusSRP.html";
// Create HtmlSaveOptions with custom saving strategies that will do all the saving work
// With this approach you can split the HTML into pages if you wish
HtmlSaveOptions saveOptions = new HtmlSaveOptions();
saveOptions.SplitIntoPages = true;
saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(StrategyOfSavingHtml);
saveOptions.CustomResourceSavingStrategy = new HtmlSaveOptions.ResourceSavingStrategy(CustomSaveOfFontsAndImages);
saveOptions.CustomStrategyOfCssUrlCreation = new HtmlSaveOptions.CssUrlMakingStrategy(CssUrlMakingStrategy);
saveOptions.CustomCssSavingStrategy = new HtmlSaveOptions.CssSavingStrategy(CustomSavingOfCss);
saveOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
doc.Save(outHtmlFile, saveOptions);
Console.WriteLine("Done");
Console.ReadLine();
}
private static void StrategyOfSavingHtml(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
// Read the page's HTML from the stream supplied by the converter
System.IO.BinaryReader reader = new BinaryReader(htmlSavingInfo.ContentStream);
byte[] htmlAsByte = reader.ReadBytes((int)htmlSavingInfo.ContentStream.Length);
Console.WriteLine("Html page processed with handler. Length of page's text in bytes is " + htmlAsByte.Length);
// Here you can put code that saves the page's HTML to some storage, e.g. a database
MemoryStream targetStream = new MemoryStream();
targetStream.Write(htmlAsByte, 0, htmlAsByte.Length);
}
private static string CssUrlMakingStrategy(HtmlSaveOptions.CssUrlRequestInfo requestInfo)
{
string template = "style{0}.css";
// one more example of template :
//string template = "http://localhost:24661/document-viewer/GetResourceForHtmlHandler?documentPath=Deutschland201207Arbeit.pdf&resourcePath=style{0}.css&fileNameOnly=false";
return template;
}
private static void CustomSavingOfCss(HtmlSaveOptions.CssSavingInfo resourceInfo)
{
System.IO.BinaryReader reader = new BinaryReader(resourceInfo.ContentStream);
byte[] cssAsBytes = reader.ReadBytes((int)resourceInfo.ContentStream.Length);
Console.WriteLine("Css page processed with handler. Length of css in bytes is " + cssAsBytes.Length);
// Here you can put code that saves the CSS to some storage, e.g. a database
MemoryStream targetStream = new MemoryStream();
targetStream.Write(cssAsBytes, 0, cssAsBytes.Length);
}
private static string CustomSaveOfFontsAndImages(Aspose.Pdf.SaveOptions.ResourceSavingInfo resourceSavingInfo)
{
System.IO.BinaryReader reader = new BinaryReader(resourceSavingInfo.ContentStream);
byte[] resourceAsBytes = reader.ReadBytes((int)resourceSavingInfo.ContentStream.Length);
if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Font)
{
Console.WriteLine("Font processed with handler. Length of content in bytes is " + resourceAsBytes.Length);
// Here you can put code that saves the font to some storage, e.g. a database
MemoryStream targetStream = new MemoryStream();
targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
}
else if (resourceSavingInfo.ResourceType == Aspose.Pdf.SaveOptions.NodeLevelResourceType.Image)
{
Console.WriteLine("Image processed with handler. Length of content in bytes is " + resourceAsBytes.Length);
// Here you can put code that saves the image to some storage, e.g. a database
MemoryStream targetStream = new MemoryStream();
targetStream.Write(resourceAsBytes, 0, resourceAsBytes.Length);
}
// We must return the URI by which the resource will be referenced in the CSS (for fonts)
// or in the HTML (for images).
// This is a very simplistic approach: here we just return the file name of the resource.
// You could instead return a URI that includes the ID of the resource in a database, etc.
// This URI will be added to the resulting CSS or HTML to refer to the resource.
return resourceSavingInfo.SupposedFileName;
}
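As a rough illustration of the last comments in the handler above, here is a minimal sketch of a resource handler that keeps fonts and images in memory and returns a URI containing a generated ID. The class name InMemoryResourceStore and the "/resources/" URL prefix are assumptions made up for this example and are not part of Aspose.Pdf; only ResourceSavingInfo, ContentStream and SupposedFileName are taken from the code above. Something in your application would still have to serve the stored bytes at the returned URIs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Aspose.Pdf;

// Hypothetical helper (not part of Aspose.Pdf): stores resources in memory
// and hands back a URI that the converter writes into the CSS or HTML.
public class InMemoryResourceStore
{
    // Assumed URL prefix; replace it with whatever route serves the resources.
    private const string UrlPrefix = "/resources/";

    private readonly Dictionary<string, byte[]> _resources = new Dictionary<string, byte[]>();
    private readonly object _sync = new object();

    // Matches the signature of HtmlSaveOptions.ResourceSavingStrategy.
    public string SaveResource(SaveOptions.ResourceSavingInfo info)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            // Copy the font or image supplied by the converter into memory.
            info.ContentStream.CopyTo(buffer);

            // Build a unique ID that keeps the original file extension.
            string id = Guid.NewGuid().ToString("N") + Path.GetExtension(info.SupposedFileName);

            lock (_sync)
            {
                _resources[id] = buffer.ToArray();
            }

            // The returned string is what ends up in the CSS (fonts) or HTML (images).
            return UrlPrefix + id;
        }
    }

    // Looks up a stored resource by the ID part of its URI.
    public bool TryGet(string id, out byte[] content)
    {
        lock (_sync)
        {
            return _resources.TryGetValue(id, out content);
        }
    }
}
```

It could be wired in with `saveOptions.CustomResourceSavingStrategy = new HtmlSaveOptions.ResourceSavingStrategy(store.SaveResource);`, where `store` is an instance of the class above.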
Please feel free to contact us for any further assistance.
Best Regards,
OK, we’ll try your example. But let me explain: we are looking for a solution that works the same way as for MS Office files, i.e. saving PDF files as HTML with images, CSS and fonts in WOFF format embedded (as base64).
Is that possible with PDF?
Thank you.
Hi Valerio,
Yes, you’re right: we’d like to obtain one HTML file with all parts embedded (WOFF fonts, CSS, images…).
We’re trying to use your example and we have read the documentation, but we can’t reach our goal.
At the end of its work, your method SavingOfAllPageHtmlsApart() executes
doc.Save(outHtmlFile, saveOptions);
It doesn’t save any real file, so how can we save the final converted HTML file?
Our method is the same as your SavingOfAllPageHtmlsApart() (and we use your suggested methods for the saving strategies), but at the end we need to return a stream containing the final HTML with everything embedded (another object will then put it in an HTML viewer or something similar).
Can you help us?
Thanks a lot
Hi Valerio,
We checked the documentation but no example matches our scenario: we need to convert a PDF into one single stream with everything embedded (as we do with Aspose.Words for DOCX files). Sorry for the previous misunderstanding…
The example in your documentation works, but it creates all the accessory files (images, WOFF, CSS…) in a physical path, and we can’t do that; moreover, the generated HTML stream can’t “link” to physical images, WOFF and CSS files because of the path.
We need to obtain one stream containing the HTML code with everything embedded in it. We do this with very little code for Aspose.Words:
Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
Aspose.Words.Saving.CssStyleSheetType.Embedded;
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);
This gives us an HTML stream suitable for our purposes (with fonts, CSS, images and so on embedded; see the HTML produced by this kind of conversion in the attachment).
How can we get the same result for PDF files?
Hi Valerio,
public static void PDFtoHTMLStream()
{
Document doc = new Document(@"F:\ExternalTestsData\36608.pdf");
// tune conversion params
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.SplitIntoPages = false;// force write HTMLs of all pages into one output document
newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);
// We can use a non-existing path as the result file name: all real saving is done
// in our custom method SavingToStream() (it follows this one)
string outHtmlFile = @"Z:\SomeNonExistingFolder\SomeUnexistingFile.html";
doc.Save(outHtmlFile, newOptions);
}
private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
// Here you can use any writable stream; a file stream is used just as an example
string fileName = @"F:\ExternalTestsData\37544_stream_out.html";
// File.Create truncates any existing file, and the using block flushes and closes the stream
using (Stream outStream = File.Create(fileName))
{
htmlSavingInfo.ContentStream.CopyTo(outStream);
}
}
Please feel free to contact us for any further assistance.
Best Regards,
This works great, thank you!
Just two more things:
1. With this structure we need to use a static stream variable to pass the stream from the caller object to Aspose and back. Would it be possible to pass a ref stream as an argument to the function SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)?
2. Is there some way to have the background image set as a background in the HTML (currently it is saved as an img tag and not as a background)?
For us the best way would be the same code as, for example, in Aspose.Words:
Aspose.Words.Saving.HtmlFixedSaveOptions OptionsW = new Aspose.Words.Saving.HtmlFixedSaveOptions();
Aspose.Words.Saving.CssStyleSheetType.Embedded;
OptionsW.ShowPageBorder = false;
OptionsW.UseAntiAliasing = true;
OptionsW.ExportEmbeddedFonts = true;
OptionsW.ExportEmbeddedCss = true;
OptionsW.ExportEmbeddedImages = true;
((Aspose.Words.Document)m_oDocument).Save(_oOutputStream, OptionsW);
which saves the entire HTML file, with everything embedded, to the output stream…
Thank you
Hi Valerio,
As a workaround, for the time being you may consider converting the PDF files to MS DOC/DOCX format and then using Aspose.Words to accomplish your requirements (as shown in the code above). For further details, please visit Convert PDF to DOC or DOCX format
The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.
This message was posted using Notification2Forum from Downloads module by Aspose Notifier.
Hi, we just downloaded the new version 10.0.0 but we can’t see any new parameter in CustomHtmlSavingStrategy.
We use this syntax:
OptionsH.CustomHtmlSavingStrategy = new Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);
Maybe we have to change something… but how?
Hi Valerio,
private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
}
However, your goal (saving several output documents to the same static output stream) can be easily achieved with the existing functionality of Aspose.Pdf. We only need to change the custom output-saving handler a little. Here is a code snippet that does that; hopefully it will help you accomplish the task.
// it can be any writable stream, file stream used only as example
static Stream _staticOutStream = File.OpenWrite(@"F:\ExternalTestsData\static_stream_out.html");
public static void PDFtoStaticHTMLStream_37952()
{
Document doc = new Document(@"F:\ExternalTestsData\HelloWorld.pdf");
// tune conversion params for first saving
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.SplitIntoPages = false;// force write HTMLs of all pages into one output document
newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);
// We can use a non-existing path as the result file name: all real saving is done
// in our custom method SavingToStaticStream() (it follows this one)
string outHtmlFile = @"Z:\SomeNonExistingFolder\HelloWorld.html";
doc.Save(outHtmlFile, newOptions);
// 2) saving one more document in same stream(saving will really take place in SavingToStaticStream() method)
Document doc_2 = new Document(@"F:\ExternalTestsData\Test1.pdf");
// 2.1)tune conversion params
HtmlSaveOptions newOptions2 = new HtmlSaveOptions();
newOptions2.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions2.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions2.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions2.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions2.SplitIntoPages = false;// force write HTMLs of all pages into one output document
newOptions2.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);
// 2.2)start saving itself
outHtmlFile = @"Z:\SomeNonExistingFolder\Test1.html";
doc_2.Save(outHtmlFile, newOptions2);
//
Console.ReadKey();
}
private static void SavingToStaticStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
Console.WriteLine("Starting saving to static stream of output HTML document '" + htmlSavingInfo.SupposedFileName + "' ...");
byte[] resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.Length];
htmlSavingInfo.ContentStream.Read(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
// Locking ensures that only one thread at a time writes to the static stream
// and avoids interference between different threads (if any)
// while saving to the same output stream
lock (_staticOutStream)
{
_staticOutStream.Write(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
}
Console.WriteLine("Output HTML document '" + htmlSavingInfo.SupposedFileName + "' has been successfully saved to static stream.");
}
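If the static field itself is the main concern, one possible alternative is to pass a lambda as the handler so that it captures a local stream. This is only a sketch under some assumptions: PdfToHtmlConverter and ConvertToHtmlStream are hypothetical names, the Document(Stream) constructor is assumed to be available in your version, and the handler is assumed to be called once because SplitIntoPages is false. The option values and the non-existing output path are the same ones used in the snippets above.

```csharp
using System.IO;
using Aspose.Pdf;

public static class PdfToHtmlConverter
{
    // Hypothetical helper: converts a PDF into one self-contained HTML document
    // and returns it as an in-memory stream, without using any static state.
    public static MemoryStream ConvertToHtmlStream(Stream pdfInput)
    {
        Document doc = new Document(pdfInput);
        MemoryStream output = new MemoryStream();

        HtmlSaveOptions options = new HtmlSaveOptions();
        options.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        options.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
        options.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
        options.SplitIntoPages = false; // write all pages into one output document

        // The lambda captures the local 'output' stream, so no static member is needed.
        options.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(
            info => info.ContentStream.CopyTo(output));

        // As in the snippets above, the path is not used: the handler does the real saving.
        doc.Save(@"Z:\SomeNonExistingFolder\output.html", options);

        output.Position = 0; // rewind so the caller can read from the start
        return output;
    }
}
```

Each call gets its own MemoryStream, so concurrent conversions do not share an output stream and no lock is needed.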
Please feel free to contact us for any further assistance.
Best Regards,
Yes, this is the method we use…
Our two requests were:
PDFNEWNET-37952: support of a stream parameter in CustomHtmlSavingStrategy.
PDFNEWNET-37953: set the background image as a background in the HTML.
When you wrote:
The issues you have found earlier (filed as PDFNEWNET-37952) have been fixed in Aspose.Pdf for .NET 10.0.0.
we thought it had been solved by adding a stream parameter to Aspose.Pdf.HtmlSaveOptions.HtmlPageMarkupSavingStrategy,
avoiding a static member…
So 37952 is solved with no API changes and 37953 is still under development, right?
Thank you
Hi Valerio,
Hi, is there any news about PDFNEWNET-37953?
Hi Valerio,