Pdf->html without external img/font etc. files

ginoleovac · January 28, 2014, 10:51am

Hi,

I’m looking for a way to convert pdf to html with embedded images / fonts / css.

My PDFs are rather simple, text-and-tables kinds of documents, and we use their (X)HTML representation in web services federation exchanges. Managing cross-service file locations and references is rather impractical.

I tried different options in your API, but external files are invariably written. Is there such an option?

Regards,

Gino

codewarior · January 28, 2014, 12:05pm

Hi Gino,

Thanks for contacting support.

I am afraid that converting a PDF to a single Web Archive (MHT) file is currently not supported. However, I have logged this requirement as PDFNEWNET-36340 in our issue tracking system. We will look further into the details of this feature and will keep you updated on its status. Please be patient and allow us a little time. We are sorry for the inconvenience.

ginoleovac · January 29, 2014, 8:00am

Hi Nayyer,

In the meantime, I tried a workaround: a two-step pdf->doc->html conversion via Aspose.Words. The doc->html conversion seems to be working fine, and I’m able to roll up an all-in-one html without external files referenced.

The issue is that I cannot get the pdf->doc(x) conversion to align text and tables properly. The tables appear as images in the document, while the text appears in anchored frame boxes (pdf and doc zip attached). I’m not sure if I’m using the correct save options:

var docXOptions = new Aspose.Pdf.DocSaveOptions
{
    Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX,
    RecognizeBullets = true,
    Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow, //.Textbox
};

Thanks,

Gino

codewarior · January 29, 2014, 10:21am

Hi Gino,

Thanks for sharing the feedback.

When you click the outer border of the table, it appears as an Image object. However, you can still select/click the table contents (the contents inside a table cell). Furthermore, the text appears anchored inside frame boxes because RecognitionMode.Textbox is used as the recognition mode. For your reference, I have also attached the resultant DOCX file generated on my end using Aspose.Pdf for .NET 8.8.0.

ginoleovac · January 29, 2014, 10:32am

Hi Nayyer,

The document looks very good. Could you please provide the exact doc save options you used?

Thanks,

Gino

codewarior · January 29, 2014, 11:21am

Hi Gino,

I have used the following code snippet with Aspose.Pdf for .NET 8.8.0.

[C#]

Document doc = new Document(@"C:\pdftest\test2-deid.rtf.Aspose.pdf");

var docXOptions = new Aspose.Pdf.DocSaveOptions
{
    Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX,
    RecognizeBullets = true,
    Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Textbox,
};

doc.Save(@"C:\pdftest\test2-deid.rtf.Aspose_8_8_0.docx", docXOptions);

ginoleovac · January 31, 2014, 10:37am

I noticed an odd thing - when I open the resulting document w/Word on Mac, it renders correctly. The very same doc in Win Word 2013 shows misalignments. Only when I enforce ‘compatibility’ mode, I get close to acceptable results.

Ultimately, this did not solve the core issue. Word document still treats tables as images, hence, the further transform to html is pretty much useless, as table entries really do not exist any more.

It is pretty much a show stopper; pdf->(x)html as all-in-one roll up cannot work this way. Thanks for your help in assessing all the details.

Regards, Gino.

tilal.ahmad · February 3, 2014, 1:34am

Hi Gino,

Thanks for sharing your findings. We have also noticed the table structure issue when converting PDF to DOCX and then converting the result to HTML, and have logged it as PDFNEWNET-36357 for further investigation and resolution. We will keep you updated on the progress of both issues via this forum thread.

We are truly sorry for the inconvenience caused.

Best Regards,

aspose.notifier · September 5, 2014, 2:31am

The issues you have found earlier (filed as PDFNEWNET-36340) have been fixed in Aspose.Pdf for .NET 9.6.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

mattshelton · September 22, 2014, 2:50pm

v9.6 indeed can write a single, all-in-one html output file.
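
For readers wondering what "all-in-one" can look like inside the markup: one common way to carry an image, font, or stylesheet inside an HTML file is a base64 `data:` URI. The sketch below is in Java (the language also used later in this thread) and relies only on the JDK. It illustrates the general technique and is not a description of how Aspose builds its output; the class and method names (`DataUriDemo`, `toDataUri`) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class DataUriDemo {
    private DataUriDemo() {}

    /** Builds a data: URI that carries the resource bytes inline, base64-encoded. */
    public static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        // A tiny stylesheet stands in for any embedded resource (image, font, css).
        byte[] css = "p { margin: 0; }".getBytes(StandardCharsets.UTF_8);
        String link = "<link rel=\"stylesheet\" href=\"" + toDataUri("text/css", css) + "\"/>";
        System.out.println(link);
    }
}
```

Because the resource travels inside the attribute value, the HTML has no external file references to manage, at the cost of a larger document.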

I tried to use the same settings for writing to the stream, but could not find a winning combination of HtmlSaveOptions properties. Is that feature supported for stream outputs?

Thanks

codewarior · September 23, 2014, 3:09am

Hi Matt,

Thanks for contacting support.

If your requirement is to save the PDF to HTML conversion output to a stream object, please follow the instructions in "PDF to HTML - Save output in Stream object".

In case I have not properly understood your requirement, please share some further details.

mattshelton · September 23, 2014, 7:00am

I was hoping that I could use the same options that produce an all-in-one file to create an all-in-one stream (demo case below). I’m running the extraction in a service architecture, and am trying to avoid disk I/O as much as possible.

I’m not sure I understand how to create the required strategies (CSS saving, resource saving) so that they write their respective sections into the output stream; the return values in the strategy signatures require either a disk or a URL location, which makes little sense if I’m trying to roll up everything into a single stream.

Is such conversion path supported?

	[TestCase( @"..\..\..\_asposeCases\20140729_191959_1206173.pdf.deid.pdf" )] // any pdf file will do
	// https://forum.aspose.com/t/76908
	public void PdfToHtmlStream( string pdfFile ) 
	{ 
		var pdfDoc = new Document( pdfFile );
		var options = new HtmlSaveOptions
			{
				DocumentType = HtmlDocumentType.Xhtml,
				// as of Aspose 9.6 version
				PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml,
				RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground, // only option when .EmbedAllIntoHtml
				LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss,
			};

		var outHtmlFile = pdfFile + ".html";
		Assert.DoesNotThrow( () => pdfDoc.Save( outHtmlFile, options ) );
		
		var outStream = new MemoryStream();
		Assert.DoesNotThrow( () => pdfDoc.Save( outStream, options ) );
		// Aspose exception thrown:
		// (Inconsistent saving options detected : 'CustomStrategyOfCssUrlCreation','CustomCssSavingStrategy','CustomResourceSavingStrategy' may not be null when requested saving to stream!)
	}

codewarior · September 24, 2014, 5:55am

Hi Matt,

Thanks for sharing the details.

I have logged the above stated requirement in our issue tracking system as PDFNEWNET-37544. We will investigate this requirement in detail and will keep you updated on the status of a correction. We apologize for the inconvenience.

aspose.notifier · October 8, 2014, 2:56pm

The issues you have found earlier (filed as PDFNEWNET-37544) have been fixed in Aspose.Pdf for .NET 9.7.0.


mattshelton · October 10, 2014, 8:47am


Hi,

Per the announcement:

The issues you have found earlier (filed as PDFNEWNET-37544) have been fixed in Aspose.Pdf for .NET 9.7.0.

I tried the case again without success, using the example submitted in my previous message in the thread.

Could you provide an example code snippet with the HtmlSaveOptions properties that would allow me to save the PDF to a stream as XHTML?

Thanks

codewarior · October 13, 2014, 8:51am

Hi Matt,

Thanks for sharing the feedback.

This can be achieved with the following code snippet: the output is forced to be embedded into the resulting HTML without external files, and the resulting HTML is then written to a stream by a custom HTML saving strategy:

[C#]

// the enclosing method name is illustrative
public static void ConvertPdfToHtmlStream()
{
    Document doc = new Document(@"F:\ExternalTestsData\36608.pdf");

    // tune conversion params
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.SplitIntoPages = false; // force write HTMLs of all pages into one output document
    newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);

    // we can use some non-existing path as the result file name - all real saving
    // is done in our custom method SavingToStream() (it follows this one)
    string outHtmlFile = @"Z:\SomeNonExistingFolder\SomeUnexistingFile.html";
    doc.Save(outHtmlFile, newOptions);
}

private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    byte[] resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.Length];
    htmlSavingInfo.ContentStream.Read(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);

    // here you can use any writable stream; a file stream is taken just as an example
    string fileName = @"F:\ExternalTestsData\37544_stream_out.html";

    // File.Create truncates an existing file; 'using' flushes and closes the stream
    using (Stream outStream = File.Create(fileName))
    {
        outStream.Write(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
    }
}

PS: the file system is used in the example only so that the result can be viewed in a browser; any writable stream can be used as the receiver of the result.
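
A side note on the callback pattern: the callbacks in this thread size a buffer from the stream length and fill it with one read call. That works for in-memory streams, but for stream types in general a read may return fewer bytes than requested, so a copy loop is the more defensive pattern. Here is a minimal sketch in Java (the language also used later in this thread) that uses only `java.io`; the names `StreamCopy` and `copy` are illustrative and not part of any Aspose API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public final class StreamCopy {
    private StreamCopy() {}

    /** Copies everything left in 'in' to 'out' and returns the number of bytes copied. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() may deliver fewer bytes than the buffer holds; -1 signals end of stream.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>demo</body></html>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // any writable stream works
        long copied = copy(new ByteArrayInputStream(html), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

The same loop works regardless of whether the target is a file, a network response, or an in-memory buffer.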

matucker · July 8, 2015, 3:10pm

Hello,

I am currently evaluating the Aspose.Total product, and am working on a prototype right now.

I have the Words code working properly, but cannot replicate the same approach in PDF. This code seems closest, but at the point of writing a file to disk, I want instead to have a handle to the final html as an OutputStream for return to a client program (this is being executed in a web service). Please advise how I can accomplish that.

Thanks,

Mark Tucker

tilal.ahmad · July 9, 2015, 11:55pm

Hi Mark,

Thanks for your inquiry. You can meet this requirement by saving the output HTML into a static stream. Please check the following sample code.

// it can be any writable stream; a file stream is used only as an example
static Stream _staticOutStream = File.OpenWrite(@"F:\ExternalTestsData\static_stream_out.html");

public static void PDFtoStaticHTMLStream_37952()
{
    Document doc = new Document(@"F:\ExternalTestsData\HelloWorld.pdf");

    // tune conversion params for first saving
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.SplitIntoPages = false; // force write HTMLs of all pages into one output document
    newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStaticStream);

    // we can use some non-existing path as the result file name - all real saving
    // is done in our custom method SavingToStaticStream() (it follows this one)
    string outHtmlFile = @"Z:\SomeNonExistingFolder\HelloWorld.html";
    doc.Save(outHtmlFile, newOptions);
}

private static void SavingToStaticStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    Console.WriteLine("Starting saving to static stream of output HTML document '" + htmlSavingInfo.SupposedFileName + "' ...");

    byte[] resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.Length];
    htmlSavingInfo.ContentStream.Read(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);

    // locking ensures that only one thread at a time writes to the static stream,
    // so different threads (if any) do not interfere while saving to the same output stream
    lock (_staticOutStream)
    {
        _staticOutStream.Write(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
        _staticOutStream.Flush(); // push buffered bytes through to the underlying file
    }

    Console.WriteLine("Output HTML document '" + htmlSavingInfo.SupposedFileName + "' has been successfully saved to static stream.");
}

Please feel free to contact us for any further assistance.

Best Regards,

matucker · July 13, 2015, 12:28pm

Thank you.

Can you please provide a java version of this code?

Thanks,

Mark

tilal.ahmad · July 14, 2015, 10:46pm

Hi Mark,

Thanks for your inquiry. Please check the following Java code snippet, which saves the HTML to a stream with embedded resources. It should help you accomplish the task.

Document doc = new Document(myDir + "Input.pdf");

// tune conversion params
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.setSplitIntoPages(false); // force write HTMLs of all pages into one output document

newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy() {
    @Override
    public void invoke(HtmlPageMarkupSavingInfo htmlSavingInfo) {
        byte[] resultHtmlAsBytes = new byte[(int) htmlSavingInfo.ContentStream.getLength()];
        htmlSavingInfo.ContentStream.read(resultHtmlAsBytes, 0, resultHtmlAsBytes.length);

        // here you can use any writable stream; a file stream is taken just as an example
        // try-with-resources closes the stream even if write() throws
        try (FileOutputStream fos = new FileOutputStream(myDir + "temp/PDFtoHTML.html")) {
            fos.write(resultHtmlAsBytes);
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both
            e.printStackTrace();
        }
    }
};

// we can use some non-existing path as the result file name - all real saving is done in CustomHtmlSavingStrategy
String outHtmlFile = "Z:/SomeNonExistingFolder/SomeUnexistingFile.html";
doc.save(outHtmlFile, newOptions);
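
If the goal is to hand the HTML back to a web service caller without touching the disk, the file stream in the callback can in principle be swapped for an in-memory buffer. The following is only a sketch under assumptions: `HtmlCollector` is a hypothetical helper (not an Aspose class), it uses only the JDK, and it assumes the markup bytes are UTF-8 when converting to a string.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public final class HtmlCollector {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    /** Appends one chunk of markup; synchronized in case callbacks arrive on several threads. */
    public synchronized void append(byte[] chunk) {
        buffer.write(chunk, 0, chunk.length);
    }

    /** Returns a copy of everything collected so far. */
    public synchronized byte[] toBytes() {
        return buffer.toByteArray();
    }

    /** Decodes the collected bytes as UTF-8 (an assumption about the markup encoding). */
    public synchronized String toHtml() {
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        HtmlCollector collector = new HtmlCollector();
        collector.append("<html><body>".getBytes(StandardCharsets.UTF_8));
        collector.append("</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(collector.toHtml());
    }
}
```

With a collector created before the conversion, the body of the callback would reduce to something like `collector.append(resultHtmlAsBytes);`, and the caller could read `collector.toBytes()` once the save call has returned, assuming the callback has run by then.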

Please feel free to contact us for any further assistance.

Best Regards,
