PDF to HTML - Retrieve raw HTML and Handle Images

I am attempting to save a PDF as HTML, get the resulting HTML, and override the logic that saves the images, because I need to handle the images in a very specific manner.


1. I need to be able to get the resulting raw HTML from the save process of converting the PDF to HTML.

For this first requirement: in your other products, such as Aspose.Words, I am able to save to a MemoryStream and do whatever I need from there. It doesn’t have to be a MemoryStream, but I need to be able to read the raw HTML without writing to disk. I keep getting errors when saving a PDF to a MemoryStream in HTML format. Is this at all feasible?

If the above is not possible, how do I go about converting a PDF to a Word document? I already have code in place that does the above from a Word document.

2. I need to be able to dictate where any embedded images are saved. With the Aspose.Words product this is handled by setting an ImageSavingCallback on HtmlSaveOptions. I’ve included the code below, which works in the Aspose.Words namespace. Is there any way I can invoke something similar for PDF saving that I’m missing?

//Aspose.Words method - Working great!
MemoryStream writeStream = new MemoryStream();
HtmlSaveOptions options = new HtmlSaveOptions(SaveFormat.Html);
options.ImageSavingCallback = new AsposeImageSavingCallback();
doc.Save(writeStream, options);

If something like the above is not possible… how can I control how the images are saved and the tags are generated in the resulting save file?


(I found these two links somewhat helpful, but they are not entirely what I need.)
Replace Image in Existing PDF File | Aspose.PDF for .NET

Hi Philip,

Thanks for your inquiry.

philip.betts:

  1. I need to be able to get the resulting raw HTML from the save process of converting the PDF to HTML.
    For this first process, in your other products, such as Words, I am able to save to a memory stream. From there I can do whatever it is I need to. It doesn’t have to be a MemoryStream, but I need to be able to read that raw html, without writing to disk. I keep getting continued errors with saving a Pdf to a MemoryStream as the format of HTML. Is this at all feasible?

If the above is not possible, how do I go about converting a PDF to a Word document? I already have code in place that does the above from a Word document.

I’m afraid Aspose.Pdf currently doesn’t support saving HTML to a MemoryStream. However, we’ve already logged a feature request for this as PDFNEWNET-34748 in our issue tracking system. We’ll update you as soon as it becomes available.

Furthermore, please check the documentation link for details and a code snippet showing how to convert a PDF document to DOC/DOCX.

philip.betts:
2. I need to be able to dictate where any embedded images are saved to. With the Aspose.Words product this is handled in an HtmlSaveOption with an ImageSavingCallback. I’ve included the code below which works in the Aspose.Words namespace, is there any way I can invoke something similar for the PDF saving that I’m missing?

Moreover, specifying the location where images are saved during PDF to HTML conversion is also not supported at the moment. We’ve logged this requirement as PDFNEWNET-35609 in our issue tracking system for further investigation and resolution. We’ll keep you updated on the issue’s progress via this forum thread.

Sorry for the inconvenience faced.

Best Regards,

Thanks for the reply.


Until those logged issues are fixed, I have a follow-up question for you. I am able to save the PDF in Doc format to disk just fine. I open it up and it looks great. However, when attempting to save to a MemoryStream (so that I can open it with Aspose.Words), I receive errors.

What is the proper way to save a PDF into a MemoryStream with the SaveFormat.Doc?

Hi Philip,


Sorry for the delayed response.

I have tested the scenario using Aspose.Pdf for .NET 8.2.0 with the following code lines, and I am unable to reproduce any problem. Can you please share which version of Aspose.Pdf for .NET you are using?

In case you are using the latest release, please share the PDF file which you are trying to convert to DOC format. We are sorry for this inconvenience.

I should have clarified, saving to the MemoryStream returns no immediate visible errors. But if you look at the MemoryStream you can see the issue. [Using: Aspose 8.2.0.0]

I've attached an image of the test code to reproduce what I was talking about. If I were to step over the next line of code, where the Aspose.Words.Document reads the memory stream, I get an error: "The document appears to be corrupted and cannot be loaded."


Hi Philip,


Thanks for providing the additional information. Can you please double-check the license implementation in your code? It seems you are not setting a license for Aspose.Pdf. I get the exception when testing the scenario without a license, but it works fine with the license set properly.

Please feel free to contact us for any further assistance.

Best Regards,
//Initialization - License key
// Load the license bytes from the Aspose_Total resource into a stream
Stream sr = new MemoryStream(MyProduct.MyNamespace.Web.Resources.Aspose_Total);

// Apply the same license to each product, rewinding the stream before each reuse
Aspose.Diagram.License adl = new Aspose.Diagram.License();
adl.SetLicense(sr);

sr.Seek(0, SeekOrigin.Begin);

Aspose.Cells.License acl = new Aspose.Cells.License();
acl.SetLicense(sr);

sr.Seek(0, SeekOrigin.Begin);

Aspose.Pdf.License l = new Aspose.Pdf.License();
l.SetLicense(sr);

sr.Seek(0, SeekOrigin.Begin);

Aspose.Words.License awl = new Aspose.Words.License();
awl.SetLicense(sr);

sr.Seek(0, SeekOrigin.Begin);

Aspose.Slides.License asl = new Aspose.Slides.License();
asl.SetLicense(sr);

// Release the stream once every product has been licensed
sr.Close();
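
The repeated rewind-and-apply steps above could be folded into a small helper. This is only a sketch: `LicenseLoader` and `ApplyToAll` are hypothetical names, not part of any Aspose API, and the helper assumes each product's `SetLicense(Stream)` reads from the current stream position.

```csharp
using System;
using System.IO;

// Hypothetical helper (not an Aspose type): applies one license stream
// to several products, rewinding the stream before each call.
public static class LicenseLoader
{
    public static void ApplyToAll(Stream license, params Action<Stream>[] setters)
    {
        foreach (Action<Stream> set in setters)
        {
            // Rewind so every setter sees the license data from the start
            license.Seek(0, SeekOrigin.Begin);
            set(license);
        }
    }
}
```

With such a helper, the initialization might shrink to a single call along the lines of `LicenseLoader.ApplyToAll(sr, adl.SetLicense, acl.SetLicense, l.SetLicense, awl.SetLicense, asl.SetLicense);`.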

Hi Philip,


Thanks for your feedback. It would help us replicate the issue on our end and investigate it further if you could share a sample application that reproduces it.

Sorry for the inconvenience faced.

Best Regards,
philip.betts:
I should have clarified, saving to the MemoryStream returns no immediate visible errors. But if you look at the MemoryStream you can see the issue. [Using: Aspose 8.2.0.0]

I've attached an image of the test code to reproduce what I was talking about. If I were to step over the next line of code, where the Aspose.Words.Document reads the memory stream, I get an error: "The document appears to be corrupted and cannot be loaded."
Hi Philip,

Adding to Tilal's comments, I have tested the scenario and was able to reproduce the problem: the exception occurs when Aspose.Pdf for .NET is used in trial mode. I have logged it in our issue tracking system as PDFNEWNET-35692. We will investigate this issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

Hi Philip,


Thanks for your patience.

I am pleased to share that the feature of saving images in a separate folder while converting a PDF file to HTML format has been implemented. It will be included in the upcoming release of Aspose.Pdf for .NET 8.6.0, which is planned for release in the next few days. To accomplish this requirement, please try the following code snippet.

[C#]

// Load the source PDF
Document doc = new Document("c:/pdftest/source.pdf");

// Create HtmlSaveOptions and set the folder that receives all images
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.SpecialFolderForAllImages = @"C:\pdftest\htmlresources\";

// Save as HTML using the options above
doc.Save("c:/pdftest/Final-Report.html", newOptions);

The issues you have found earlier (filed as PDFNEWNET-35609) have been fixed in Aspose.Pdf for .NET 8.6.0.


This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

To clarify my second point: while the option available in 8.6.0 may prove beneficial to some, it does not help with my situation. I’m unable to write to disk at all, so having a folder where my images are saved does not help me.


With Aspose.Words, I take advantage of the image saving callback. When the save process hits an image inside the document, the callback fires and I take the image and hand it off to the part of my code that saves it to a database with an associated ID. That ID is then used later when viewing the page. I am able to accomplish this flawlessly with Word, but I need the same functionality for PDF.

When I can no longer hold this off, I will most likely have to generate my PDFs from Aspose.Words, saving the file as a PDF at the end. I would rather avoid this and have the file be as natively created as possible, but I can’t until I have that callback.
Hi Philip,

Thanks for sharing the details.

As per my understanding, you need access to the individual images inside the PDF file and the ability to handle them as required (store them in a database or on the local system), without being limited to saving the image files to disk during conversion. Furthermore, by callback, do you mean some sort of trigger that fires when the parsing engine hits an image? Please share your feedback, as it will help us understand your requirement.

While a PDF is being converted (saved) into HTML, I need some kind of hook or callback that triggers as the converter processes each image. This allows me to control where the file is saved (in my case, to a database). As each image is processed, the callback changes the image’s file name to a specialized URL containing the image’s database ID.


In the working code I have with Aspose.Words, the process works as follows. I upload the document for parsing and pass the upload stream into Aspose by creating a new document from that stream. I set the saving options I need, then I save the Aspose.Words document. As the save runs, it hits the callback whenever it processes an image. Once there, I take the object (it’s a ShapeBase) and pass it to my database calls, which save the image and return its reference ID. I then build a specialized string that represents how that image is referenced in HTML in our system. I set that string as the image’s file name, and handling of that image is finished. The save process continues converting the file to HTML until it hits another image, where the process repeats, and so on until it reaches the end of the document.

The final result is a completely converted document where the images are saved into my database and the resulting html where the images were located has been replaced with my special image reference string.

This is exactly what Aspose.Words already allows me to do, so I’d suggest conferring with that team.
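
The flow described above can be sketched roughly as follows. This is a minimal illustration that uses no Aspose API at all: `IImageStore`, `InMemoryImageStore`, `ImageHook`, and the `/images/{id}` URL format are hypothetical names invented here, and the in-memory list merely stands in for a real database.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical storage abstraction: persists image bytes and returns an ID.
public interface IImageStore
{
    int Save(byte[] imageBytes);
}

// Stand-in for a database table, kept in memory so the sketch is runnable.
public class InMemoryImageStore : IImageStore
{
    private readonly List<byte[]> images = new List<byte[]>();

    public int Count { get { return images.Count; } }

    public int Save(byte[] imageBytes)
    {
        images.Add(imageBytes);
        return images.Count; // 1-based ID of the stored image
    }
}

public static class ImageHook
{
    // Intended to be called once per image by a converter that supports
    // such a hook. Reads the image content, stores it, and returns the
    // string that should replace the image file name in the HTML.
    public static string OnImage(Stream imageContent, IImageStore store)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            imageContent.CopyTo(buffer);
            int id = store.Save(buffer.ToArray());
            return "/images/" + id; // hypothetical reference format
        }
    }
}
```

The point of the design is that nothing touches the disk: the converter hands over a stream, the hook decides where the bytes go, and the returned string is what ends up in the generated markup.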

Hi Philip,

Thanks for sharing the details.

I have logged this requirement in our issue tracking system as PDFNEWNET-36070. We will look further into the details and keep you updated on the status of the implementation. Please be patient and allow us a little time. We are sorry for the inconvenience.

The issues you have found earlier (filed as PDFNEWNET-36070) have been fixed in Aspose.Pdf for .NET 8.9.0.



The issues you have found earlier (filed as PDFNEWNET-35692) have been fixed in Aspose.Pdf for .NET 8.9.1.



The issues you have found earlier (filed as PDFNEWNET-34748) have been fixed in Aspose.Pdf for .NET 9.1.0.

For further details, you may check this blog post.



Hi Philip,


Thanks for your patience. We are pleased to inform you that the requested feature of saving HTML output to a stream object has been implemented. Please check the following documentation link for details.


Please feel free to contact us for any further assistance.

Best Regards,