Hi,
- We need to create HTML from a PDF while retaining its formatting. For example, bullet lists, tables, and headings should be preserved in the HTML as the corresponding tags ("ul", "ol", "li", "table", "td", "h1", "h2"). How can we achieve this with Aspose? Can it be done without post-processing the PDF-to-HTML output? If post-processing is required, how can we identify bullet lists (ordered and unordered) and the other HTML structures?
- When creating HTML from a PDF that has a Table of Contents, we expect clicking a TOC entry to navigate to the corresponding page. How can this be achieved?
We will be using the latest .NET packages.
@starlindanielraj
Cause
To convert a PDF to HTML while retaining formatting such as bullet points, tables, and headings, you can use Aspose.PDF for .NET. The library provides options to control how the HTML is generated, ensuring that the structure of the original PDF is maintained.
Solution
- HTML Conversion with Formatting: Achieve the desired formatting in the HTML output by using the HtmlSaveOptions class. This allows you to specify how elements like lists and tables should be represented in the HTML.
- Table of Contents Navigation: For the Table of Contents (TOC) to function correctly, utilize bookmarks in the PDF. When converting, ensure that the bookmarks are preserved, allowing for navigation in the generated HTML.
Code
Here’s a sample code snippet to convert a PDF to HTML while retaining formatting and enabling TOC navigation:
using Aspose.Pdf;

private static void ConvertPDFtoHTML()
{
    // Load the PDF document
    Document pdfDocument = new Document("input.pdf");

    // Create HtmlSaveOptions
    HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions
    {
        // Enable splitting into multiple pages
        SplitIntoPages = true,
        // Ensure that bookmarks are included for TOC navigation
        ExportBookmarks = true
    };

    // Save the document as HTML
    pdfDocument.Save("output.html", htmlSaveOptions);
}
Next Steps
- Run the provided code in your .NET environment to convert your PDF.
- Check the generated HTML to ensure that the formatting and TOC navigation work as expected.
- If you need to identify specific elements like bullet points or tables for further processing, implement additional logic to parse the HTML after conversion.
This approach will effectively convert your PDF to HTML while preserving the necessary formatting and functionality. If you have any further questions or need assistance, feel free to ask!
The response looks AI-generated and it is not correct. ExportBookmarks is not available; we use Aspose.PDF, Version=25.7.0.0. We tried HtmlSaveOptions, but list items are emitted as div and span tags instead of li, ol, ul.
Sample one: <div class="stl_01" style="left:8.6927em;top:48.8725em;"><span class="stl_45 stl_15 stl_13" style="word-spacing:0.3723em;">• Reorient</span><span class="stl_45 stl_15 stl_09" style="word-spacing:0.0005em;"> </span><span class="stl_45 stl_15 stl_13" style="word-spacing:0.0002em;">or relocate the receiving antenna. </span></div>
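For readers considering the post-processing route, here is a minimal sketch of one way list items might be recognized in output of the kind shown above. It does not use any Aspose API; it only inspects the generated HTML with standard .NET classes. The names ListKind, ListItemHeuristics, Classify, and ClassifyAll are hypothetical, and the rules are assumptions: a line that starts with a bullet glyph such as "•" is treated as an unordered item, and a line that starts with digits followed by "." or ")" is treated as an ordered item. It also assumes each text line sits in its own div containing only spans.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical classification result for one line of converted text.
public enum ListKind { None, Unordered, Ordered }

public static class ListItemHeuristics
{
    // Assumption: each text line is a <div> whose content is only <span> elements.
    private static readonly Regex DivBlock =
        new Regex(@"<div\b[^>]*>(.*?)</div>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");
    // Assumption: these glyphs mark unordered items (bullet, white bullet, small square, triangle, middle dot).
    private static readonly Regex Bullet = new Regex(@"^[\u2022\u25E6\u25AA\u2023\u00B7]\s*(.*)$");
    // Assumption: "1." or "1)" followed by a space marks an ordered item.
    private static readonly Regex Numbered = new Regex(@"^\d+[.)]\s+(.*)$");

    // Classifies the inner HTML of one div and returns the item text without its marker.
    public static (ListKind Kind, string Text) Classify(string divInnerHtml)
    {
        // Drop the span tags, decode entities, and collapse whitespace runs.
        string text = WebUtility.HtmlDecode(AnyTag.Replace(divInnerHtml, ""));
        text = Regex.Replace(text, @"\s+", " ").Trim();

        Match m = Bullet.Match(text);
        if (m.Success) return (ListKind.Unordered, m.Groups[1].Value.Trim());

        m = Numbered.Match(text);
        if (m.Success) return (ListKind.Ordered, m.Groups[1].Value.Trim());

        return (ListKind.None, text);
    }

    // Classifies every div found in an HTML fragment, in document order.
    public static IEnumerable<(ListKind Kind, string Text)> ClassifyAll(string html)
    {
        foreach (Match div in DivBlock.Matches(html))
            yield return Classify(div.Groups[1].Value);
    }
}
```

Consecutive results of the same kind could then be wrapped in "ul" or "ol" with one "li" per item. This is a text heuristic only: it will misclassify ordinary lines that happen to start with a number, it cannot see bullets drawn as images or vector shapes, and it does not merge list items that wrap across several divs.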
@starlindanielraj
These requirements need to be investigated in detail. Can you please share your sample PDF document along with the expected output HTML for our reference? We will log an investigation ticket and share the ID with you.
Documentation - [ 11_4_1 Integration testing ].pdf (724.4 KB)
Documentation - [ 11_4_1 Integration testing ].pdf-aspose.zip (3.2 KB)
Please find attached the PDF and the HTML (zipped). This HTML was generated by Aspose and looks good visually, but the appropriate HTML tags are missing. For instance, the HTML has list items as span tags instead of li, ul, ol. @asad.ali
@starlindanielraj
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-60297
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.