Convert PDF to HTML and extract content from the HTML

Hi Team, I need your help with the scenario below.
The goal is to convert a PDF to HTML. I have converted the PDF to HTML using Aspose.PDF. Below is the code in C#:

// Load the PDF from the request's file bytes
Document pdf = new Document(new MemoryStream(convertRequest.FileData));
HtmlSaveOptions conversionOptions = new HtmlSaveOptions(HtmlDocumentType.Html5, true);

// Embed all resources into a single HTML file
conversionOptions.PartsEmbeddingMode = PartsEmbeddingModes.EmbedAllIntoHtml;
conversionOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
conversionOptions.RasterImagesSavingMode = RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
// outStream is declared elsewhere (not shown); it must be writable and seekable, e.g. a MemoryStream
pdf.Save(outStream, conversionOptions);
// Rewind the stream and read the generated HTML back as a string
outStream.Position = 0;
StreamReader reader = new StreamReader(outStream);
string convertedHtml = reader.ReadToEnd();
response = convertedHtml;

However, the converted HTML is just a group of divs and spans, and all the divs have the same class name, so there is no unique identifier for accessing an individual div. How can I get unique ids?

Once converted, I need to do some manipulation of the HTML:
1. Highlight or draw a bounding box around a particular paragraph or piece of text. (Usage: in the UI the user can search for text, and I need to draw a bounding box around the section that contains it.)
2. Add a page number to each page. Although the PDF has page numbers, I cannot locate a page in the HTML by its page number, so I need to add them in a custom manner (or add any HTML fragment we can manipulate, if unique ids can be added to the DOM). [NOTE: I tried adding an HTML fragment, saving as PDF, and then converting to HTML, but those tags are hidden in the converted HTML.]
3. Extract the TOC or any paragraph content from the HTML.

Thanks in advance.

@Vijayalakshmisridharan

Thanks for contacting support. Please note that Aspose.PDF specializes in PDF documents. It cannot handle or manipulate HTML files, so you will need another library to modify the HTML document.
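
As a rough illustration of what such post-processing could look like without any HTML library, here is a minimal sketch that adds unique ids to the divs in the converted HTML string using only the .NET base class library. The class name `DivIdTagger`, the method `AddIds`, and the `blk-` prefix are hypothetical names chosen for this example; they are not part of any Aspose API. It works on the raw text rather than a parsed DOM, so it assumes attribute values contain neither `>` nor the text ` id=`; a real DOM library is the more robust choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose: tags divs in an HTML string with unique ids.
public static class DivIdTagger
{
    // Matches the start of an opening <div ...> tag that has no id attribute yet.
    // The lookahead scans up to the closing '>' for a whitespace-preceded "id=".
    private static readonly Regex DivWithoutId = new Regex(
        @"<div\b(?![^>]*\sid\s*=)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the HTML with id="<prefix>N" added to every div that lacks an id.
    // Divs that already carry an id are left unchanged.
    public static string AddIds(string html, string prefix = "blk-")
    {
        int next = 0;
        return DivWithoutId.Replace(html, m => $"{m.Value} id=\"{prefix}{next++}\"");
    }
}
```

With the conversion code from the question, this would be called as `string taggedHtml = DivIdTagger.AddIds(convertedHtml);`. The prefix keeps each id from starting with a digit, which makes the ids easier to use in CSS selectors.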

If you also have an Aspose.HTML license, we can log an investigation ticket in our issue tracking system to check the feasibility of your scenario with that API. Can you please share your sample PDF document along with the obtained and expected output HTML? We will investigate whether your requirements can be achieved.

Hi Asad Ali,

Please find attached the sample PDF file and the HTML file obtained by converting it with Aspose.PDF. We have licenses for all Aspose products.

1. The expected HTML should contain unique ids for all divs.

2. A page number at the bottom of each page.

3. Extract the table of contents from the PDF.

Thanks,

VIJAYALAKSHMI

Apple_10k_2022.pdf (712 KB)

(Attachment Apple_html.html is missing)

Hi,

Please find the files attached.

Thanks,

VIJAYALAKSHMI

Apple.zip (721 KB)

@Vijayalakshmisridharan

Thanks for sharing the sample files. We have logged an investigation ticket as HTMLNET-4918 in our issue tracking system to determine whether your requirements can be achieved using Aspose.HTML. We will look into the details and keep you posted on the status of the ticket. Please allow us some time.

Hi Asad Ali,

Is there any update on the ticket? It would be helpful if you could let me know the progress.

Thanks,

VIJAYALAKSHMI

@Vijayalakshmisridharan

The ticket was logged only recently and concerns a feature investigation. It will be analyzed and resolved on a first come, first served basis. As soon as we make progress towards its resolution, we will update you via this forum thread. Please allow us some time.

We are sorry for the inconvenience.

@Vijayalakshmisridharan

Please check whether the code snippet below fulfills your requirements:

using Aspose.Html;
using Aspose.Html.Converters;
using Aspose.Html.Dom;
using Aspose.Html.Dom.Traversal.Filters;
using Aspose.Html.Rendering.Image;
using Aspose.Html.Saving;
using Aspose.Html.Services;

namespace ExampleFindText;

internal class Program
{
    private static void Main(string[] args)
    {
        using (var configuration = new Configuration())
        {
            // Get the User Agent service
            var userAgent = configuration.GetService<IUserAgentService>();

            // Define the page margins and put a page counter into the margin boxes.
            // The page number appears on each page only when the document is rendered with Aspose.HTML.

            userAgent.UserStyleSheet = @"@page 
                                {
                                    /* Page margins should be not empty in order to write content inside the margin-boxes */
                                    margin-top: 1cm;
                                    margin-left: 2cm;
                                    margin-right: 2cm;
                                    margin-bottom: 2cm;
                                    /* Page counter located at the bottom of the page */
                                    @bottom-right
                                    {
                                        -aspose-content: ""Page "" currentPageNumber() "" of "" totalPagesNumber();
                                        color: green;
                                    }

                                    @bottom-left 
                                    {
                                        content: ""Page "" counter(page) "" of "" counter(pages);
                                    }
                                }";


            // Load the converted HTML file using the prepared configuration
            using (var document = new HTMLDocument("c:\\temp\\Apple_html.html", configuration))
            {
                // To navigate the HTML, create an instance of TreeWalker.
                // These parameters start the walk at the document root and iterate over all nodes.
                using (var iterator = document.CreateTreeWalker(document, NodeFilter.SHOW_ALL))
                {
                    int id = 0;
                    while (iterator.NextNode() != null)
                    {
                        var currentNode = iterator.CurrentNode;
                        if (currentNode.LocalName == "div")
                        {
                            // Assign a unique, sequential id to every div
                            ((HTMLDivElement)currentNode).SetAttribute("id", id.ToString());
                            id++;
                        }
                    }
                }

                using (var iterator = document.CreateTreeWalker(document, NodeFilter.SHOW_TEXT))
                {
                    while (iterator.NextNode() != null)
                    {
                        var currentNode = iterator.CurrentNode;

                        // Find the text node that contains the search text
                        if (currentNode.TextContent.Contains("TABLE OF CONTENTS"))
                        {
                            // Climb to the enclosing element whose class is "stl_view";
                            // stop at the root if no such ancestor exists
                            var parent = currentNode.ParentElement;
                            while (parent != null && parent.GetAttribute("class") != "stl_view")
                            {
                                parent = parent.ParentElement;
                            }

                            // Highlight the match with a red border around the parent of that element.
                            // The final highlighting method depends on the styles and structure of the particular document.
                            parent?.ParentElement?.SetAttribute("style", "border: 1px solid red;");
                        }
                        }
                    }
                }

                using (var iterator = document.CreateTreeWalker(document, NodeFilter.SHOW_TEXT))
                {
                    while (iterator.NextNode() != null)
                    {
                        var currentNode = iterator.CurrentNode;

                        // Find the text node that marks the TOC
                        if (currentNode.TextContent.Contains("TABLE OF CONTENTS"))
                        {
                            var parent = currentNode.ParentElement;
                            while (parent != null && parent.GetAttribute("class") != "stl_view")
                            {
                                parent = parent.ParentElement;
                            }

                            // parent is now the root div that contains the TOC (null if none was found)
                            var toc = parent?.TextContent;
                            System.Console.WriteLine(toc);
                        }
                    }
                }

                // Saving to HTML keeps the highlight style of the matched element, but the page-number styles from the user style sheet are not written to the file.
                document.Save("c:\\temp\\testFindText.html");
                var opt = new ImageSaveOptions(ImageFormat.Jpeg);
                // Rendering to another format, such as an image, adds the page numbers.
                // The file extension matches the JPEG format selected above.
                Converter.ConvertHTML(document, opt, "c:\\temp\\testImage.jpg");
            }
        }
    }
}

If you still notice any issues, please share more details along with the expected output HTML so that we can proceed with the investigation.

The issues you have found earlier (filed as HTMLNET-4918) have been fixed in this update. This message was posted using Bugs notification tool by avpavlysh