Highlight text in a PDF document and convert each page of the document into an HTML string

Hi,

My requirement is to highlight certain paragraphs in a PDF document and convert each page of the highlighted document into an HTML string.
I have done the highlighting part, and it is working.
But converting each page of the PDF document into an HTML string and viewing it in a browser is not working as expected.

Below is the code I use to convert the PDF document to HTML. The highlighting is already done and that code is rather lengthy, so I am not sharing it here.

    // pageNumber is declared outside this snippet.
    byte[] byteData = null;
    int pageCount = doc.Pages.Count;
    for (int page = 0; page < pageCount; page++)
    //foreach (Page page in pdfFile.Pages)
    {
        using (MemoryStream pageStream = new MemoryStream())
        {
            // Save each page as a separate document.
            //Page extractedPage = page;

            // doc.Pages is 1-based, hence page + 1.
            Aspose.Pdf.Document extractedPage = new Aspose.Pdf.Document();
            extractedPage.Pages.Add(doc.Pages[page + 1]);
            HtmlSaveOptions htmlOptions = new HtmlSaveOptions();

            htmlOptions.FixedLayout = true;
            htmlOptions.PartsEmbeddingMode = Aspose.Pdf.HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
            htmlOptions.RasterImagesSavingMode = Aspose.Pdf.HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
            htmlOptions.RemoveEmptyAreasOnTopAndBottom = true;
            htmlOptions.SplitIntoPages = false;
            htmlOptions.SplitCssIntoPages = false;
            string cssprefix = "aspose_pdf" + page;
            htmlOptions.CssClassNamesPrefix = cssprefix;
            //htmlOptions.HtmlMarkupGenerationMode = Aspose.Pdf.HtmlSaveOptions.HtmlMarkupGenerationModes.WriteAllHtml;

            extractedPage.Save(pageStream, htmlOptions);
            //pdfFile.Save(pageStream, htmlOptions);
            var pageBytes = pageStream.ToArray();

            // Keep the first page when pageNumber is 0,
            // otherwise keep the page whose 1-based number equals pageNumber.
            if ((pageNumber == 0) && (page == 0))
            {
                byteData = pageBytes;
            }
            if (pageNumber == page + 1)
            {
                byteData = pageBytes;
            }
        }
    }
    string HtmlString = byteData.ProcessHtml();

When I view HtmlString in a browser, the output I get is:

output.PNG (62.8 KB)

May I know why this happens? Only the highlight colour is visible, not the text.
The sample document is:

whitepaper.pdf (335.7 KB)

@pooja.jayan

Can you please share the PDF document in which you have highlighted the paragraphs? We will convert it into HTML in our environment and share our feedback with you.

Hi,
The sample PDF document is whitepaper.pdf (335.7 KB)

I used the following code for highlighting:

    HighlightAnnotation ha = new HighlightAnnotation(item.Page, item.Rectangle);

    ha.Color = Color.Yellow;

    pdfFile.Pages[item.Page.Number].Annotations.Add(ha);

The result I am getting is somewhat like this:
result.PNG (32.4 KB)

@pooja.jayan

In the shared PDF document, we could not see the highlighted text. Please share the PDF that you obtained after highlighting paragraphs.

Hi,
This is the highlighted PDF document:
PDF_with_Highlighted_Text.pdf (299.6 KB)

@pooja.jayan

We have tested the scenario in our environment and were able to reproduce a similar issue. We have logged it as PDFNET-50941 in our issue management system. We will look further into its details and keep you posted on the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

Hi,

Any updates?

Also, may I know the reason behind this? When a document with a highlighted paragraph is saved, converted to HTML, and viewed in a browser, why does the text disappear, leaving only the highlight colour?

@pooja.jayan

It seems that, while converting the PDF to HTML, the API adds a CSS property to the text, i.e. visibility: hidden;. This is why the text is not visible. We will investigate this behavior of the API further and let you know as soon as it is fixed.
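
Until the ticket is resolved, one possible stopgap is to post-process the generated HTML string and strip the `visibility: hidden` declaration before sending it to the browser. The sketch below is an assumption, not an Aspose API and not something verified against the attached document: `HtmlVisibilityPatch` and `RemoveHiddenVisibility` are hypothetical names, and the replacement is deliberately blunt, so it also reveals anything that was hidden on purpose.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): removes every
// "visibility: hidden" declaration from an HTML/CSS string.
public static class HtmlVisibilityPatch
{
    // Matches the declaration plus an optional trailing semicolon,
    // tolerating whitespace around the colon and ignoring case.
    private static readonly Regex HiddenVisibility = new Regex(
        @"visibility\s*:\s*hidden\s*;?",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string RemoveHiddenVisibility(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        // Plain text replacement: it does not parse the HTML, so it also
        // rewrites matching text outside of style attributes and CSS rules.
        return HiddenVisibility.Replace(html, string.Empty);
    }
}
```

With the conversion code from the first post, this could be applied as `string patched = HtmlVisibilityPatch.RemoveHiddenVisibility(HtmlString);`. Whether the text then renders in the right position depends on the rest of the generated markup, so the result should be checked page by page.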

Hi,
Thank you for your response.

I would like to know one more thing: what is the annotation Flatten() method, and what is it used for?

When I call highlightAnnotation.Flatten(), the text is no longer split into two, although I don't know whether that is the reason. But it does not work in all cases.

@pooja.jayan

The Flatten() method removes or disables the annotation, i.e. it becomes inaccessible and the user can no longer select it when opening the PDF in any viewer. Similarly, the Document.Flatten() method removes the form fields and places their values at the same locations instead.

Hi,

Thank you for your response.

Is there any update on the issue I mentioned earlier, i.e. removing the visibility:hidden style from the span elements or preventing the API from adding it? Can we prevent this behaviour with any of the available HtmlSaveOptions?

@pooja.jayan

We are afraid that the earlier logged ticket has not yet been analyzed and is pending review. Please note that it has been logged under the free support model and will be investigated and resolved on a first come, first served basis. We will surely let you know as soon as we make some definite progress towards ticket resolution. Please spare us some time.

We are sorry for the inconvenience.

Hi,

Any updates?

@pooja.jayan

We are afraid that there are no updates at the moment about ticket resolution. We will surely investigate and resolve it after clearing the issues logged prior to it and let you know in this forum thread when additional updates are available. We highly appreciate your patience in this matter.

We are sorry for the inconvenience.

Hi,

Any updates? How long will it take?

@pooja.jayan

We are afraid that we are not in a position to share a reliable ETA. Please note that, as shared earlier, issues are resolved on a first come, first served basis in the free support model. The resolution time of an issue depends upon its nature and complexity as well as the number of issues logged prior to it, unlike priority support, where issues are dealt with precedence. We will surely inform you once we have definite news about ticket resolution.

We apologize for the inconvenience.