Highlight text in a PDF document and convert each page of the document into an HTML string

pooja.jayan · November 18, 2021, 12:33pm

Hi,

My requirement is to highlight certain paragraphs in a PDF document and convert each page of the highlighted document into an HTML string.
I have done the highlighting part, and it is working.
But converting each page of the PDF document into an HTML string and viewing it in a browser is not working as expected.

Below is the code I used to convert the PDF document to HTML pages. The highlighting is already done and that code is rather lengthy, so I am not sharing it here.

    // doc is the highlighted Aspose.Pdf.Document. pageNumber is set by the calling
    // code (not shown): 0 selects the first page, otherwise it is a 1-based page number.
    byte[] byteData = null;
    int pageCount = doc.Pages.Count;
    for (int page = 0; page < pageCount; page++)
    {
        using (MemoryStream pageStream = new MemoryStream())
        {
            // Save each page as a separate single-page document (Pages is 1-based).
            Aspose.Pdf.Document extractedPage = new Aspose.Pdf.Document();
            extractedPage.Pages.Add(doc.Pages[page + 1]);

            HtmlSaveOptions htmlOptions = new HtmlSaveOptions();
            htmlOptions.FixedLayout = true;
            htmlOptions.PartsEmbeddingMode = Aspose.Pdf.HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
            htmlOptions.RasterImagesSavingMode = Aspose.Pdf.HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
            htmlOptions.RemoveEmptyAreasOnTopAndBottom = true;
            htmlOptions.SplitIntoPages = false;
            htmlOptions.SplitCssIntoPages = false;
            // Give each page its own CSS class prefix.
            string cssprefix = "aspose_pdf" + page;
            htmlOptions.CssClassNamesPrefix = cssprefix;
            //htmlOptions.HtmlMarkupGenerationMode = Aspose.Pdf.HtmlSaveOptions.HtmlMarkupGenerationModes.WriteAllHtml;

            extractedPage.Save(pageStream, htmlOptions);
            var pageBytes = pageStream.ToArray();

            // Keep only the HTML bytes of the requested page.
            if ((pageNumber == 0 && page == 0) || pageNumber == page + 1)
            {
                byteData = pageBytes;
            }
        }
    }
    // ProcessHtml() is not shown in this post; presumably a helper that decodes the bytes into an HTML string.
    string HtmlString = byteData.ProcessHtml();

When I view HtmlString in a browser, the output I get is:

output.PNG (62.8 KB)

May I know why this happens? Only the highlight colour is visible, not the text.
The sample document is:

whitepaper.pdf (335.7 KB)

asad.ali · November 18, 2021, 7:12pm

@pooja.jayan

Can you please share the PDF document in which you have highlighted the paragraphs? We will convert it into HTML in our environment and share our feedback with you.

pooja.jayan · November 19, 2021, 4:11am

Hi,
The sample PDF document is whitepaper.pdf (335.7 KB)

I used the following code for highlighting:

    HighlightAnnotation ha = new HighlightAnnotation(item.Page, item.Rectangle);
    ha.Color = Color.Yellow;
    pdfFile.Pages[item.Page.Number].Annotations.Add(ha);

The result I am getting is somewhat like this:
result.PNG (32.4 KB)

asad.ali · November 19, 2021, 5:00pm

@pooja.jayan

In the shared PDF document, we could not see the highlighted text. Please share the PDF that you obtained after highlighting paragraphs.

pooja.jayan · November 19, 2021, 5:18pm

Hi,
This is the highlighted PDF document:
PDF_with_Highlighted_Text.pdf (299.6 KB)

asad.ali · November 20, 2021, 7:32pm

@pooja.jayan

We have tested the scenario in our environment and were able to reproduce a similar issue. We have logged it as PDFNET-50941 in our issue management system. We will look further into its details and keep you posted on the status of its resolution. Please bear with us in the meantime.

We are sorry for the inconvenience.

pooja.jayan · November 22, 2021, 6:10am

Hi,

Any updates?

Also, may I know the reason behind this? When a highlighted paragraph is saved, converted to HTML, and viewed in a browser, why does the text disappear, leaving only the highlight colour?

asad.ali · November 22, 2021, 4:42pm

@pooja.jayan

It seems that the API adds a CSS property to the text, i.e. visibility: hidden;, while converting the PDF to HTML. This is why the text is not visible. We will investigate this behavior of the API further and let you know as soon as it is fixed.
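
Until the ticket is resolved, one possible stopgap is to post-process the generated HTML string and remove the declaration that hides the text. The following is only a minimal sketch, not an official Aspose recommendation: it assumes the missing text is caused solely by a visibility: hidden declaration, and the class and method names (HiddenTextWorkaround, StripHiddenVisibility) are hypothetical. It uses only System.Text.RegularExpressions, not the Aspose API.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.PDF.
public static class HiddenTextWorkaround
{
    // Matches a "visibility: hidden" declaration with optional whitespace and an
    // optional trailing semicolon. The lookbehind skips longer property names
    // such as "backface-visibility".
    private static readonly Regex HiddenVisibility =
        new Regex(@"(?<![\w-])visibility\s*:\s*hidden\s*;?", RegexOptions.IgnoreCase);

    // Returns the HTML with every "visibility: hidden" declaration removed.
    public static string StripHiddenVisibility(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        return HiddenVisibility.Replace(html, string.Empty);
    }
}
```

It could be applied to the converted page, for example as HiddenTextWorkaround.StripHiddenVisibility(HtmlString). Note that this also reveals any element that was hidden on purpose, so the result should be checked against the original PDF.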

pooja.jayan · November 23, 2021, 9:36am

Hi,
Thank you for your response.

I would like to know one more thing: what is the annotation Flatten() method, and what is it used for?

When I call highlightAnnotation.Flatten(), the text no longer gets split in two, although I am not sure whether that is the cause. It does not work in all cases, though.

asad.ali · November 23, 2021, 4:41pm

@pooja.jayan

The Flatten() method removes or disables the annotation, i.e. it becomes inaccessible and the user can no longer select it when opening the PDF in a viewer. Similarly, the Document.Flatten() method removes the form fields and places their values at the same location instead.

pooja.jayan · November 24, 2021, 6:35am

Hi,

Thank you for your response.

Is there any update on the issue I mentioned earlier, i.e. preventing the API from adding the visibility:hidden style to the span element? Can we prevent this behaviour with any of the available HtmlSaveOptions?

asad.ali · November 24, 2021, 4:59pm

@pooja.jayan

We are afraid that the earlier logged ticket has not yet been analyzed and is pending review. Please note that it has been logged under the free support model and will be investigated and resolved on a first-come, first-served basis. We will let you know as soon as we make definite progress towards resolving the ticket. Please bear with us in the meantime.

We are sorry for the inconvenience.

pooja.jayan · November 30, 2021, 2:01pm

Hi,

Any updates??

asad.ali · November 30, 2021, 6:31pm

@pooja.jayan

We are afraid that there are no updates at the moment about ticket resolution. We will surely investigate and resolve it after clearing the issues logged prior to it and let you know in this forum thread when additional updates are available. We highly appreciate your patience in this matter.

We are sorry for the inconvenience.

pooja.jayan · December 6, 2021, 5:10am

Hi,

Any updates? How long will it take?

asad.ali · December 6, 2021, 5:52pm

@pooja.jayan

We are afraid that we are not in a position to share a reliable ETA. Please note that, as shared earlier, issues are resolved on a first-come, first-served basis under the free support model. The resolution time of an issue depends on its nature and complexity as well as on the number of issues logged before it, unlike priority support, where issues are given precedence. We will inform you once we have definite news about the ticket resolution.

We apologize for the inconvenience.