PDF to HTML does not render text left to right, line by line

I am converting a PDF to HTML using the code below:


    using Aspose.Pdf;

    string pdfHTML = "";

    protected void PDFtoHTMLStream(string file)
    {
        //Document doc = new Document(@"D:\PDF\input.pdf");
        Document doc = new Document(file);
        // Pages are 1-based, so this copies the first page into a new document
        Page page = doc.Pages[1];
        Document newDoc = new Document();
        newDoc.Pages.Add(page);

        // tune conversion params
        HtmlSaveOptions newOptions = new HtmlSaveOptions();
        newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
        newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
        newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
        newOptions.SplitIntoPages = false; // force the HTML of all pages into one output document
        newOptions.SplitCssIntoPages = false;
        newOptions.DocumentType = HtmlDocumentType.Xhtml;

        newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);
        // We can use a non-existing path as the result file name: all real saving is done
        // in our custom method SavingToStream() (it follows this one)
        string outHtmlFile = @"Z:\SomeNonExistingFolder\SomeUnexistingFile.html";
        newDoc.Save(outHtmlFile, newOptions);
    }

    protected void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
    {
        // Copy the generated markup out of the stream and keep it as a UTF-8 string
        byte[] resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.Length];
        htmlSavingInfo.ContentStream.Read(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
        pdfHTML = System.Text.Encoding.UTF8.GetString(resultHtmlAsBytes);
    }

When the HTML is rendered in the UI it looks fine, but when I read the text with JavaScript it does not come out in the correct order. If I convert the PDF to plain text, the text comes out in the correct order (i.e. line by line and left to right), but when I read the text from the HTML using JavaScript, the order has changed.
I am using nextElementSibling.innerText to get the next div's value.

Is there any way to read the text from the HTML in the same order that the PDF-to-text conversion produces (i.e. line by line and left to right)?

Hi Vatsal,

Thanks for your inquiry. I am afraid Aspose.Pdf does not support HTML text manipulation, so you may need to look for a JavaScript-based solution. Please check this [blog post to search text within HTML](https://blog.codecentric.de/en/2013/11/javascript-search-text-html-page/); hopefully it will help you to accomplish the task.
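One possible client-side approach, offered only as a rough sketch: instead of walking siblings with nextElementSibling, collect the text fragments together with their on-screen positions and sort them top to bottom, then left to right. This assumes the generated HTML positions each text fragment independently, so DOM order need not match visual order. The names `sortInReadingOrder`, `readTextInVisualOrder` and the `tolerance` value are hypothetical and not part of Aspose.Pdf. `getBoundingClientRect()` is the standard DOM call.

```javascript
// Hypothetical helper: groups fragments into lines by vertical position,
// then orders each line left to right.
// `tolerance` (in px) is an assumed value; tune it to the document's line height.
function sortInReadingOrder(fragments, tolerance = 3) {
  // Sort top to bottom first so lines can be grouped in a single pass.
  const byTop = [...fragments].sort((a, b) => a.top - b.top);
  const lines = [];
  for (const frag of byTop) {
    const current = lines[lines.length - 1];
    // Start a new line when the fragment sits clearly below the current line.
    if (!current || Math.abs(frag.top - current.top) > tolerance) {
      lines.push({ top: frag.top, items: [frag] });
    } else {
      current.items.push(frag);
    }
  }
  // Within each line, order fragments left to right and join their text.
  return lines.map(line =>
    line.items
      .sort((a, b) => a.left - b.left)
      .map(f => f.text)
      .join(" ")
  );
}

// Browser-only wrapper. Assumption: each text fragment is a leaf element under `root`.
function readTextInVisualOrder(root) {
  const fragments = [];
  for (const el of root.querySelectorAll("*")) {
    const text = el.textContent.trim();
    if (el.children.length === 0 && text) {
      const rect = el.getBoundingClientRect();
      fragments.push({ text, top: rect.top, left: rect.left });
    }
  }
  return sortInReadingOrder(fragments);
}
```

In the page, something like `readTextInVisualOrder(document.body)` would return an array of lines. Multi-column layouts or rotated text would need extra handling that this sketch does not attempt.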

Best Regards,