PDF to HTML conversion generates different HTML (version 11.2.0.0)

dsp0109 · June 15, 2017, 4:44am

We are facing an issue with PDF to HTML conversion; below is the behavior we have observed. In a multi-threaded environment, when we convert a PDF to HTML, Aspose.Pdf generates different HTML from one run to the next. Below is the code we are using to convert PDF to HTML:

var pdfDocument = new Document(filecopyMemoryStream);
lock (pdfDocument)
{
    var textAbsorber = new TextAbsorber();
    pdfDocument.Pages.Accept(textAbsorber);
    pdfDocument.Save(targetHtmlFilePath, SaveFormat.Html);
}

Can anyone please help us: where are we going wrong? Or is it a known issue? The converter sometimes breaks a line of the PDF into separate spans and sometimes does not. We have also tried using the HtmlSaveOptions below:

new HtmlSaveOptions
{
    FixedLayout = true,
}
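
A note on the `lock (pdfDocument)` in the first snippet: each call constructs its own `Document`, so every thread locks a different object and the lock does not make threads wait for one another. If the goal is to check whether fully serialized conversions still differ, all threads have to lock the same object. The following is a minimal sketch of that idea; `ConversionGate` and `Run` are hypothetical names, not part of Aspose.Pdf, and the thread does not confirm that serializing makes the 11.2.0.0 output stable.

```csharp
using System;

// Hypothetical helper (not an Aspose.Pdf API): runs the supplied work
// behind a single process-wide lock so that only one conversion executes at a time.
public static class ConversionGate
{
    // One object shared by all threads. Locking a per-call Document instance
    // would not block other threads, because each thread would hold a different lock.
    private static readonly object Gate = new object();

    public static void Run(Action convert)
    {
        if (convert == null) throw new ArgumentNullException(nameof(convert));

        lock (Gate)
        {
            convert();
        }
    }
}
```

It could be used as `ConversionGate.Run(() => pdfDocument.Save(targetHtmlFilePath, SaveFormat.Html));`, ideally with the `Document` construction inside the delegate as well. This trades throughput for a diagnostic: if the output still varies, the difference is not caused by overlapping conversions.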

imran.rafique · June 15, 2017, 5:29am

Hi,

Thank you for contacting support. Did you find a difference between the results in non-multi-threaded and multi-threaded environments? If so, please describe the difference and share a source PDF with us. We will investigate and share our findings with you. Your response is awaited.

dsp0109 · June 15, 2017, 5:41am

Yes. In the non-multi-threaded case we get similar HTML most of the time, but in the multi-threaded case it always creates different HTML.

Below is a section from run 1:

A section of the PDF is generated as

"

2076.70 125.30

526.10

916.55

647.10

293.60

1.70

"

and the same section from run 2:

"

2076.70 125.30 526.10 916.55 647.10 293.60 1.70

"

We upload a batch of 5 different PDFs, which start converting to HTML. The same batch is then uploaded again and converted to HTML in parallel for all 5 files. The HTML generated for each file differs in structure between the runs, as shown above.

We can't provide you with the PDF files. To give you a brief idea of their structure, they mostly contain data in tabular format.
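
The two runs quoted above appear to contain the same values in the same order and to differ only in how the values are grouped into lines. One way to check that across whole outputs is to compare whitespace-separated tokens rather than raw lines. The following is a minimal sketch, assuming the text has already been extracted from the HTML; `RunComparer`, `Tokens`, and `SameTokens` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper: compares two text fragments while ignoring
// how their content is split across lines.
public static class RunComparer
{
    // Splits on any whitespace (spaces, tabs, line breaks) and drops empty entries.
    public static string[] Tokens(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // True when both fragments contain the same tokens in the same order.
    public static bool SameTokens(string a, string b) =>
        Tokens(a).SequenceEqual(Tokens(b));
}
```

Called with the run 1 and run 2 fragments quoted above, `SameTokens` returns true, which matches the observation that only the line structure changes. Swapping two values makes it return false, so reordered data would still be detected.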

dsp0109 · June 15, 2017, 9:46am

We created a sample application to check this. Below is its code: it reads PDFs from a folder and converts them to HTML. The sample PDFs are also attached.

Check the output for the file named ast_sci_data_tables_sample.pdf.

Console code (Aspose.Pdf DLL version 11.2.0.0):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Aspose.Pdf;
using Aspose.Pdf.Text;

namespace ASPOSE_TEST
{
    class Program
    {
        static void Main(string[] args)
        {
            var folderPath = @"D:\DNYANESHWAR\COUPLE OF NORMAL PDF";
            var files = Directory.GetFiles(folderPath);

            var license = new License();
            license.SetLicense("Aspose.Pdf.lic");

            // Convert the same set of files twice so the two runs can be compared.
            for (var i = 0; i < 2; i++)
            {
                // Each file is converted on its own thread.
                Parallel.ForEach(files, file =>
                {
                    // A GUID prefix keeps the outputs of the two runs from overwriting each other.
                    string targetHtmlFilePath = $"…/…/TO_HTML/{Guid.NewGuid() + "_" + Path.GetFileNameWithoutExtension(file)}.html";
                    var fileData = new MemoryStream(File.ReadAllBytes(file));
                    var pdfDocument = new Document(fileData);
                    var textAbsorber = new TextAbsorber();
                    pdfDocument.Pages.Accept(textAbsorber);
                    pdfDocument.Save(targetHtmlFilePath, SaveFormat.Html);
                });
            }
        }
    }
}

imran.rafique · June 15, 2017, 4:13pm

Hi,

Thank you for posting the PDF documents and code. We have tested your scenario with the latest version 17.6 of Aspose.Pdf for .NET API and could not reproduce the line-break issue. We have attached a Zip of the output HTML documents to this reply. Please check and let us know if you can find the said issue. Your response is awaited.

dsp0109 · June 16, 2017, 12:18am

The version we are using is 11.2.0.0, which we already mentioned in the title. We don't want to move to the latest version, as there are a lot of dependencies that we would then need to recheck.

We are parsing PDF to HTML and extracting data from the HTML with string manipulation, so we don't want to update the library: it would create a whole testing cycle for us, as the volume of data being processed is huge.

Please also find attached the output for the ast_sci_data_tables_sample.pdf file.

In the output you have provided, we can see only one file per PDF. Can you provide results for multiple runs?

imran.rafique · June 16, 2017, 5:22am

Hi,

Thank you for the details. We have attached a Zip of HTML documents with two outputs for each PDF to this reply (generated by API version 17.6: OutputHTMLs17.6.zip). There might be a bug in the old version 11.2.0.0 of Aspose.Pdf for .NET API and we recommend our clients always use the latest version of the API. We do not provide fixes in the old code base. We have also generated the output HTML documents with the old version 11.2.0.0 (download link: OutputHTMLs11.2.0.0.zip), kindly highlight issues with the help of snapshots. Your response is awaited.

dsp0109 · June 16, 2017, 5:43am

Yes, please find the attached screenshots. We compared the files in WinMerge, and you can see visible differences in the PDF to HTML conversion. Can you please cross-check on your side whether the issue is present with the newer build?

imran.rafique · June 16, 2017, 3:12pm

Hi,

Thank you for the details. You are comparing two outputs of a single PDF, and their content is different. We can reproduce this problem with the old API version 11.2.0.0 of Aspose.Pdf for .NET API. However, the latest version 17.6 generates identical outputs across multiple iterations of the code. To resolve this defect, we recommend upgrading from the old API version 11.2.0.0 to the latest version 17.6 of Aspose.Pdf for .NET API.