PDF to HTML - incorrect DOM ordering making text selections impossible

Hi,

I’m currently trialling Aspose PDF to convert PDFs into fixed layout HTML documents.

The order in which elements are inserted into the HTML document does not allow sensible text selections to be made. The elements appear to be inserted in a top-to-bottom order that disregards any column layout.
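
To make the problem concrete, here is a minimal sketch of the two orderings. It is not Aspose code: `TextBox`, `ReadingOrder`, and the fixed `columnWidth` parameter are hypothetical names made up for illustration, and I am only guessing at how the converter actually orders its elements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned text element on a fixed-layout page.
// X and Y are the top-left corner in page units; Y grows downwards.
public record TextBox(string Text, double X, double Y);

public static class ReadingOrder
{
    // What the output appears to do: order purely by vertical position,
    // so lines from neighbouring columns interleave in the DOM.
    public static List<string> TopToBottom(IEnumerable<TextBox> boxes) =>
        boxes.OrderBy(b => b.Y)
             .ThenBy(b => b.X)
             .Select(b => b.Text)
             .ToList();

    // What a selection-friendly order could look like: assign each box to a
    // column first (naively, by a fixed column width), then read each column
    // top to bottom before moving on to the next one.
    public static List<string> ColumnFirst(IEnumerable<TextBox> boxes, double columnWidth) =>
        boxes.OrderBy(b => (int)Math.Floor(b.X / columnWidth))
             .ThenBy(b => b.Y)
             .Select(b => b.Text)
             .ToList();
}
```

For two left-column lines at (50, 100) and (50, 120) and two right-column lines at (350, 100) and (350, 120), `TopToBottom` alternates between the columns, while `ColumnFirst` with a `columnWidth` of 300 returns both left-column lines before the right-column ones. The second ordering is what a browser needs for a drag selection to stay within one column.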

Please see the attached screenshot, which shows a sensible selection in a PDF viewer and the incorrect selection in the HTML document produced by Aspose:

Screenshot 2022-11-09 at 12.08.22.jpg (479.7 KB)

PDF input: input.pdf (395.4 KB)

HTML output: output.html.zip (2.3 MB)

HtmlSaveOptions:

var options = new HtmlSaveOptions
{
    DocumentType = HtmlDocumentType.Xhtml,
    FixedLayout = true,
    SplitIntoPages = false,
    UseZOrder = true,
    PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml,
    RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground,
    ImageResolution = 100,
    CompressSvgGraphicsIfAny = true,
    SaveFullFont = true,
};

Thanks in advance.

@njlgad

Could you please share some more detail about your issue? We have tested the scenario using the latest version of Aspose.PDF for .NET 22.10 and have not found any issue with the output HTML. Please check the attached output HTML.
output 22.10.zip (2.3 MB)

Hi @tahir.manzoor,

As far as I can tell, your output document exhibits the same problem as the one I generated.

Please see screenshot attached: Screenshot 2022-11-09 at 16.48.40.jpg (731.8 KB)

The selection starting from "Et a aut et excearibus quae non pore " and ending with "Pudant voluptatat. Giatist istios " should not include the paragraph on the right side of the page.

@

We have logged this problem in our issue tracking system as PDFNET-52926. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

Hi @tahir.manzoor,

Thanks for confirming there is an issue with the HTML conversion.
This is a blocker for us in adopting Aspose PDF, so we hope this can be resolved soon.

@njlgad

We will inform you once there is an update available on this issue.
