PDF->HTML conversion lacks some links

Summary: A PDF file with links is converted to HTML code where some links are missing.

Context

The first page of the PDF file contains two lines, each a link to text on the second page.

As a matter of good practice, and to protect confidential content in the original document, the file was reduced to a minimal repro case. The issue was observed on the complete PDF as well as on the minimal one.

Observed

  • Line “1.2 Scope of application” is converted to HTML links.

  • Line “2. Definitions” is converted to HTML spans.

Expected

  • Both lines are converted to HTML links.

Additional information

The generated HTML code is shown below. The “Scope” line is a link; the “Definitions” line is not.


<div class="stl_03 stl_04">

<div class="stl_01" style="left:4.02em;top:15.6209em;"><a href="#2_2"><span class="stl_05 stl_06 stl_07" style="word-spacing:0.6245em;">1.2 Scope</span></a><a href="#2_2"><span class="stl_05 stl_06 stl_08" style="word-spacing:-0.0018em;">&nbsp;</span></a><a href="#2_2"><span class="stl_05 stl_06 stl_09" style="word-spacing:0.0003em;">of application:......................................................................................................................................................................................................................3 &nbsp;</span></a></div>

<div class="stl_01" style="left:2.6em;top:17.6209em;"><span class="stl_10 stl_06 stl_08">2</span></div>

<div class="stl_01" style="left:4.02em;top:17.6209em;"><span class="stl_10 stl_06 stl_11">Definition</span><span class="stl_10 stl_06 stl_12">s</span><span class="stl_10 stl_06 stl_13">.</span><span class="stl_10 stl_06 stl_14">............................................................................................................................................................................................................................................3 &nbsp;</span></div>

</div>
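
For larger outputs, lines like the second one can be spotted mechanically. The following is a minimal sketch and not part of the original report: it assumes that each text line is emitted as a `<div class="stl_01" ...>` with no nested `<div>`, as in the excerpt above, and the names `MissingLinkScanner` and `FindUnlinkedLines` are hypothetical. It reports every line that contains no `<a href=...>`, so it is only meaningful on pages where all lines should be links, such as a table of contents.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class MissingLinkScanner
{
    // Matches one positioned text line; assumes no nested <div> inside it.
    private static readonly Regex LineDiv = new Regex(
        "<div class=\"stl_01\"[^>]*>(.*?)</div>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches any HTML tag, used to reduce a line to its visible text.
    private static readonly Regex Tag = new Regex("<[^>]+>");

    // Returns the visible text of every line that contains no <a href=...>.
    public static List<string> FindUnlinkedLines(string html)
    {
        var result = new List<string>();
        foreach (Match m in LineDiv.Matches(html))
        {
            string inner = m.Groups[1].Value;
            if (inner.IndexOf("<a href=", StringComparison.OrdinalIgnoreCase) >= 0)
                continue; // the line carries at least one link

            string text = WebUtility.HtmlDecode(Tag.Replace(inner, "")).Trim();
            if (text.Length > 0)
                result.Add(text);
        }
        return result;
    }
}
```

Calling `MissingLinkScanner.FindUnlinkedLines(System.IO.File.ReadAllText("output.html"))` on the output above would list both the “2” fragment and the “Definitions” line, since neither is wrapped in an anchor.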

The corresponding link declarations in the PDF file show no significant difference:


12 0 obj

<< /BS << /W 0 >> /Dest [ 6 0 R /XYZ 28 401 0 ] /F 4 /Rect [ 45.95 395.82 795.29 413.89 ] /StructParent 3 /Subtype /Link >>

endobj

13 0 obj

<< /BS << /W 0 >> /Dest [ 6 0 R /XYZ 28 218 0 ] /F 4 /Rect [ 28.95 371.75 795.29 395.82 ] /StructParent 4 /Subtype /Link >>

endobj

The bug was reproduced on Aspose.PDF 24.10.0 and Aspose.PDF 25.6.0.

@abilger

It appears that during the conversion of your PDF file to HTML, some links are not being preserved as expected. Specifically, while one line is correctly converted to an HTML link, another line is converted to a span, which does not retain the hyperlink functionality.

To troubleshoot this issue, consider the following steps:

  1. Check Link Properties: Ensure that the link properties in the PDF are correctly defined. The link should have a valid destination and be properly formatted in the PDF structure.

  2. Review Conversion Options: When converting PDFs to HTML using Aspose.PDF, you can specify various options that might affect how links are handled. For example, you can use the HtmlSaveOptions class to configure the conversion process. Here’s a basic example of how to set up the conversion:

    // Load the PDF document
    Document pdfDocument = new Document("path/to/your/document.pdf");
    
    // Create HtmlSaveOptions
    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    // Set any specific options you may need
    saveOptions.ConvertMarkedContentToLayers = true; // Example option
    
    // Save the document as HTML
    pdfDocument.Save("path/to/output.html", saveOptions);
    
  3. Update Aspose.PDF Version: If you are using an older version of Aspose.PDF (like 24.10.0 or 25.6.0), consider updating to the latest version. There may have been bug fixes or improvements related to link handling in newer releases.

  4. Test with Different PDFs: If possible, test the conversion with different PDF files to determine if the issue is specific to the PDF you are using or if it is a broader issue with the conversion process.

  5. Consult Documentation: Refer to the Aspose.PDF documentation for detailed information on the HtmlSaveOptions class and other relevant settings that may influence link conversion.

By following these steps, you should be able to identify the cause of the missing links and find a suitable solution. If the problem persists, consider reaching out to Aspose support for further assistance.

I didn’t expect a robot to answer. Anyway:

  1. Already done in the initial bug report. Tested with a PDF reader: the links work.
  2. I have reviewed the conversion options; none seem relevant. Given that some links are correct in the output, the issue appears to lie in link detection, not in output generation.
  3. Already done in the initial bug report, which mentions two versions of Aspose.PDF, including the latest one.
  4. We handle many PDF documents here. The issue does not occur on all of them. This PDF must be handled correctly; otherwise it is a bug.
  5. I have reviewed the conversion options; none seem relevant. Given that some links are correct in the output, the issue appears to lie in link detection, not in output generation.
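
One way to separate detection from output is to list the link annotations that Aspose.PDF itself reports on the first page. This is a minimal sketch under assumptions, not a confirmed diagnosis: the class name `LinkDump` and the default file name are hypothetical, and it relies on the `Page.Annotations` collection and the `LinkAnnotation` members `Rect`, `Destination` and `Action` from the Aspose.PDF for .NET API. If both annotations are printed, the loss would have to happen later, during HTML generation.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;

internal static class LinkDump
{
    static void Main(string[] args)
    {
        // Hypothetical default; pass the reproduction file as the first argument.
        string inFile = args.Length > 0 ? args[0] : "input.pdf";

        using (var doc = new Document(inFile))
        {
            // Aspose.PDF page collections are 1-based.
            foreach (Annotation annotation in doc.Pages[1].Annotations)
            {
                if (annotation is LinkAnnotation link)
                {
                    Console.WriteLine(
                        $"Link rect={link.Rect} " +
                        $"hasDestination={link.Destination != null} " +
                        $"hasAction={link.Action != null}");
                }
            }
        }
    }
}
```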

Here is the reproduction file:
pdf_to_html_conversion_lacks_some_links.pdf (195.3 KB)

@abilger

Could you also share the code snippet you used to convert the PDF to HTML? We will test the scenario in our environment and address it accordingly.

Thanks @asad.ali for your answer. The code snippet is pretty basic.

using System;

namespace AsposePdfToHtml
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Input and output paths come from the command line, with defaults.
            string inFile = args.Length > 0 ? args[0] : "input.pdf";
            string outFile = args.Length > 1 ? args[1] : "output.html";

            Console.WriteLine($"Converting '{inFile}' to '{outFile}'...");

            // Apply the license from an in-memory string (content redacted).
            Aspose.Pdf.License l = new Aspose.Pdf.License();
            string License = "...redacted...";
            using (var licenseStream = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes(License)))
            {
                l.SetLicense(licenseStream);
            }

            // Convert using the default HTML save options.
            Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(inFile);
            pdfDocument.Save(outFile, Aspose.Pdf.SaveFormat.Html);
            Console.WriteLine("PDF converted to HTML successfully.");
        }
    }
}

@abilger

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-60127

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.