PDF to HTML Conversion is slow

surendra1986 · June 26, 2018, 6:17am

When processing a larger PDF (e.g. 14 MB), the conversion does not run smoothly and takes hours to complete.

Also, bookmarks are not being extracted as htmlTOC.

Is there any solution to this?

We are also getting an exception in one of the PDFs while converting to HTML.

AsposePDFError.png (16.4 KB)

Farhan.Raza · June 26, 2018, 10:09am

@surendra1986

Thank you for contacting support.

Would you please share the relevant source and generated files by uploading them to Google Drive, Dropbox, etc., along with a code snippet that reproduces the problem, so that we may try to reproduce and investigate it in our environment? Moreover, we would appreciate it if you could create a separate post for the other issue, as this helps us assist you efficiently.

surendra1986 · June 27, 2018, 4:28am

1_PDFsam_Law Practice and Procedure of Arbitration.pdf (170.5 KB)
Below is the code that we are using. It produces an exception only with this PDF.
Document doc = new Document(file);
//PDF optimization
doc.Optimize();
// Instantiate HTML Save options object
//HtmlSaveOptions newOptions = new HtmlSaveOptions();

        // Enable option to embed all resources inside the HTML
        //newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;

        //newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedCssOnly;

        //newOptions.SplitIntoPages = false;
        //newOptions.FixedLayout = true;
        //This is just optimization for IE and can be omitted
        //newOptions.LettersPositioningMethod = Aspose.Pdf.HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
        //newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        //newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;


        // Get the page at a particular index of the Page Collection
        //Page pdfPage = doc.getPages().get_Item(1);

        // Create a new Document object
        //Document newDocument = new Document();

        // Add the page to the Pages collection of new document object
        //newDocument.getPages().add(pdfPage);

        // Save the new file
        //newDocument.save("page_" + pdfPage.getNumber() + ".pdf");
        //doc.save("page3.htm", newOptions);


        //string imagesDir = Path.GetDirectoryName(file) + @"\images";
       // string svgimagesDir = Path.GetDirectoryName(file) + @"\svg";
        //if (Directory.Exists(imagesDir)) { Directory.Delete(imagesDir, true); };
        //Directory.CreateDirectory(imagesDir);
       // if (Directory.Exists(svgimagesDir)) { Directory.Delete(svgimagesDir, true); };
       // Directory.CreateDirectory(svgimagesDir);

        // 3) Tune conversion options
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.FixedLayout = true;
        options.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsPngImagesEmbeddedIntoSvg;

        options.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.NoEmbedding;
        options.ExtractOcrSublayerOnly = true;

        //This is just optimization for IE and can be omitted
        options.LettersPositioningMethod = Aspose.Pdf.HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
        // Split HTML output into pages
        options.SplitIntoPages = true;

        // Specify to render PDF document layers separately in output HTML
        options.ConvertMarkedContentToLayers = true;
        options.SaveShadowedTextsAsTransparentTexts = true;
        options.SaveTransparentTexts = true;

        // Keep the CSS in a single file instead of splitting it per page
        options.SplitCssIntoPages = false;

        
        // 4) Do conversion
        doc.Save(Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file.Replace(" ", "_")) + "_page.html"), options);

Farhan.Raza · June 27, 2018, 11:36am

@surendra1986

We have worked with the data you shared and have been able to reproduce an XmlException with the description "Unexpected end tag". Would you please confirm that this is the exception you are seeing, so that we may proceed further to help you out?
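
To confirm the exception type and message on your side, the Save call can be wrapped in a small probe that also records how long the conversion takes. The following is only a minimal sketch: `ConversionProbe`, `Run`, and `Describe` are hypothetical names introduced here for illustration and are not part of Aspose.PDF.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of Aspose.PDF: runs one conversion step,
// measures its duration, and captures any exception instead of rethrowing it.
public static class ConversionProbe
{
    public sealed class Result
    {
        public TimeSpan Elapsed { get; set; }
        public Exception Error { get; set; }   // null when the step succeeded
    }

    public static Result Run(Action step)
    {
        var watch = Stopwatch.StartNew();
        var result = new Result();
        try
        {
            step();
        }
        catch (Exception ex)
        {
            // Keep the exception so its type and message can be reported.
            result.Error = ex;
        }
        watch.Stop();
        result.Elapsed = watch.Elapsed;
        return result;
    }

    public static string Describe(Result result)
    {
        if (result.Error == null)
            return "OK in " + result.Elapsed.TotalSeconds.ToString("F1") + " s";
        return result.Error.GetType().FullName + ": " + result.Error.Message;
    }
}
```

For example, `ConversionProbe.Run(() => doc.Save(outputPath, options))` could be called with the `doc` and `options` objects from the snippet above, where `outputPath` stands for the path built there. `Describe` then returns either the elapsed time or the exception type and message, which can be pasted into this thread.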

surendra1986 · June 28, 2018, 4:02am

Yes, we are getting the same exception. Please resolve it or tell us what the cause is…

Farhan.Raza · June 28, 2018, 10:44am

@surendra1986

Thank you for the clarification.

A ticket with ID PDFNET-44982 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked to this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

Farhan.Raza · June 28, 2018, 11:49am

A post was split to a new topic: Not getting attachment with link

Farhan.Raza · June 28, 2018, 11:48am

A post was split to a new topic: Unable to get bookmarks

aspose.notifier · January 7, 2020, 12:31am

The issues you have found earlier (filed as PDFNET-44982) have been fixed in Aspose.PDF for .NET 20.1.