Saving PDF to HTML (split into pages) generates more files than pages

bpetrea · July 30, 2018, 2:15pm

Hi, is there a way to have Aspose.PDF generate exactly as many HTML files as there are pages (given that both SplitIntoPages and FixedLayout are set to true)? Alternatively, is there a way to have it generate a fixed layout with a separate file for each page?

Farhan.Raza · July 30, 2018, 6:28pm

@bpetrea

Thank you for contacting support.

You can load the PDF document, iterate through its pages, and save each page as a separate HTML file, as in the code snippet below:

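Document sourcePDF = new Document(@"Test.pdf");
foreach (Aspose.Pdf.Page page in sourcePDF.Pages)
{
    // Copy the current page into a new single-page document.
    Document newDocument = new Document();
    newDocument.Pages.Add(page);
    // Save that document as its own HTML file, named after the page number.
    newDocument.Save(@"Page_" + page.Number + ".html", SaveFormat.Html);
}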

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

bpetrea · July 31, 2018, 7:34am

I will try it, thank you. Eventually I want to generate a fixed-layout EPUB, which is why I need the output to be per page.

Farhan.Raza · July 31, 2018, 11:50am

@bpetrea

Please take your time and test the suggested approach in your environment. Please feel free to contact us if you need any further assistance.

bpetrea · August 6, 2018, 12:34pm

Hi, sorry to bother you, but after exporting each page to HTML I noticed that the links aren’t exported (the ones pointing to other PDF pages - I still need to check the external ones). Is this feature not supported, or am I doing something wrong?

Later edit: Never mind, I figured it out and will post it here for future reference:
You need to go through each page’s annotations that have GoToActions and change them into GoToURIActions:

                // "list" is assumed to hold the annotations of the current page.
                for (int linkCount = list.Count - 1; linkCount >= 0; linkCount--)
                {
                    LinkAnnotation a = list[linkCount] as LinkAnnotation;

                    // Skip annotations that are not links or that carry no GoToAction.
                    if (a != null && a.Action is GoToAction)
                    {
                        // Replace the in-document jump with a link to an exported HTML file.
                        a.Action = new GoToURIAction("Page_" + page.Number + ".html");
                    }
                }
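
The snippet above builds the link target from page.Number, which is the page currently being processed. If each link should instead open the HTML file of the page it jumps to, the target can be derived from the action's destination. The following is only a sketch under assumptions that are not confirmed in this thread: that the links use explicit destinations (named destinations are skipped), that ExplicitDestination.PageNumber reports the page number within the source document, and that the rewrite runs on the source document before pages are copied out. InternalLinkRewriter and PageFileName are hypothetical names that mirror the Page_<n>.html naming used here.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Annotations;

public static class InternalLinkRewriter
{
    // Hypothetical helper: mirrors the "Page_<n>.html" naming used in this thread.
    public static string PageFileName(int pageNumber)
    {
        return "Page_" + pageNumber + ".html";
    }

    // Intended to run on the source document, before pages are copied
    // into the per-page documents that get saved as HTML.
    public static void Rewrite(Document source)
    {
        foreach (Page page in source.Pages)
        {
            foreach (Annotation annotation in page.Annotations)
            {
                // Only link annotations with an in-document GoTo action are rewritten.
                if (!(annotation is LinkAnnotation link)) continue;
                if (!(link.Action is GoToAction goTo)) continue;

                // Assumption: the destination is explicit and exposes its page number.
                // Anything else (for example a named destination) is left untouched.
                if (goTo.Destination is ExplicitDestination destination)
                {
                    link.Action = new GoToURIAction(PageFileName(destination.PageNumber));
                }
            }
        }
    }
}
```

Calling InternalLinkRewriter.Rewrite(sourcePDF) before the per-page export loop would leave external links alone and change only the internal jumps that can be resolved to a page number.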

Farhan.Raza · August 6, 2018, 6:02pm

@bpetrea

Thank you for your kind feedback.

We are glad to know that things have started working in your environment. Please feel free to contact us if you need any further assistance and we will be more than happy to assist you.

bpetrea · August 7, 2018, 8:34am

Hi, I’m back again (sorry for being a nuisance). I noticed that on export to HTML the links come out wrong: the spans are placed correctly, but the destinations are mixed up. In a TOC, page 9 points to page 13, page 13 points to page 15, and so on, even though the links are correct in the PDF.

example.jpg (196.5 KB)

Is there anything I can do to get the links in the correct order?

Kindest regards,
Bogdan

Farhan.Raza · August 7, 2018, 6:41pm

@bpetrea

Thank you for getting back to us.

Would you please share the source and generated files with us, along with a narrowed-down code snippet that reproduces this issue, so that we can investigate further and help you out?

bpetrea · August 8, 2018, 8:53am

@Farhan.Raza sure

Page_007.pdf (53.9 KB)

As for the code used:

            HtmlSaveOptions options = new HtmlSaveOptions();
            options.FixedLayout = true;
            options.SplitIntoPages = false;
            options.SplitCssIntoPages = false;
            options.CompressSvgGraphicsIfAny = false;
            options.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsExternalPngFilesReferencedViaSvg;
            options.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
            options.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
            options.HtmlMarkupGenerationMode = HtmlSaveOptions.HtmlMarkupGenerationModes.WriteAllHtml;
            options.PreventGlyphsGrouping = false;
            options.RemoveEmptyAreasOnTopAndBottom = false;
            options.PagesFlowTypeDependsOnViewersScreenSize = false;
            options.UseZOrder = true;
            options.SaveTransparentTexts = false;
            options.SaveShadowedTextsAsTransparentTexts = false;

            foreach (Aspose.Pdf.Page page in doc.Pages)
            {
                // Assumption: pageNumber is the zero-padded page number, matching the attached Page_007 files.
                string pageNumber = page.Number.ToString("D3");
                Document newDocument = new Document();
                newDocument.Pages.Add(page);
                // The PDF copy was used to generate the attached PDF, to check whether the links are OK in the PDF page.
                newDocument.Save(dirName + @"\Page_" + pageNumber + ".pdf", SaveFormat.Pdf);
                newDocument.Save(dirName + @"\Page_" + pageNumber + ".html", options);
            }

As for the HTML file contents (note that the image paths are modified afterwards by another script; if you think it necessary, I can generate an intermediate file with the raw output from Aspose.PDF):

Page_007.zip (2.0 KB)

Farhan.Raza · August 8, 2018, 6:34pm

@bpetrea

We have worked with the data you shared and were able to reproduce the issue in our environment. A ticket with ID PDFNET-45219 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

aspose.notifier · November 14, 2019, 9:14pm

The issues you have found earlier (filed as PDFNET-45219) have been fixed in Aspose.PDF for .NET 19.11.