PDF to HTML conversion configuration

How should the Aspose PDF Java API be configured to produce the same result when converting to HTML as the online converter (Convert PDF to HTML | Online and Free)?

@martinrixham

Are you facing issues with the on-premise API? Could you please share your sample files and the code snippet you are using? We will test the scenario in our environment and address it accordingly.

Our initial configuration looks something like this (in Kotlin):

    val document = Document(inputStream)
    val options = HtmlSaveOptions()
    val outputStream = ByteArrayOutputStream()

    options.htmlMarkupGenerationMode = HtmlSaveOptions.HtmlMarkupGenerationModes.WriteOnlyBodyContent
    options.fixedLayout = true
    options.customStrategyOfCssUrlCreation = CSSUrlStrategy
    options.customCssSavingStrategy = CssSavingStrategy
    options.customResourceSavingStrategy = ResourceSavingStrategy

    document.save(outputStream, options)

In the cases I have tried, it just extracts the text and doesn’t replicate any of the formatting, equations, or fonts.

@martinrixham

Would you kindly also share a sample source file for our reference, so that we can investigate the case in our environment and address it accordingly?

I can’t share internal files, but it’s not hard to find publicly available ones to test on:

https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf

@martinrixham

We converted this file to HTML both with the code snippet below and with the online app. The results were nearly the same in both cases: the formatting was incorrect and some text was missing.

    // Load the source PDF (dataDir is the folder that holds the test files)
    Document pdfDocument = new Document(dataDir + "M611heaviside.pdf");
    HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions();
    //htmlSaveOptions.setDocumentType(HtmlDocumentType.Xhtml);
    // Fixed layout with CSS-based compensation for letter positioning
    htmlSaveOptions.setFixedLayout(true);
    htmlSaveOptions.setLettersPositioningMethod(LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss);
    htmlSaveOptions.setFontSavingMode(HtmlSaveOptions.FontSavingModes.SaveInAllFormats);
    htmlSaveOptions.setImageResolution(72);
    // Embed fonts, images and other parts into the single HTML file
    htmlSaveOptions.setPartsEmbeddingMode(HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml);
    htmlSaveOptions.setRasterImagesSavingMode(HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground);
    pdfDocument.save(dataDir + "output.html", htmlSaveOptions);

outputhtmls.zip (109.5 KB)

Therefore, we have logged an issue as PDFJAVA-41914 in our issue tracking system for further investigation. We will look into its details and keep you posted on the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

Thank you. Is there a way to track progress on this issue?

@martinrixham

The issue status is shown in the Issue Status box at the bottom of this thread. The issue itself is logged in our internal issue management system, which you will not be able to access. Nevertheless, we will post an update in this forum thread as soon as we have news about its resolution. Please spare us some time.

This is still being tracked on my side, so I need to report back on whether it is being fixed or should be considered a future improvement. Which of those should I report?

@martinrixham

The ticket is currently pending investigation. As it was logged under the free support model, it will be investigated and resolved on a first-come, first-served basis. We will inform you as soon as we have definite news about its resolution or a fix ETA. Please spare us some time.

We are sorry for the inconvenience.