Aspose.PDF PDF to HTML output difference

piyushrchilli · April 8, 2025, 10:33am

Hi,
I’m from RChilli. We’re using Aspose.PDF version 24.7 to convert PDFs to HTML. However, we’ve noticed inconsistencies in the HTML output when parsing some resumes: the results differ between our local system and our Linux server.

You can view the comparison here: Aspose.pdf - Diffchecker
The left side shows the output from our local system, and the right side shows the output from our server.

Sample code:

public String convertPdfToHtml(byte[] pdfData) {
    String htmlContent = null;
    ByteArrayOutputStream byteArrayOutputStream = null;
    Document document = null;
    try {
        try (InputStream inputStream = new ByteArrayInputStream(pdfData)) {
            com.aspose.pdf.License license = new com.aspose.pdf.License();
            license.setLicense(GlobalConstants.pdfLicenceStream());
            document = new Document(inputStream);

            HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions();
            htmlSaveOptions.setPartsEmbeddingMode(HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml);
            htmlSaveOptions.setRasterImagesSavingMode(
                    HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground);
            
            byteArrayOutputStream = new ByteArrayOutputStream();
            document.save(byteArrayOutputStream, htmlSaveOptions);
            
            htmlContent = byteArrayOutputStream.toString(StandardCharsets.UTF_8); 
            GlobalConstants.setPdfConversionCount(GlobalConstants.getPdfConversionCount() + 1);
            if (GlobalConstants.getPdfConversionCount() % 100 == 0) {
                MemoryCleaner.clearStaticInstances();
                MemoryCleaner.clear();
                MemoryCleaner.clearKeepingFontSettings();
                GlobalConstants.setPdfConversionCount(0);
            }
        }
    } catch (Exception e) {
        Utility.errorLog(e);
    } finally {
        try {
            if (byteArrayOutputStream != null) {
                byteArrayOutputStream.close();
            }
            if (document != null) {
                document.close();
            }        
        } catch (IOException e) {
            Utility.errorLog(e);
        }
    }
    return htmlContent;
}

_MR__H_JH__Canva_AMRESH-KUMAR-1727958743.zip (180.1 KB)

Professionalize.Discourse · April 8, 2025, 10:35am

@piyushrchilli

It seems that you are experiencing inconsistencies in the HTML output when converting PDFs to HTML using Aspose.PDF version 24.7 on different systems. This can happen due to various factors, including differences in the environment, font availability, or settings used during the conversion process.

Here are a few suggestions to help you troubleshoot and potentially resolve the issue:

Font Availability: Ensure that the fonts used in the PDF are available on both your local system and the Linux server. If the fonts are not available, Aspose.PDF may substitute them with default fonts, leading to differences in the output. You can specify a default font to use for substitution in your HtmlSaveOptions:
```
HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions();
htmlSaveOptions.setDefaultFontName("YourDefaultFontName");
```
HtmlSaveOptions Configuration: Review the settings in your HtmlSaveOptions. Certain options, such as PartsEmbeddingMode and RasterImagesSavingMode, can affect the output. You might want to experiment with these settings to see if they impact the consistency of the output:
```
htmlSaveOptions.setPartsEmbeddingMode(HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml);
htmlSaveOptions.setRasterImagesSavingMode(HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground);
```
Environment Differences: Check for any differences in the Java runtime environment or the Aspose.PDF library version between your local system and the server. Even minor version differences can sometimes lead to variations in output.
Debugging Output: To better understand the differences, you can log or output the HTML generated on both systems and compare them. This might help you identify specific elements or styles that are causing discrepancies.
Testing with Different PDFs: If possible, test with different PDF files to see if the issue is consistent across various documents. This can help determine if the problem is specific to certain PDFs or a broader issue with the conversion process.
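
To make the font availability check above concrete, here is a minimal sketch that lists the font files installed on a machine so the local system and the Linux server can be compared. It uses only the JDK. The class name `FontInventory` and the directory list are illustrative assumptions, not part of Aspose.PDF, and the library's own font lookup may consult other locations, so treat the result as a hint rather than proof.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical helper, not an Aspose.PDF API.
public class FontInventory {

    /** Returns the sorted, lower-cased names of all .ttf/.otf/.ttc files under the given directories. */
    public static TreeSet<String> listFontFiles(List<Path> fontDirs) {
        TreeSet<String> names = new TreeSet<>();
        for (Path dir : fontDirs) {
            if (!Files.isDirectory(dir)) {
                continue; // this directory does not exist on the current machine
            }
            try (Stream<Path> files = Files.walk(dir)) {
                files.filter(Files::isRegularFile)
                     .map(p -> p.getFileName().toString().toLowerCase(Locale.ROOT))
                     .filter(n -> n.endsWith(".ttf") || n.endsWith(".otf") || n.endsWith(".ttc"))
                     .forEach(names::add);
            } catch (IOException | UncheckedIOException e) {
                System.err.println("Could not scan " + dir + ": " + e.getMessage());
            }
        }
        return names;
    }

    /** Returns the entries present in reference but absent from candidate. */
    public static TreeSet<String> missingFrom(TreeSet<String> reference, TreeSet<String> candidate) {
        TreeSet<String> missing = new TreeSet<>(reference);
        missing.removeAll(candidate);
        return missing;
    }

    public static void main(String[] args) {
        // Assumed typical font locations; adjust for your systems.
        List<Path> dirs = List.of(
                Paths.get("/usr/share/fonts"),
                Paths.get("/usr/local/share/fonts"),
                Paths.get(System.getProperty("user.home"), ".fonts"),
                Paths.get("C:\\Windows\\Fonts"));
        listFontFiles(dirs).forEach(System.out::println);
    }
}
```

Run it on both machines, save each output to a file, and compare the two lists. Any font file that exists locally but not on the server is a candidate cause of substitution, particularly if the PDF references that font without embedding it.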

If the problem persists after trying these suggestions, consider reaching out to Aspose support with detailed information about your environment and the specific PDFs you are working with. They may be able to provide more targeted assistance.
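
As a rough sketch of the environment details worth collecting, the snippet below prints the JVM and OS settings that most often differ between a developer machine and a server. It relies only on standard JDK system properties. The class name `EnvironmentReport` and the choice of properties are assumptions made for illustration. It does not report the Aspose.PDF version, which is easiest to confirm from your build file or the JAR name.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical helper, not an Aspose.PDF API.
public class EnvironmentReport {

    /** Collects JVM and OS settings in a stable order so two reports can be diffed line by line. */
    public static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        String[] keys = {
            "java.version", "java.vendor", "java.vm.name",
            "os.name", "os.version", "os.arch", "java.awt.headless"
        };
        for (String key : keys) {
            report.put(key, System.getProperty(key, "(not set)"));
        }
        report.put("default.charset", Charset.defaultCharset().name());
        report.put("default.locale", Locale.getDefault().toLanguageTag());
        report.put("default.timezone", TimeZone.getDefault().getID());
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Running this on both systems and attaching the two outputs to a support request makes differences such as the JDK vendor, the default charset, or headless mode easy to spot.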

Feel free to share any additional details or specific errors you encounter during the conversion process, and I can help further!

piyushrchilli · April 10, 2025, 6:35am

Hi,
We’ve set the default font, but we’re still encountering issues with PDF parsing on the Linux server.

If there is any other place where we can set a default font to be used during PDF-to-HTML conversion, please let us know.

Code:

public String convertPdfToHtml(byte[] pdfData) {
        String htmlContent = null;
        ByteArrayOutputStream byteArrayOutputStream = null;
        Document document = null;
        try {
            try (InputStream inputStream = new ByteArrayInputStream(pdfData)) {
                com.aspose.pdf.License license = new com.aspose.pdf.License();
                license.setLicense(GlobalConstants.pdfLicenceStream());
                document = new Document(inputStream);

                HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions();
                htmlSaveOptions.setPartsEmbeddingMode(HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml);
                htmlSaveOptions.setRasterImagesSavingMode(
                        HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground);
                htmlSaveOptions.setDefaultFontName("Arial");
                
                byteArrayOutputStream = new ByteArrayOutputStream();
                document.save(byteArrayOutputStream, htmlSaveOptions);
                
                htmlContent = byteArrayOutputStream.toString(StandardCharsets.UTF_8); 
                GlobalConstants.setPdfConversionCount(GlobalConstants.getPdfConversionCount() + 1);
                if (GlobalConstants.getPdfConversionCount() % 100 == 0) {
                    MemoryCleaner.clearStaticInstances();
                    MemoryCleaner.clear();
                    MemoryCleaner.clearKeepingFontSettings();
                    GlobalConstants.setPdfConversionCount(0);
                }
            }
        } catch (Exception e) {
            Utility.errorLog(e);
        } finally {
            try {
                if (byteArrayOutputStream != null) {
                    byteArrayOutputStream.close();
                }
                if (document != null) {
                    document.close();
                }        
            } catch (IOException e) {
                Utility.errorLog(e);
            }
        }
        return htmlContent;
    }

Linux server:
image.png (30.0 KB)

System:
image.png (31.1 KB)

asad.ali · April 10, 2025, 9:04pm

@piyushrchilli

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44894

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.