Java Aspose PDF to HTML issue with 24.12 - Word is splitting into letters on Linux

sthadhani · February 24, 2025, 2:00pm

Hello,

Environment Details:

Aspose.PDF Version: 24.12
Jenkins OS: Linux
Local OS: Windows 11
Java Version: OpenJDK 17

Problem Description:

I am using Aspose.PDF to convert PDFs to HTML. The generated HTML content differs when running on Jenkins (Linux) compared to my local Windows machine.

Observations:

On Windows (local machine), the extracted text is correct and appears as expected.
On Jenkins (Linux machine), the extracted text is scattered into multiple <span> tags or is not formatted correctly.
The test case checks for a specific word (e.g., "manon") in the extracted text. It passes on Windows but fails on Jenkins because the text appears in a broken format.
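
One way to make such a check less sensitive to how the converter groups letters is to strip the markup before searching for the word. The sketch below uses only the JDK. The class name `HtmlTextCheck`, its helper methods, and the per-letter `<span>` sample are hypothetical illustrations, not the actual Aspose.PDF output or the original test code.

```java
import java.util.regex.Pattern;

final class HtmlTextCheck {
    // Matches any HTML tag; a simplification that ignores comments and scripts.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    /** Removes markup so letters split across adjacent tags are joined again. */
    static String stripTags(String html) {
        return TAGS.matcher(html).replaceAll("");
    }

    /** Returns true if the word occurs in the text once all tags are removed. */
    static boolean containsWord(String html, String word) {
        return stripTags(html).contains(word);
    }

    public static void main(String[] args) {
        // Hypothetical markup: one word in a single tag vs. one letter per tag.
        String intact = "<span>manon</span>";
        String split = "<span>m</span><span>a</span><span>n</span><span>o</span><span>n</span>";

        System.out.println(containsWord(intact, "manon")); // true
        System.out.println(containsWord(split, "manon"));  // true
        System.out.println(split.contains("manon"));       // false: raw search sees the tags
    }
}
```

This only hides the difference in markup from the test; it does not change the generated HTML. It also would not match if the converter emits whitespace between the letter spans, and it does not decode HTML entities.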

Debugging Steps Taken:

Tried Different LettersPositioningMethods:

UseEmUnitsAndCompensationOfRoundingErrorsInCss
UsePixelUnitsInCssLetterSpacingForIE
No improvement.

Checked Fonts Availability:

Installed ttf-mscorefonts-installer on Jenkins, and the issue was resolved.

Verified Text Extraction:

Before installing fonts → Text was broken.
After installing fonts → Text appeared correctly.

Possible Cause:

It seems that Aspose.PDF relies on system fonts for text extraction and rendering. When the required fonts are missing (as in Linux Jenkins environments), the text is not positioned correctly in the HTML output.

Expected vs. Actual Behavior:

Expected Behavior (Windows & After Font Installation on Linux):

manon is present correctly in one tag.

Actual Behavior (Linux Before Font Installation):
manon

I have the following questions:

Why do missing system fonts affect text extraction in HTML conversion?
Is there a way to make Aspose.PDF independent of system fonts?
Are there any best practices to ensure consistent output across different OS environments?

Professionalize.Discourse · February 24, 2025, 2:00pm

@sthadhani

Can you please clarify if you are looking for specific code examples or general guidance on the questions you have regarding Aspose.PDF and font handling?

sthadhani · February 24, 2025, 2:04pm

I’m looking for a specific solution.

asad.ali · February 24, 2025, 5:16pm

@sthadhani

Aspose.PDF relies on font information embedded in the PDF or available in the system to correctly render and extract text. When a required font is missing OR a document uses a font that is not installed, Aspose.PDF may replace it with a default font, which can lead to incorrect or misaligned extracted text.

Yes! You can configure Aspose.PDF for Java to load fonts from a custom font folder instead of relying on system fonts:

// Set a custom font folder (avoid dependency on system fonts)
FontRepository.addLocalFontPath("/path/to/custom/fonts");
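
If the fonts are shipped with the application rather than installed on the OS, it may help to verify the folder before registering it, so a missing or empty folder fails fast instead of silently falling back to a default font. The following is a minimal JDK-only sketch: the class name `BundledFonts`, the method `listFontFiles`, and the default `fonts` folder are assumptions for illustration, and the Aspose.PDF call is left as a comment so the sketch compiles without the library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BundledFonts {

    /** Lists .ttf/.otf files directly inside fontDir; throws if the folder is missing or empty. */
    static List<Path> listFontFiles(Path fontDir) throws IOException {
        if (!Files.isDirectory(fontDir)) {
            throw new IOException("Font folder not found: " + fontDir);
        }
        try (Stream<Path> entries = Files.list(fontDir)) {
            List<Path> fonts = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                        return name.endsWith(".ttf") || name.endsWith(".otf");
                    })
                    .sorted()
                    .collect(Collectors.toList());
            if (fonts.isEmpty()) {
                throw new IOException("No .ttf/.otf files in " + fontDir);
            }
            return fonts;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location: a "fonts" folder next to the application, or the first argument.
        Path fontDir = Paths.get(args.length > 0 ? args[0] : "fonts");
        List<Path> fonts = listFontFiles(fontDir);
        System.out.println("Found " + fonts.size() + " font file(s) in " + fontDir.toAbsolutePath());

        // With Aspose.PDF on the classpath, register the folder before loading the document:
        // FontRepository.addLocalFontPath(fontDir.toAbsolutePath().toString());
    }
}
```

Running the check once at startup, for example in a test setup method, makes a missing font folder show up as an explicit error on the build server rather than as differently formatted HTML.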

You can do the following to ensure consistent output across different OS environments:

If generating PDFs programmatically, enable font embedding.
If running on Linux, install missing fonts with:

sudo apt install ttf-mscorefonts-installer

If a font is missing, substitute it with a similar one:

FontRepository.getSubstitutions().add(new FontSubstitutionRule("OriginalFont", "Arial"));

Different OS environments may render PDFs differently based on DPI settings. Standardize resolution by setting TextExtractionOptions:

TextExtractionOptions options = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
textAbsorber.setExtractionOptions(options);

sthadhani · February 25, 2025, 6:16am

As I mentioned in my earlier post, installing fonts on Linux works. But I’m looking for a better approach where we don’t need to install anything on the OS.

We don’t know which fonts are missing at runtime, so setting up substitutions is difficult.

I tried the last code snippet, the one using TextExtractionOptions, but with no success.

Is there any way to fix this issue without installing fonts on Linux?

asad.ali · February 25, 2025, 11:08am

@sthadhani

As shared earlier, Aspose.PDF relies on system fonts to process PDF files. Installing fonts and keeping them on the system is one of the basic guidelines for using the API. We are afraid that, at the moment, we cannot suggest a solution that does not involve installing fonts. If possible, would you please share your sample PDF document, code snippet and steps to reproduce the issue? We will log an investigation ticket in our issue management system and share the ID with you.

sthadhani · March 5, 2025, 11:52am

Sample Code and file.zip (229.1 KB)

I have attached a zip containing a sample PDF file and sample code to reproduce the issue.

To reproduce the issue, run both test cases in AsposePDFToHtmlUtilsTest.java.

These test cases pass with aspose-pdf 23.11 but fail with 24.12.

The expectation is that these test cases should pass with the latest version of aspose-pdf.

Let me know if you need more info.

asad.ali · March 5, 2025, 4:56pm

@sthadhani

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44788

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.