Word-break not working as expected when converting html to pdf using Java

Hello,

We are having problems converting HTML documents to PDF using Java: we cannot get word breaking to work as expected. We need the text to be broken at spaces and not in the middle of words. How can this be achieved?

Here is a sample of the input html:

<!DOCTYPE html>
<html lang="en"
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:th="http://www.thymeleaf.org"
>
<head>
<title></title>
<meta charset="utf-8"/>
<meta name="viewport"
	  content="width=device-width, initial-scale=1"/>
<style>
	body {
		size:A4;
		font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
	}

	#class2 {
		font-size: 64px;
		font-weight: bold;
		word-break: break-word;
		width: 100%;
	}

	#class1{
		text-align: center;
		width: 100%;
	}

    </style>
</head>
<body>
    <div id="class1">
       <div id="class2">
	    This text should be broken by space
       </div>
    </div>
   <div id="class1">
       <div id="class2">
	   PI
       </div>
    </div>
 </body>
</html>

In the first case, the text “This text should be broken by space” should be broken only at spaces; in the second case, the text “PI” should not be broken at all.
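
To make the expected behaviour concrete, here is a minimal sketch of wrapping at spaces only. It is independent of Aspose and only illustrates the desired result: the class name `SpaceWrap` and the character-based width limit are assumptions made for this example (a real renderer measures glyph widths, not character counts).

```java
import java.util.ArrayList;
import java.util.List;

class SpaceWrap {
    // Greedy wrap: a line is broken only at spaces.
    // A word longer than maxChars is kept whole on its own line.
    static List<String> wrap(String text, int maxChars) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Start a new line if adding " " + word would exceed the limit.
            if (line.length() > 0 && line.length() + 1 + word.length() > maxChars) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        wrap("This text should be broken by space", 12).forEach(System.out::println);
        wrap("PI", 12).forEach(System.out::println);
    }
}
```

With a limit of 12 characters, the first text comes out as "This text", "should be", "broken by", "space", and "PI" stays on a single line. This is the kind of result we expect from the conversion.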

Here is the output:

cover.pdf (29.9 KB)

Here is the sample code:

    loadAsposeLicense();
    HtmlLoadOptions options = new HtmlLoadOptions();
    options.setInputEncoding(StandardCharsets.UTF_8.name());
    options.getPageInfo().setMargin(new MarginInfo(MARGIN_LEFT_PT, MARGIN_BOTTOM_PT, MARGIN_RIGHT_PT, MARGIN_TOP_PT));

    Document document = new Document(RESOURCE_DIR + "cover.html", options);

    document.setFitWindow(true);
    document.setLayersAdded(true);
    try (FileOutputStream fos = new FileOutputStream("cover.pdf"))
    {
        document.save(fos);
    }
    catch (IOException e)
    {
        // FileNotFoundException is a subclass of IOException, so it is handled here as well
        e.printStackTrace();
    }
    finally
    {
        document.close();
    }

Arjana Bivainiene

@arjana

Could you please share the MarginInfo values used here so that we can test? We will investigate the issue and provide more information.

The values are as follows:

public static final int MARGIN_TOP_PT = 30;
public static final int MARGIN_RIGHT_PT = 10;
public static final int MARGIN_BOTTOM_PT = 40;
public static final int MARGIN_LEFT_PT = 10;

However, the actual margin values only affect the exact place where the line is broken. Here is an example of the output when I do not set the margins: cover_wo_margins.pdf (29.9 KB)

We are using Aspose version 21.8, but the behaviour is the same with version 22.6.

@arjana

We have logged this problem in our issue tracking system as PDFJAVA-42054. You will be notified via this forum thread once this issue is resolved.

We apologize for the inconvenience.

We also noticed that in some places text is broken into multiple lines if the last letter is one of the following: I, i, j, l.
As in my example, the word PI was split across lines, and if we use Pl/Pi/Pj instead, it is split as well. With a different last letter, e.g. t, the text remains on the same line (Pt).
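
To narrow down which trailing letters trigger the split, a small generator along these lines might help. It is only a sketch: the class name `TrailingLetterCases`, the `.case` CSS class and the output file name `trailing_letters.html` are made up for illustration, and it does not call Aspose at all. It writes one HTML file with a block for every word "P" + letter, styled like `#class2` in the sample above, which can then be fed to the same conversion code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class TrailingLetterCases {
    // Builds one HTML page with a div per candidate word "P" + letter (a-z and A-Z).
    static String buildHtml() {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\"/>\n");
        sb.append("<style>\n");
        // Same text styling as #class2 in the original sample.
        sb.append(".case { font-size: 64px; font-weight: bold; word-break: break-word;");
        sb.append(" text-align: center; width: 100%; }\n");
        sb.append("</style>\n</head>\n<body>\n");
        for (char c = 'a'; c <= 'z'; c++) {
            appendCase(sb, c);
            appendCase(sb, Character.toUpperCase(c));
        }
        sb.append("</body>\n</html>\n");
        return sb.toString();
    }

    private static void appendCase(StringBuilder sb, char last) {
        sb.append("<div class=\"case\">P").append(last).append("</div>\n");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file name; convert this file with the code shown earlier.
        Files.write(Paths.get("trailing_letters.html"),
                buildHtml().getBytes(StandardCharsets.UTF_8));
    }
}
```

Any block that ends up on two lines in the resulting PDF points to a trailing letter that is affected.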

I also added another sample HTML with an analogous situation: if the last letter of a table cell is “l/i/j/I”, it is moved to another line, regardless of how much free space is still available on the same line. We can see this happening with the word “Real” in some cells. If the letter “l” is replaced with any letter other than l/i/j/I, the word is not broken into multiple lines.

letter_l.pdf (122.7 KB)

@arjana

Can you please share the sample HTML in .zip format for our reference? We will log another ticket and share the ID with you.

Here is the sample HTML as a zip file: letter_l_break.zip (3.3 KB)
I reduced it to contain less data, but the aforementioned line breaking can still be observed in the table header cells after conversion to PDF.

@arjana
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-42747

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.