Structured Document Tag RemoveSelfOnly results in decreased "Total" column width when published to PDF

Hi team,
While evaluating the latest Aspose.Words library, we noticed an issue when we remove rich text SDTs from a Word document
and then publish it to PDF.
After the SDTs are removed and the document is published to PDF, the 2nd column (“Total”) in the table is narrower than in
the original Word document.
Attached is a zip archive illustrating the problem. It contains:

  • DocWithSomeText_HeaderFooterStyles.docx - input word document with two SDTs containing rich text;
  • 24_10_output_DocWithSomeText_HeaderFooterStyles.pdf - output after removing the SDTs (self only) and saving to PDF;
  • 24_10_output_removedEmbeddedContents.docx - output after removing the SDTs (self only) and saving as DOCX;
  • SaveAsPDF_DocWithSomeText_HeaderFooterStyles.pdf - expected output (created via Word → SaveAs → PDF)
  • TestWordToPDF2.java - sample test program illustrating the issue

This issue is similar to the one posted to the forum:
Structured Document Tag RemoveSelfOnly results in increased 1st column width when published to PDF
However, in this case the second column (“Total”) is narrower, and as a result the numbers partially wrap onto the next line (“276”, “886” etc.).
Also, the suggested workaround of saving the document to DOCX before creating the PDF does not work for this issue.
This issue occurs under both Linux 7 and Windows 11 OS.
Please let us know, if you need additional information.
Thank you.
WordToPDF_misalignedColumns.zip (212.6 KB)

@oraspose
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27489

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@oraspose We have completed analyzing the problem. The problematic tables are auto-fit tables (table layout depends on content metrics). The document appears to be generated by Aspose.Words and incorrect table column width data are stored in tblGrid elements. Aspose.Words is capable of reconstructing the table layout from cell properties and content metrics, but the logic is not applied because your code does not configure advanced typography support and some runs in the problematic tables have advanced typography features (ligatures). Without advanced typography, content metrics are not considered reliable and Aspose.Words falls back to the older table layout approach that mostly relies on widths stored inside tblGrid elements. As the widths in tblGrid are incorrect, the table layouts do not match MS Word.

Advanced typography features are supported by Aspose.Words via Aspose.Words.Shaping.HarfBuzz. You should install this package by adding the following dependencies:

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>24.10</version>
    <classifier>jdk17</classifier>
</dependency>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>24.10</version>
    <classifier>shaping-harfbuzz-plugin</classifier>
</dependency>

and modify the code as shown below:

// Open a document
Document doc = new Document("in.docx");

// When text shaper factory is set, layout starts to use OpenType features.
// getInstance() returns a static BasicTextShaperCache object wrapping HarfBuzzTextShaperFactory
doc.getLayoutOptions().setTextShaperFactory(com.aspose.words.shaping.harfbuzz.HarfBuzzTextShaperFactory.getInstance());

// Render the document to PDF format
doc.save("out.pdf");

See more about advanced typography features in the documentation:
https://docs.aspose.com/words/java/enable-opentype-features/

With advanced typography enabled, the output matches MS Word rather well.

Hello Alexey.
Thanks for the quick analysis.

The problematic tables are auto-fit tables

Is there some flag/option that can be applied when the HTML content is inserted into the Word document, so that auto-fit is disabled and this issue is resolved without incorporating another third-party library?

Thank you,
Yan

@oraspose I am afraid there is no such option. As mentioned in the analysis, the older table layout approach mostly relies on the widths stored inside tblGrid elements, so you can try setting explicit cell widths in your HTML. This should also resolve the problem.
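As a rough illustration of setting explicit cell widths in the HTML, here is a minimal sketch that builds a table with a fixed width on every cell. The class name, column names, and width values are hypothetical, and whether these particular widths fix the layout in a given document is an assumption that needs to be verified against the rendered PDF. The resulting string would then be passed to whatever call inserts the HTML (for example, DocumentBuilder.insertHtml).

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper: emits an HTML table in which every cell carries an explicit width.
class FixedWidthHtmlTable {

    // widthsPt holds one width (in points) per column; the values are chosen by the caller.
    static String build(List<String> headers, List<Double> widthsPt, List<List<String>> rows) {
        if (headers.size() != widthsPt.size()) {
            throw new IllegalArgumentException("One width is required per column");
        }
        StringBuilder html = new StringBuilder("<table>");
        appendRow(html, "th", headers, widthsPt);
        for (List<String> row : rows) {
            if (row.size() != widthsPt.size()) {
                throw new IllegalArgumentException("Row has wrong number of cells: " + row);
            }
            appendRow(html, "td", row, widthsPt);
        }
        return html.append("</table>").toString();
    }

    private static void appendRow(StringBuilder html, String tag,
                                  List<String> cells, List<Double> widthsPt) {
        html.append("<tr>");
        for (int i = 0; i < cells.size(); i++) {
            // Locale.ROOT keeps the decimal separator a dot regardless of the server locale.
            String width = String.format(Locale.ROOT, "%.1f", widthsPt.get(i));
            html.append('<').append(tag).append(" style=\"width:").append(width).append("pt\">")
                .append(escape(cells.get(i)))
                .append("</").append(tag).append('>');
        }
        html.append("</tr>");
    }

    // Minimal escaping so that cell text cannot break the markup.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Illustrative data only; the widths are placeholders, not recommended values.
        String html = build(
            List.of("Description", "Total"),
            List.of(200.0, 60.0),
            List.of(List.of("Sample row", "276")));
        System.out.println(html);
    }
}
```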

Can you please help me understand why the embedded tables cause an issue when published to PDF, whereas they look fine in Word?
Also, why do the exact same tables, embedded into an empty Word document with no header or footer, NOT cause the issue when the document is published to PDF?
Attached is the archive ColumnWidth_publishToPDFIssue.zip that contains:

  • DocWithSomeText_NoHeaderNoFooter.docx - input Word document without header/footer ; only embedded content tables;
  • 24_10_output_DocWithSomeText_NoHeaderNoFooter.pdf - generated PDF;
  • 24_10_output_DocWithSomeText_HeaderFooterStyles.pdf - for comparison output generated from the Word doc in the original first attachment.

Please also help me understand how to use the suggested shaping-harfbuzz-plugin library, which is .NET, on the Java server where the Word to PDF transformation takes place.
Thank you.
ColumnWidth_publishToPDFIssue.zip (155.5 KB)

@oraspose As you may know, MS Word documents are flow documents by nature, so there is no “page” concept. Consumer applications reflow the document content into pages on the fly. Aspose.Words does the same when it renders a document to PDF. We continuously work to make our document layout engine as close to MS Word as possible, but sometimes there are inconsistencies, as in your case.

shaping-harfbuzz-plugin is provided for both Java and .NET. To use it in Java you should add the following dependency:

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>24.10</version>
    <classifier>shaping-harfbuzz-plugin</classifier>
</dependency>

and modify the code as shown below:

// Open a document
Document doc = new Document("in.docx");

// When text shaper factory is set, layout starts to use OpenType features.
// getInstance() returns a static BasicTextShaperCache object wrapping HarfBuzzTextShaperFactory
doc.getLayoutOptions().setTextShaperFactory(com.aspose.words.shaping.harfbuzz.HarfBuzzTextShaperFactory.getInstance());

// Render the document to PDF format
doc.save("out.pdf");

See more about advanced typography features in the documentation:
https://docs.aspose.com/words/java/enable-opentype-features/

Perhaps I was not clear in the description of the problem. The problem is that the width of the “Total” column of the table is decreased when the PDF is generated from the Word document.
However, if the Word document contains the exact same table but no header/footer, the “Total” column in the generated PDF has the same width as in the Word document.
What is the root cause of this difference?

Thank you,
Yan

@oraspose The root cause of the problem has been described above.
When you copy the table into another document in MS Word, it most likely recalculates and updates the table grid, so the table can then be rendered properly by Aspose.Words.

Hello again.

Can you please elaborate on what is incorrect with the table column width?
As far as we can see, the tables appear to be shown correctly in Word. Also, when we use Word → SaveAs → PDF, the generated PDF has the proper column width for the “Total” column.

The way the grids are shown in the samples attached earlier, they are identical between the one that generated proper PDF (DocWithSomeText_NoHeaderNoFooter.docx) and the one where “Total” column is smaller (DocWithSomeText_HeaderFooterStyles.docx).

Thank you.

@oraspose Internally the table grid is specified like this in MS Word documents:

<w:tblGrid>
	<w:gridCol w:w="3314" />
	<w:gridCol w:w="1385" />
	<w:gridCol w:w="1002" />
	<w:gridCol w:w="924" />
	<w:gridCol w:w="858" />
	<w:gridCol w:w="14117" />
</w:tblGrid>

The values in the problematic table are invalid and that is why the table is rendered improperly.
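To check such grid values independently, the following sketch parses a tblGrid fragment with the standard Java XML parser and sums the column widths. The w:w values of gridCol are expressed in twentieths of a point (1440 per inch). The class and method names are hypothetical, and the namespace declaration on the root element is added here only so that the fragment parses on its own. For the grid above the sum is 21600, that is 15 inches, which would be wider than a US Letter or A4 page (assuming the document uses one of these page sizes).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for inspecting the column widths stored in a w:tblGrid element.
class TblGridWidths {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final int TWIPS_PER_INCH = 1440; // gridCol widths are in twentieths of a point

    // Returns the w:w value of every w:gridCol in the given XML fragment, in document order.
    static List<Integer> gridColumnWidths(String tblGridXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(tblGridXml.getBytes(StandardCharsets.UTF_8)));
            NodeList cols = dom.getElementsByTagNameNS(W_NS, "gridCol");
            List<Integer> widths = new ArrayList<>();
            for (int i = 0; i < cols.getLength(); i++) {
                Element col = (Element) cols.item(i);
                widths.add(Integer.parseInt(col.getAttributeNS(W_NS, "w")));
            }
            return widths;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse tblGrid fragment", e);
        }
    }

    static int totalTwips(List<Integer> widths) {
        int sum = 0;
        for (int w : widths) {
            sum += w;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The grid quoted in this thread, with the namespace declared on the root element.
        String xml = "<w:tblGrid xmlns:w=\"" + W_NS + "\">"
            + "<w:gridCol w:w=\"3314\"/><w:gridCol w:w=\"1385\"/><w:gridCol w:w=\"1002\"/>"
            + "<w:gridCol w:w=\"924\"/><w:gridCol w:w=\"858\"/><w:gridCol w:w=\"14117\"/>"
            + "</w:tblGrid>";
        List<Integer> widths = gridColumnWidths(xml);
        int total = totalTwips(widths);
        System.out.println("Column widths (twips): " + widths);
        System.out.println("Total width (twips): " + total);
        System.out.println("Total width (inches): " + total / (double) TWIPS_PER_INCH);
    }
}
```

In a real document the fragment can be found in word/document.xml inside the .docx archive.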

As you may know, MS Word documents are flow documents by nature, so a consumer application such as MS Word or OpenOffice reflows the document content into pages on the fly. Aspose.Words does the same when it renders the document. We are continuously working on improving our document layout engine to make it as close to MS Word as possible, but in this particular case, unfortunately, our layout engine gives an incorrect result.

I would have agreed with you, but the exact same values in the table in the other document, “DocWithSomeText_NoHeaderNoFooter.docx”, produced a proper PDF as expected.
Can you please explain what is causing this difference?

@oraspose There are many aspects that can affect the table layout. Some runs in the problematic tables have advanced typography features (ligatures). Without advanced typography, content metrics are not considered reliable and Aspose.Words uses the values specified in tblGrid, which in this particular case are invalid.