Japanese Text using MS Mincho Font not rendered correctly within Word Document

oraspose · February 26, 2024, 2:02pm

Hello Team,

I have two HTML inputs containing Japanese text with different font sizes and formatting, both using the same font family name “MS Mincho”. I have installed this font on the Linux system where the HTML is converted to an EMF image using the Aspose.Words library v24.1, and also on the Windows machine where the EMF image is embedded into the MS Excel document.

The “text2” HTML input is converted to an EMF image correctly, while the “text1” HTML input is not: it shows garbled characters.

Sample Logic:

// Imports needed here: java.nio.file.Files, java.nio.file.Paths
public void getRenderedDocument(String inpFile) {
    try {
        // inpFile is passed as "text1" or "text2"
        byte[] htmlBytes = Files.readAllBytes(Paths.get(inpFile + ".html"));
        // Define HTML loadoptions to load the HTML bytes into Word Document
        HtmlLoadOptions options = new HtmlLoadOptions();
        // To avoid converting Metafile images to PNG image.
        options.setConvertMetafilesToPng(false);

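        // Initialize Word Document with HTML bytes. Document loads from an
        // InputStream or a file name, so wrap the bytes in a java.io.ByteArrayInputStream.
        Document doc = new Document(new ByteArrayInputStream(htmlBytes), options);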
        
        // Get DocumentBuilder instance to update document properties
        // such as Size, Alignment, Format, etc.
        DocumentBuilder builder = new DocumentBuilder(doc);
        PageSetup pageSetup = builder.getPageSetup();
        Section section = doc.getFirstSection();
        Body body = section.getBody();

        // update the Page Properties such as Margin and Size
        updatePageProperties(pageSetup);

        // The source HTML passed within {@code Document} will have at least
        // 1 <table> element present when it is a Text/Note/Grid object.
        TableCollection tables = body.getTables();
        if (tables.getCount() == 1) {
            Table table = tables.get(0);
            updateTableProperties(table);
        }

        // reset document last paragraph formatting properties
        resetLastParagraphProperties(body);

        //Save docx as EMF image
        ImageSaveOptions emfOptions = new ImageSaveOptions(SaveFormat.EMF);
        emfOptions.setPageSet(new PageSet(0));        
        doc.save(inpFile + ".emf", emfOptions);
    } catch (Exception ex) {
        throw new IllegalStateException(ex.getMessage(), ex);
    }
}

private void updatePageProperties(PageSetup pageSetup) {
    double imgHeight = ConvertUtil.pixelToPoint(70.0);
    double imgWidth = ConvertUtil.pixelToPoint(385.0);
    double margin = 0;

    // reset page margin
    pageSetup.setLeftMargin(margin);
    pageSetup.setRightMargin(margin);
    pageSetup.setTopMargin(margin);
    pageSetup.setBottomMargin(margin);

    // Set header and footer distance to default 0. Required to ensure no
    // extra spacing is coming from header or footer.
    pageSetup.setFooterDistance(0);
    pageSetup.setHeaderDistance(0);

    // Default paper type is LETTER so change to CUSTOM when setting new
    // size.
    pageSetup.setPaperSize(PaperSize.CUSTOM);
    pageSetup.setPageWidth(imgWidth);
    pageSetup.setPageHeight(imgHeight);
}

private void updateTableProperties(Table table) throws Exception{
    // Reset left/right indent since table might be shifted left
    table.setLeftIndent(0);
    
    // Set Table {@code AutoFitBehavior} value
    table.autoFit(AutoFitBehavior.FIXED_COLUMN_WIDTHS);
    
    double rowHeight = ConvertUtil.pixelToPoint(70.0);
    double cellWidth = ConvertUtil.pixelToPoint(385.0);

    for (Row row : table.getRows()) {
        RowFormat rowFmt = row.getRowFormat();
        rowFmt.setAllowBreakAcrossPages(false);
        rowFmt.setHeight(rowHeight);
        rowFmt.setHeightRule(HeightRule.EXACTLY);

        for (Cell cell : row.getCells()) {
            CellFormat cellFmt = cell.getCellFormat();
            cellFmt.setWidth(cellWidth);
            cellFmt.setTopPadding(0);
            cellFmt.setBottomPadding(0);
        }
    }
    
    Node node = table.getLastRow().getLastChild();
    if (node != null && node.getNodeType() == NodeType.PARAGRAPH &&
        "\uFEFF \r".equals(node.getText())) {
        node.remove();
    }
}

private void resetLastParagraphProperties(Body body) {
    Paragraph lastPara = body.getLastParagraph();
    String paratext = lastPara.getText();
    if (StringUtils.isNullOrBlank(paratext)) {
        ParagraphFormat paraFmt = lastPara.getParagraphFormat();
        paraFmt.setPageBreakBefore(true);
    }
}

Attachments: SampleText.zip (17.7 KB)

Any idea why the Japanese characters are not showing up correctly only for the input file “text1”? Both inputs use the same font, and both are passed to the Aspose.Words library during the conversion. The issue is clear when the HTML and EMF files for “text1” are compared visually.

alexey.noskov · February 26, 2024, 2:13pm

@oraspose It looks like there is something wrong with the encoding in your text1.html. If you open it in MS Word, it also shows only garbage characters: ms.docx (14.3 KB)

However, if you explicitly specify UTF-8 encoding when loading the document, it is converted properly:

LoadOptions opt = new LoadOptions();
opt.setEncoding(Charset.forName("UTF8"));
    
Document doc = new Document("C:\\Temp\\in.html", opt);
doc.save("C:\\Temp\\out.docx");
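
To illustrate why the explicit encoding matters, here is a minimal JDK-only sketch (it makes no Aspose.Words calls). It decodes the same UTF-8 bytes once as UTF-8 and once as ISO-8859-1, and it also shows a strict check that a byte array is well-formed UTF-8 before you force that encoding. The class name `EncodingCheck`, the helper `isValidUtf8`, and the choice of ISO-8859-1 as the "wrong" charset are assumptions made for the example; the thread does not establish which encoding was actually picked for text1.html.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingCheck {

    // Hypothetical helper: returns true only if the bytes are well-formed UTF-8.
    // The decoder is configured to report errors instead of silently
    // substituting replacement characters.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Japanese sample text written with Unicode escapes so the example
        // does not depend on the encoding of this source file.
        String original = "<p>\u65E5\u672C\u8A9E</p>";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Same bytes, two different decoders.
        String decodedAsUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        String decodedAsLatin1 = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(decodedAsUtf8.equals(original));   // true
        System.out.println(decodedAsLatin1.equals(original)); // false: garbled text

        // 0x93 0xFA is not a valid UTF-8 sequence (0x93 cannot start a character).
        byte[] notUtf8 = {(byte) 0x93, (byte) 0xFA};
        System.out.println(isValidUtf8(utf8Bytes)); // true
        System.out.println(isValidUtf8(notUtf8));   // false
    }
}
```

If a check like `isValidUtf8` fails for an input file, forcing UTF-8 at load time would likely not help, and the file's real encoding would need to be identified first.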

oraspose · February 26, 2024, 2:24pm

@alexey.noskov Specifying the encoding worked for me. Thanks for the pointer. I really appreciate the quick response.