Aspose.Words version 22.5 converts space characters to Unicode U+F020 in PDF file

license-it · June 8, 2022, 6:14pm

We use Aspose.Words to convert DOC and DOCX files to PDF. After upgrading to version 22.5 (from 21.2), we found that some space characters in a DOCX file are converted to Unicode U+F020, which is not the standard encoding for a space. Version 21.2 converted these characters to standard spaces.

image-20220603-154418.png (306.3 KB)
image-20220603-154550.png (156.3 KB)

The Word file is:
127.docx (25.3 KB)

alexey.noskov · June 9, 2022, 4:47am

@license-it Aspose.Words does not replace symbols in your document when exporting to PDF. If you check the source document, you will notice that the whitespace characters you highlighted are inserted as symbols. You can also see this if you unzip the DOCX document and inspect document.xml:

<w:r>
	<w:rPr>
		<w:rFonts w:ascii="Symbol" w:eastAsia="Symbol" w:hAnsi="Symbol" w:cs="Symbol"/>
		<w:sz w:val="19"/>
		<w:szCs w:val="19"/>
	</w:rPr>
	<w:t></w:t>
</w:r>
<w:r>
	<w:rPr>
		<w:rFonts w:ascii="Symbol" w:eastAsia="Symbol" w:hAnsi="Symbol" w:cs="Symbol"/>
		<w:sz w:val="19"/>
		<w:szCs w:val="19"/>
	</w:rPr>
	<w:t></w:t>
</w:r>
<w:r>
	<w:rPr>
		<w:rFonts w:ascii="Symbol" w:eastAsia="Symbol" w:hAnsi="Symbol" w:cs="Symbol"/>
		<w:sz w:val="19"/>
		<w:szCs w:val="19"/>
	</w:rPr>
	<w:t></w:t>
</w:r>
<w:r>
	<w:rPr>
		<w:rFonts w:ascii="Symbol" w:eastAsia="Symbol" w:hAnsi="Symbol" w:cs="Symbol"/>
		<w:spacing w:val="47"/>
		<w:sz w:val="19"/>
		<w:szCs w:val="19"/>
	</w:rPr>
	<w:t></w:t>
</w:r>
<w:r>
	<w:rPr>
		<w:rFonts w:ascii="Arial" w:eastAsia="Arial" w:hAnsi="Arial" w:cs="Arial"/>
		<w:spacing w:val="-1"/>
		<w:sz w:val="19"/>
		<w:szCs w:val="19"/>
	</w:rPr>
	<w:t>feeling</w:t>
</w:r>

You can also check this with Aspose.Words code:

Document doc = new Document(@"C:\Temp\in.docx");

NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
// Get the paragraph with the mentioned text
Paragraph p = (Paragraph)paragraphs[8];
foreach (Run r in p.Runs)
{
    if (r.Font.Name == "Symbol")
    {
        Console.WriteLine("{0:X}", (int)r.Text[0]);
    }
}

The code outputs:

F0B7
F020
F020
F020

So Aspose.Words exports to PDF exactly what is in the source document.
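
The values printed above all fall in the U+F000 to U+F0FF range of the Unicode Private Use Area, which is where Word stores characters from symbolic fonts such as Symbol. As a minimal sketch (it does not use Aspose.Words, and the class and method names are hypothetical), subtracting 0xF000 recovers the character code inside the font:

```csharp
using System;

public static class SymbolPuaDemo
{
    // Hypothetical helper: true when the character lies in U+F000..U+F0FF.
    public static bool IsSymbolPua(char c) => c >= '\uF000' && c <= '\uF0FF';

    // Removes the 0xF000 offset to get the code inside the symbolic font.
    // Characters outside the range are returned unchanged.
    public static int ToFontCode(char c) => IsSymbolPua(c) ? c - 0xF000 : c;

    public static void Main()
    {
        foreach (char c in new[] { '\uF0B7', '\uF020' })
            Console.WriteLine("{0:X4} -> {1:X2}", (int)c, ToFontCode(c));
        // Prints:
        // F0B7 -> B7
        // F020 -> 20
    }
}
```

So F020 is code 0x20 (space) in the Symbol font, and F0B7 is code 0xB7, the Symbol bullet.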

license-it · June 9, 2022, 10:19pm

Hi @alexey.noskov
Thanks for the reply. It is very useful.

But I used the same code to test Aspose.Words versions 22.5 and 22.4, and the results differ.

Version 22.4 recognizes the Symbol space and translates it into a normal space:
Aspose-word-22.4.png (148.7 KB)

Version 22.5 does not recognize the Symbol space:
Aspose-word-22.5.png (164.5 KB)

Does Aspose.Words 22.5 no longer translate the Symbol font? Is there a way to enable the translation?

alexey.noskov · June 10, 2022, 5:17am

@license-it Could you please attach your output PDF documents here for testing? On my side, the spaces look as you expect.

I suspect the issue occurs because the Symbol font is not available in the environment where the conversion is performed and has therefore been substituted. Could you please make sure the same set of fonts is used for conversion with both the 22.4 and 22.5 versions of Aspose.Words?

license-it · June 10, 2022, 2:11pm

Hi @alexey.noskov

The source code
Program.zip (582 Bytes)

The PDF from 22.5
127-22.5.pdf (3.2 MB)

The PDF from 22.4
127-22.4.pdf (3.2 MB)

The source code is the same for both versions, and both were run only on my machine, so the system environment is the same for both versions.

alexey.noskov · June 10, 2022, 3:28pm

@license-it Thank you for the additional information. I have managed to reproduce the problem and logged it as WORDSNET-23969. We will keep you updated and let you know once it is resolved or we have more information for you.
Note that the problem is reproducible only if PdfSaveOptions.EmbedFullFonts is enabled.

license-it · June 14, 2022, 1:37am

@alexey.noskov Thanks for the reply.
We need to include all font information in the PDF file.

alexey.noskov · June 14, 2022, 4:37am

@license-it Thank you for additional information. We will keep you informed and let you know once it is resolved.

Konstantin.Kornilov · June 16, 2022, 5:45pm

@license-it
As @alexey.noskov said, Aspose.Words currently does not replace Private Use Area (PUA) characters from symbolic fonts with standard Unicode characters. The conversion of the U+F020 PUA character to the U+0020 space character in previous Aspose.Words versions was an exception, not documented behavior. We will consider adding PUA-to-Unicode conversion when saving to PDF for common Windows symbolic fonts such as “Symbol”, “Wingdings”, etc. You will be informed in this topic once it is implemented.

As a workaround, to get correct text extraction from the PDF output, you could replace the PUA characters from symbolic fonts with Unicode characters from proper Unicode fonts. This can be done either in the Microsoft Word GUI or in the Aspose.Words DOM. For the space character, any font can be used. For other PUA characters, the “Segoe UI Symbol” Unicode font can be used. You can see an example of the conversion tables here. You can also check the documentation article about PUA characters in the context of accessibility.
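
The character-level part of this workaround can be sketched as a lookup table. The example below is only an illustration: the class and method names are hypothetical, and the table holds just the two Symbol characters seen in this topic (space and bullet), not a full conversion table. It works on plain strings; with Aspose.Words the converted text would still have to be written back to the run and the run's font changed to a Unicode font.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SymbolToUnicode
{
    // Illustrative two-entry excerpt, not a complete Symbol conversion table.
    static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { '\uF020', ' ' },       // Symbol space  -> U+0020 SPACE
        { '\uF0B7', '\u2022' },  // Symbol bullet -> U+2022 BULLET
    };

    // Replaces every mapped PUA character; all other characters are kept as-is.
    public static string ToUnicode(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
            sb.Append(Map.TryGetValue(c, out char mapped) ? mapped : c);
        return sb.ToString();
    }

    public static void Main()
    {
        string converted = ToUnicode("\uF0B7\uF020\uF020\uF020");
        foreach (char c in converted)
            Console.Write("{0:X4} ", (int)c);
        Console.WriteLine();
        // Prints: 2022 0020 0020 0020
    }
}
```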

license-it · June 16, 2022, 5:57pm

@Konstantin.Kornilov, Thanks for the info.

you could replace the PUA characters from symbolic fonts with Unicode characters from proper Unicode fonts. It could be done either in Microsoft Word GUI or in Aspose.Words DOM.

Do you have example code that demonstrates how to replace PUA characters via the Aspose.Words DOM? Thanks.

alexey.noskov · June 16, 2022, 7:37pm

@license-it You can try using code like the following:

Document doc = new Document(@"C:\Temp\in.docx");

// Replace U+F020 with a regular space and reset its font to the default font used in the document.
FindReplaceOptions opt = new FindReplaceOptions();
opt.ApplyFont.Name = doc.Styles.DefaultFont.Name;
doc.Range.Replace("\uF020", " ", opt);

PdfSaveOptions pdfOpt = new PdfSaveOptions();
pdfOpt.EmbedFullFonts = true;

doc.Save(@"C:\Temp\out.pdf", pdfOpt);

license-it · June 16, 2022, 8:58pm

@alexey.noskov Thank you very much for the example.

Just one question:

How can I check whether the F020 character uses the Symbol font?

Thanks for the help!

alexey.noskov · June 17, 2022, 4:09am

@license-it You can loop through the Run nodes and check their fonts. If the font is Symbol, perform the replace operation:

Document doc = new Document(@"C:\Temp\in.docx");

// Options that reset the font of the replaced text to the default font used in the document.
FindReplaceOptions opt = new FindReplaceOptions();
opt.ApplyFont.Name = doc.Styles.DefaultFont.Name;

NodeCollection runs = doc.GetChildNodes(NodeType.Run, true);
foreach (Run r in runs)
{
    // Only touch runs that use the Symbol font and contain U+F020.
    if (r.Font.Name.Equals("Symbol") && r.Text.Contains("\uF020"))
        r.Range.Replace("\uF020", " ", opt);
}

PdfSaveOptions pdfOpt = new PdfSaveOptions();
pdfOpt.EmbedFullFonts = true;

doc.Save(@"C:\Temp\out.pdf", pdfOpt);
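
Before replacing anything, it may help to see which PUA characters a document actually contains, since only U+F020 is handled above. The following is a rough sketch under assumptions: the helper name is hypothetical, and it takes plain (font name, text) pairs so that it runs without Aspose.Words. In a real document the pairs would presumably come from r.Font.Name and r.Text of each Run, as in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PuaAudit
{
    // Hypothetical helper: groups the distinct U+F000..U+F0FF characters by font name.
    public static Dictionary<string, SortedSet<char>> Collect(
        IEnumerable<(string Font, string Text)> runs)
    {
        var found = new Dictionary<string, SortedSet<char>>();
        foreach (var (font, text) in runs)
        {
            foreach (char c in text.Where(ch => ch >= '\uF000' && ch <= '\uF0FF'))
            {
                if (!found.TryGetValue(font, out var set))
                    found[font] = set = new SortedSet<char>();
                set.Add(c);
            }
        }
        return found;
    }

    public static void Main()
    {
        // Sample data standing in for the (Font.Name, Text) of each Run.
        var runs = new[]
        {
            ("Symbol", "\uF0B7"),
            ("Symbol", "\uF020\uF020"),
            ("Arial", "feeling"),
        };
        foreach (var pair in Collect(runs))
            Console.WriteLine("{0}: {1}", pair.Key,
                string.Join(" ", pair.Value.Select(c => ((int)c).ToString("X4"))));
        // Prints: Symbol: F020 F0B7
    }
}
```

Any character reported here other than F020 would need its own replacement, for example from the conversion tables mentioned earlier in this topic.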