Aspose Words Java and Unicode

Since the Aspose LINQ Reporting Engine doesn’t support every valid Unicode character that can appear in an XML tag name, I am trying to find an algorithm that either identifies XML tags that are not valid or escapes them before they are passed to the engine.

Some Unicode characters in the extended Latin range work, such as Ǎ (U+01CD), but others in the same range don’t, such as Nj (U+01CA).

What algorithm does the LINQ engine use to determine valid characters when doing the merge from XML?

For example:

<TestNj>TestNj</TestNj>, where Nj is U+01CA, generates the following error:

com.aspose.words.net.System.Data.DataException: asposewobfuscated.zz33: Unexpected character ‘?’ (code 459 / 0x1cb) excepted space, or ‘>’ or "/>"
at [row,col {unknown-source}]: [5,18]
at com.aspose.words.net.System.Data.DataSet.readXml(Unknown Source)
at com.aspose.words.net.System.Data.DataSet.readXml(Unknown Source)
at com.aspose.words.net.System.Data.DataSet.readXml(Unknown Source)
at word.WordImportExportInterface.main(WordImportExportInterface.java:74)
Caused by: asposewobfuscated.zz33: Unexpected character ‘?’ (code 459 / 0x1cb) excepted space, or ‘>’ or "/>"
at [row,col {unknown-source}]: [5,18]
at asposewobfuscated.zz1Z.zzr(Unknown Source)
at asposewobfuscated.zz29.zzYk(Unknown Source)
at asposewobfuscated.zz29.zzYl(Unknown Source)
at asposewobfuscated.zz29.zzYRx(Unknown Source)
at asposewobfuscated.zz29.next(Unknown Source)
at com.aspose.words.net.System.Data.zzT.zzZ(Unknown Source)
at com.aspose.words.net.System.Data.zzT.zzZ(Unknown Source)
at com.aspose.words.net.System.Data.zzT.zzYVe(Unknown Source)
at com.aspose.words.net.System.Data.DataSet.readXml(Unknown Source)
… 3 more
asposewobfuscated.zz33: Unexpected character ‘?’ (code 459 / 0x1cb) excepted space, or ‘>’ or "/>"
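
Note that the parser reports code 459 / 0x1cb, while the post names U01CA, so it may be worth confirming which code point the tag really contains. A minimal sketch for dumping the code points of a tag name follows; the class name CodePointDump and the sample string are made up for illustration and are not part of Aspose.Words.

```java
import java.util.stream.Collectors;

// Hypothetical helper, not part of any library.
final class CodePointDump {

    // Formats every code point of the input as "U+XXXX", separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // \u01CB is the code point named in the parser error (0x1cb).
        System.out.println(describe("Test\u01CB"));
    }
}
```

Running the same dump over the real tag name read from the XML file would show whether the file holds U+01CA or U+01CB.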

See the attached XML file.

The import code looks like:

package word; 

import com.aspose.words.*;
import com.aspose.words.net.System.Data.DataSet;
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class TestImport {

    public static void main(String[] args)
    {
        try {
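            // Load the Word template that contains the <<[...]>> tags.
            Document doc = new Document("TemplateWithUTF8.doc");
            ReportingEngine engine = new ReportingEngine();
            // Read the XML data; readXml is the call that fails in the stack trace above.
            DataSet ds = new DataSet("items");
            ds.readXml("data4.xml");
            // Merge the data into the template, exposing the data set as "ds".
            engine.buildReport(doc, ds, "ds");
            doc.updateFields();
            doc.updatePageLayout();
            doc.save("Out.docx");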
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
}

Hi Mark,

Thanks for your inquiry. We introduced the DataSet.readXml method in Aspose.Words for Java to read XML with the same functionality as System.Data.DataSet.ReadXml in .NET.

System.Data.DataSet.ReadXml also throws an exception when reading the shared XML. Could you please share some details about your test case? We will then be able to provide more information about your query.

We are planning to use Aspose to build an "export to MS Word" feature from XML in our product. However, our product can be highly customized by our customers to include many custom fields and a defined workflow. When users create a new field, they give it a name and a field code that represents the field. We are using the field codes as the XML keys for the data.
These field codes can contain any number of special characters, spaces, percent signs, etc., which are not valid in an XML tag name and therefore not valid in the Word template tag <<[fieldcode]>> either.
We are therefore planning to pre-process the field codes and escape any characters that cannot appear in an XML tag and would generate errors during LINQ processing.
I first tried limiting the characters to those categorized as Letter - Lowercase, Letter - Uppercase, and Number, Decimal Digit in the Unicode Character Categories, https://www.fileformat.info/info/unicode/category/index.htm. However, there were issues with blocks such as the Arabic Unicode block, where Arabic letters would work but Arabic numbers wouldn’t.
Secondly, I tried limiting it to the Basic Latin, Latin-1 Supplement, Latin Extended-A and Latin Extended-B blocks, https://en.wikipedia.org/wiki/Unicode_block.
However, characters in Latin Extended-B such as:
Nj - Latin Capital Letter NJ (U+01CA), https://www.fileformat.info/info/unicode/char/01ca/index.htm, do not work.
But letters in the same block right next to it, such as:
Ǎ - Latin Capital Letter A with caron (U+01CD), https://www.fileformat.info/info/unicode/char/01cd/index.htm, work fine.
I also tried excluding all “Combined” characters in the Unicode standard, but that didn’t make any difference.
I need to know specifically how to determine which characters will work in the XML tags and the Word templates, and which ones I will have to escape to make them valid. I have tried all of the various categorizations in the Unicode standard and cannot come up with an algorithm that is foolproof. Also, we are doing the XML generation in C++.
Regards.
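
The pre-processing step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it passes through ASCII letters and non-leading ASCII digits, and writes every other code point as _xHHHH_, a convention borrowed from .NET's XmlConvert.EncodeName. The class FieldCodeEscaper and the choice of such a narrow allow-list are assumptions of this sketch, not something Aspose.Words prescribes. It is written in Java to match the import code; the same loop ports directly to C++.

```java
// Hypothetical helper, not part of Aspose.Words.
final class FieldCodeEscaper {

    // Assumption: only ASCII letters and non-leading ASCII digits are kept as-is.
    // Everything else, including '_' itself, becomes _xHHHH_ so that the
    // mapping stays unambiguous and can be reversed later.
    // An empty input yields an empty string, which the caller must reject,
    // because an empty string is not a valid XML name.
    static String escape(String fieldCode) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fieldCode.length(); ) {
            int cp = fieldCode.codePointAt(i);
            i += Character.charCount(cp);
            boolean asciiLetter = (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
            boolean asciiDigit = cp >= '0' && cp <= '9';
            boolean first = out.length() == 0;
            if (asciiLetter || (asciiDigit && !first)) {
                out.appendCodePoint(cp);
            } else {
                // '_' is a legal first character of an XML name, so an
                // escaped leading character still produces a legal name.
                out.append(String.format("_x%04X_", cp));
            }
        }
        return out.toString();
    }
}
```

With this scheme the same escaped string would be used both as the XML tag name and inside the <<[...]>> template tag. The allow-list is deliberately narrower than what XML permits, so the result does not depend on which edition of the XML name rules the parser applies.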

Hi Mark,

Thanks for your inquiry. We have logged a feature request as WORDSJAVA-1264 for your requirements. Our product team will look into the possibility of implementing this feature. Once we have any information about it, we will update you via this forum thread.

Please let us know if you have any more queries.

Hi

The best way to determine which characters will work in XML tags is to look at the “Names” section of the XML specification.

Aspose.Words relies on the Woodstox library, which follows the XML 1.0/1.1 standards.

So you should have a look at WstxInputData.isNameChar(char) and XmlChars.is11NameChar(char).
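
For reference, here is a minimal sketch of the NameStartChar and NameChar productions as written in XML 1.1 (XML 1.0 Fifth Edition uses the same ranges). It is a standalone transcription of the spec ranges, not the Woodstox code, and the class name Xml11Names is made up. Editions of XML 1.0 before the fifth define names through the narrower Letter/BaseChar tables instead. As far as I recall, those list #x0180-#x01C3 and #x01CD-#x01F0 but skip #x01C4-#x01CC, which would match the behaviour reported in this thread if the parser applies the older rules to XML 1.0 documents. Please treat that last part as an assumption to verify against the Woodstox sources.

```java
// Hypothetical helper: a transcription of the XML 1.1 "Names" productions.
final class Xml11Names {

    // NameStartChar production of XML 1.1.
    static boolean isNameStartChar(int cp) {
        return cp == ':' || cp == '_'
            || (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
            || (cp >= 0xC0 && cp <= 0xD6) || (cp >= 0xD8 && cp <= 0xF6)
            || (cp >= 0xF8 && cp <= 0x2FF) || (cp >= 0x370 && cp <= 0x37D)
            || (cp >= 0x37F && cp <= 0x1FFF) || (cp >= 0x200C && cp <= 0x200D)
            || (cp >= 0x2070 && cp <= 0x218F) || (cp >= 0x2C00 && cp <= 0x2FEF)
            || (cp >= 0x3001 && cp <= 0xD7FF) || (cp >= 0xF900 && cp <= 0xFDCF)
            || (cp >= 0xFDF0 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0xEFFFF);
    }

    // NameChar production of XML 1.1: NameStartChar plus a few extras.
    static boolean isNameChar(int cp) {
        return isNameStartChar(cp) || cp == '-' || cp == '.'
            || (cp >= '0' && cp <= '9') || cp == 0xB7
            || (cp >= 0x300 && cp <= 0x36F) || (cp >= 0x203F && cp <= 0x2040);
    }

    // True if the whole string is a Name: one NameStartChar, then NameChars.
    static boolean isName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!isNameStartChar(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

Under these ranges both U+01CA and U+01CD are accepted, so a tag that passes this check can still be rejected by a parser that follows the older XML 1.0 tables.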

I hope this will help.

Thanks