Doc to HTML: identify unrecognized characters

divesh_iris · December 6, 2012, 5:15am

Hi,
We are using Aspose.Words to convert DOC to HTML. We have a requirement to check the document for any unrecognized characters or symbols.
I have attached a dummy DOC file.
Regards,
Divesh Salian

tahir.manzoor · December 7, 2012, 8:35am

Hi Divesh,

Thanks for your inquiry. Please note that Aspose.Words tries to mimic the behavior of MS Word. A Microsoft Word document can include a number of special characters that represent fields, form fields, shapes, OLE objects, footnotes, etc. To work with these special characters, Aspose.Words provides the SpecialChar class.

Aspose.Words cannot identify unrecognized characters for you. However, you can get the text from your document and check each character yourself. Please note that all text of the document is stored in runs of text. A single Run may contain multiple characters. You can get the text of each Run through the Run.Text property. Please see the attached image.
https://reference.aspose.com/words/net/aspose.words/run/
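
The per-character check described above could look roughly like the sketch below. It is written in plain Python for illustration only and does not call Aspose.Words: it assumes the text of each Run has already been collected into a list of strings. The function name `find_suspect_chars` and the choice of what counts as "unrecognized" (the U+FFFD replacement character, private-use, unassigned, surrogate and most control characters) are assumptions made for this example, not something defined by Aspose.Words or by the attached document.

```python
import unicodedata

# Unicode general categories treated as "unrecognized" in this sketch (an assumption):
# Co = private use, Cn = unassigned, Cs = surrogate, Cc = control.
SUSPECT_CATEGORIES = {"Co", "Cn", "Cs", "Cc"}

# Ordinary whitespace controls that should not be reported.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

# U+FFFD usually marks a character that was lost during an earlier decoding step.
REPLACEMENT_CHAR = "\ufffd"


def find_suspect_chars(run_texts):
    """Return (run_index, char_index, code_point, category) for each suspect character.

    run_texts is a hypothetical list holding the text of each run, in document order.
    """
    hits = []
    for run_index, text in enumerate(run_texts):
        for char_index, ch in enumerate(text):
            if ch in ALLOWED_CONTROLS:
                continue
            category = unicodedata.category(ch)
            if ch == REPLACEMENT_CHAR or category in SUSPECT_CATEGORIES:
                hits.append((run_index, char_index, f"U+{ord(ch):04X}", category))
    return hits
```

Reporting the run index together with the character index makes it possible to go back to the run that holds the character. The category set will likely need adjusting for your documents, for example if private-use characters come from a symbol font that you actually want to keep.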

You can use the HtmlSaveOptions.Encoding property when converting from Doc/Docx to HTML. This property specifies the encoding to use when exporting to HTML, MHTML or EPUB. The default value is new UTF8Encoding(false), that is, UTF-8 without a byte order mark.
https://reference.aspose.com/words/net/aspose.words.saving/htmlsaveoptions/
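
If you plan to export with an encoding other than UTF-8, it may also help to find out beforehand which characters that encoding cannot represent. The following is a minimal Python sketch of that idea. It does not use Aspose.Words, the helper name `unencodable_chars` is hypothetical, and the encoding names are Python codec names rather than .NET Encoding objects.

```python
def unencodable_chars(text, encoding):
    """Return the distinct characters in text that the given encoding cannot represent."""
    bad = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # This character has no representation in the target encoding.
            bad.add(ch)
    return sorted(bad)


sample = "café €5 ☃"
print(unencodable_chars(sample, "latin-1"))  # characters that Latin-1 cannot hold
print(unencodable_chars(sample, "utf-8"))    # empty list, UTF-8 covers all of them
```

With the default UTF-8 encoding this check should find nothing for ordinary text, so it is mainly useful when a legacy code page is required.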