Hi,
As per the subject: what charset encoding does the text string returned from Range.getText() use when parsing a Word document?
Is it the system default Java charset, or is it UTF-8?
Are there any other methods that can be used to grab the text from a Word document in a specific encoding?
(Something similar to the extractText(java.lang.String encoding) method from the pdfextractor class.)
Hi
Thanks for your inquiry. Please follow the link to learn how you can extract text from a Word document:
You can set encoding using code like the following:
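// Load the source Word document.
Document doc = new Document("C:\\Temp\\test.doc");
// Charset.defaultCharset() is the JVM default; pass e.g. Charset.forName("UTF-8") to force a specific encoding.
doc.getSaveOptions().setTxtExportEncoding(Charset.defaultCharset());
doc.save("C:\\Temp\\out.txt");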
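As general Java background (not specific to Aspose.Words): a java.lang.String holds Unicode text as UTF-16 code units in memory, so the value returned from Range.getText() has no byte charset of its own. An encoding only comes into play when the string is converted to bytes. Below is a minimal sketch of that conversion, assuming the text has already been extracted; the class name TextEncodingSketch, the encode helper, and the sample text are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TextEncodingSketch {

    // Hypothetical helper: encodes already-extracted text into bytes
    // using the charset chosen by the caller.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by Range.getText() ("Cafe" with an accented e).
        String text = "Caf\u00e9";

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);

        // The same string yields different byte sequences per charset.
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);

        // Decoding with the matching charset restores the original string.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

So if you only need the text in memory, no encoding choice is required; pick a charset at the point where you write the string to a file or stream.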
Hope this helps.
Best regards,