Special characters like 0x1e

We have a problem where the Unicode character 0x1e displays as - in Word, but the Java parser doesn't like it. I'm trying to find out whether anyone has had the same kind of situation and was able to convert the Unicode character to its equivalent display character. I would appreciate any help.

Hi,

Thanks for your request.
Could you attach the document that contains the special character? Also, please provide more details about when exactly the problem with the Java parser occurs. If it throws an exception, please attach the stack trace.

Thanks,

Hi Denis,
Thank you for your quick response. The Word document that we are trying to parse has control characters like 0x1e, which is a soft hyphen and shows as a hyphen in Word. I was trying to see if there is a way to tell your API to convert these special characters to the characters that would appear in Word, e.g. 0x1e as - (hyphen).
Thank you.

Here is the stack trace we are getting:

* Root cause is: An invalid XML character(Unicode: 0x1e) was found in the element content of the document.
org.xml.sax.SAXParseException: An invalid XML character(Unicode: 0x1e) was found in the element content of the document.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at org.apache.taglibs.standard.tag.common.xml.ParseSupport.parseInputSource(ParseSupport.java:227)
at org.apache.taglibs.standard.tag.common.xml.ParseSupport.parseInputSourceWithFilter(ParseSupport.java:193)
at org.apache.taglibs.standard.tag.common.xml.ParseSupport.parseReaderWithFilter(ParseSupport.java:199)
at org.apache.taglibs.standard.tag.common.xml.ParseSupport.parseStringWithFilter(ParseSupport.java:206)
at org.apache.taglibs.standard.tag.common.xml.ParseSupport.doEndTag(ParseSupport.java:138)
at org.apache.jsp.WEB_002dINF.jsp.faculty.facultyResume_jsp._jspx_meth_x_005fparse_005f0(facultyResume_jsp.java:793)
at org.apache.jsp.WEB_002dINF.jsp.faculty.facultyResume_jsp._jspService(facultyResume_jsp.java:326)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:373)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:336)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
at

Hi

Thanks for your request. I think you can simply replace the soft hyphen with a regular hyphen before parsing the document. For example, you can try code like the following.

import com.aspose.words.ControlChar;
import com.aspose.words.Document;

// Open the source document.
Document doc = new Document("C:\\Temp\\in.doc");
// Replace the non-breaking hyphen (0x1e) with a regular hyphen.
doc.getRange().replace(String.valueOf(ControlChar.NON_BREAKING_HYPHEN_CHAR), "-", false, false);
// Print the document's text.
System.out.println(doc.toTxt());

Hope this helps.
Best regards.
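
If the text has already been extracted from the document as a plain Java string, the same substitution can be sketched without any document API. The following is only an illustration: the class name WordControlChars, the method toDisplayText, and the two-entry mapping are assumptions made for this example, not part of any library. It maps 0x1E (non-breaking hyphen) to "-" and drops 0x1F (optional hyphen), which Word normally does not display. Further characters would need to be added to the table as they are identified.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: replaces a few Word control characters with the
// text a reader would typically see in Word. The mapping is an assumption
// and is deliberately small; extend it as more characters are identified.
final class WordControlChars {
    private static final Map<Character, String> DISPLAY = new LinkedHashMap<>();
    static {
        DISPLAY.put((char) 0x1E, "-"); // non-breaking hyphen, displayed as "-"
        DISPLAY.put((char) 0x1F, "");  // optional hyphen, normally not displayed
    }

    // Returns a copy of text with every mapped control character replaced.
    // Characters that are not in the table are copied unchanged.
    static String toDisplayText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = DISPLAY.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "co" + (char) 0x1E + "operate";
        System.out.println(toDisplayText(raw));
    }
}
```

Keeping the mapping in one table means a newly discovered character only needs one extra line rather than another replace call.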

Thank you very much for the reply. I can do it that way, but there are a lot more characters than this one. It will be really difficult to find all the characters and their equivalent Word display characters. I actually wanted to know if you have a library that converts all the control characters that Word uses. This must be a common problem that your customers face.

Hi

Thanks for your inquiry. Judging by the stack trace you provided, I suppose you are putting the string into an XML document. So maybe you can just use a CDATA section in the XML, which removes the need to escape the data. Please let me know if this approach works for you.
Best regards.
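
One caveat worth checking before relying on CDATA: as far as the XML 1.0 specification goes, characters such as 0x1E are not permitted anywhere in a document, CDATA sections included, so the parser may still reject them. A more general fallback is to drop every character outside the XML 1.0 character range before building the XML. The sketch below assumes the text is available as a Java string. The names XmlTextSanitizer, isValidXml10, and stripInvalidXmlChars are made up for this example. Note that it removes characters rather than converting them, so any character that should be displayed (such as 0x1E as a hyphen) has to be replaced first.

```java
// Hypothetical helper: removes code points that XML 1.0 does not allow,
// so that the remaining text can be placed in element content.
final class XmlTextSanitizer {
    // XML 1.0 "Char" production: tab, LF, CR, and the ranges below.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns text with every disallowed code point removed.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlTextSanitizer::isValidXml10)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "soft" + (char) 0x1E + "hyphen";
        System.out.println(stripInvalidXmlChars(raw));
    }
}
```

Markup characters such as < and & still need the usual escaping (or a CDATA section) after this step; the filter only deals with characters the parser rejects outright.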