FileCorruptedException loading HTML when running in servlet container

Hi,
I’m running the Diablo SE 1.6.0_07-b02 runtime on FreeBSD 7.1 i386 with Aspose Words 4.0.2. The .doc file referenced in the code below is attached. The HTML is inline. Their specific contents don’t seem to matter.

From the command line the code runs correctly (catches no exceptions). When run in a servlet container (I’ve tested in Jetty 6.1.25 and Tomcat 6.0.18), however, it displays (I’ve attached the stack trace at the bottom of this message):

Loaded doc ok
Exception from html (Words): com.aspose.words.FileCorruptedException: The document appears to be corrupted and cannot be loaded.

I’ve tried it with Aspose.Words.jdk16.jar in both the /WEB-INF/lib/ directory (where the webapp classloader finds it), and where the common server-wide classloader will find it, with the same result. The code DOES work as expected with an old version of Aspose Words (2.6.0).

At the bottom of the stack trace it suggests it’s caused by an ‘UnsupportedCharsetException’. I’m not specifying UTF-7 anywhere (the document itself contains only ASCII characters). Has something changed in the HTML import code with regard to character sets, or is this a red herring?

Thanks, and if there’s any additional information that would be useful please let me know.

Steve

======= HTML

<html>
<head>
    <title>Title</title>
</head>
<body>
    Body
</body>
</html>

======= web.xml (lives in /WEB-INF/web.xml)

<?xml version="1.0" encoding="UTF-8" ?>
<web-app>
	<display-name>Example</display-name>
	<!-- Declare the existence of a servlet. -->
	<servlet>
		<servlet-name>AsposeImport</servlet-name>
		<servlet-class>AsposeImport</servlet-class>
	</servlet>
	<!-- Map URLs to that servlet. -->
	<servlet-mapping>
		<servlet-name>AsposeImport</servlet-name>
		<url-pattern>/servlet</url-pattern>
	</servlet-mapping>
</web-app>

======= CODE

import javax.servlet.http.*;
import javax.servlet.*;
import java.io.*;
import com.aspose.words.*;

public class AsposeImport extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res) throws ServletException, IOException
  {
    PrintWriter out = res.getWriter();
    try
    {
      Document d = new Document("/home/steve/test.doc");
      out.println("Loaded doc ok");
    }
    catch (Exception e) { out.println("Exception from doc (Words): " + e.toString()); }
    try
    {
      Document d = new Document("/home/steve/simple.html");
      out.println("Created html");
    }
    catch (Exception e) { out.println("Exception from html (Words): " + e.toString()); }
    out.close();
  }
  public static void main(String[] args)
  {
    System.out.println("MAIN");
    try
    {
      Document d = new Document("/home/steve/test.doc");
      System.out.println("Loaded doc ok");
    }
    catch (Exception e) { System.out.println("Exception (doc): " + e.toString()); }
    try
    {
      Document d = new Document("/home/steve/simple.html");
      System.out.println("Created html");
    }
    catch (Exception e) { System.out.println("Exception (html): " + e.toString()); }
  }
}

========== STACKTRACE

com.aspose.words.FileCorruptedException: The document appears to be corrupted and cannot be loaded.
at com.aspose.words.Document.a(Document.java:1371)
at com.aspose.words.Document.b(Document.java:1358)
at com.aspose.words.Document.a(Document.java:1246)
at com.aspose.words.Document.<init>(Document.java:143)
at com.aspose.words.Document.<init>(Document.java:117)
at AsposeImport.doGet(AsposeImport.java:26)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:440)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.lang.IllegalStateException: java.nio.charset.UnsupportedCharsetException: UTF-7
at asposewobfuscated.kj.Pc(Encoding.java:475)
at asposewobfuscated.kj.OY(Encoding.java:237)
at asposewobfuscated.kj.af(Encoding.java:453)
at com.aspose.words.iy.nW(FileFormatDetector.java:246)
at com.aspose.words.iy.g(FileFormatDetector.java:38)
at com.aspose.words.Document.b(Document.java:1262)
... 23 more
Caused by: java.nio.charset.UnsupportedCharsetException: UTF-7
at java.nio.charset.Charset.forName(Charset.java:506)
at asposewobfuscated.kj.Pc(Encoding.java:471)
... 28 more
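
One way to tell whether the UnsupportedCharsetException is a red herring is to ask the JVM directly whether it can resolve UTF-7, once from the command line and once from inside the servlet. The following is a minimal sketch using only the JDK; the class name CharsetCheck and its isAvailable helper are hypothetical and not part of Aspose.Words. It makes no claim about why the two environments might differ, it only reports what each one sees.

```java
import java.nio.charset.Charset;

class CharsetCheck {
    // Returns true if the running JVM can resolve the named charset.
    public static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for a null or syntactically illegal charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        // Run this from the command line, then call isAvailable("UTF-7")
        // from doGet and compare the two results.
        System.out.println("UTF-7 supported: " + isAvailable("UTF-7"));
        System.out.println("UTF-8 supported: " + isAvailable("UTF-8"));
    }
}
```

If the two environments report different answers for UTF-7, the charset lookup is probably relevant to the failure rather than incidental.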

Hi there,
Thanks for your inquiry.
Could you please try loading your HTML file using the code below and see if that solves the issue?

Document doc = new Document("/home/steve/simple.html", LoadFormat.HTML);

If that does not work, could you please try loading it from a stream as well? (The input can still come from the disk.)

InputStream is = new ByteArrayInputStream(html.getBytes());
Document doc = new Document(is, "", LoadFormat.HTML, "");
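
The stream snippet assumes an `html` string that is already in memory. When the HTML lives on disk, one possible way to get an equivalent stream is to read the file's raw bytes, which also avoids the platform-default encoding that `getBytes()` applies. This is a sketch using only JDK classes available in Java 6; the class name HtmlStreamSource and its open method are hypothetical, and the Aspose call appears only as a comment, copied from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class HtmlStreamSource {
    // Reads the whole file into memory and returns a stream over its bytes.
    public static InputStream open(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new ByteArrayInputStream(buf.toByteArray());
        } finally {
            // Close the file handle whether or not reading succeeded.
            in.close();
        }
    }
}

// Usage, with the Aspose call taken from the snippet above:
//   InputStream is = HtmlStreamSource.open("/home/steve/simple.html");
//   Document doc = new Document(is, "", LoadFormat.HTML, "");
```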

Thanks,

Hi Adam,
Thanks, these both work (and I’ve tested on our real application). Are there likely to be other problems with the automatic type determination (i.e. should I always specify the LoadFormat)?

Thanks again,

Steve

Hi

Thanks for your inquiry. As far as I know, this problem occurs only with the HTML format. You can try using the Document.detectFileFormat method to detect the document's format before loading. If this method returns LoadFormat.UNKNOWN, you can then try loading the document with LoadFormat.HTML.
https://reference.aspose.com/words/java/com.aspose.words/fileformatutil#detectFileFormat(java.lang.String)
Best regards,