Problem extracting text from MS Word doc with evaluation version

thomasthomas1999 · September 29, 2006, 6:03am

I have recently returned to the Document Management System part of our product, which requires text extraction from MS Word documents so that they can be indexed for search. I was trying to evaluate some products that do this as an alternative to the open source POI project, which seems to work but is in alpha status. Aspose was one that we were considering. The API for what we want seems very simple, which is nice. I’m calling the following code to extract text:

Document doc = new Document(INPUT_FILE);
String text = doc.getText();

This seems to work fine until the Word documents get larger (4MB is a problem). Is there a known limit to the size of documents that can be handled in this way, or is this a feature of the evaluation version? The error is out of memory; I have tried increasing the allocation on the command line with no effect.
Some large documents don’t give out of memory but:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt(String.java:558)
at com.aspose.words.Field.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.cv.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.axxia.aspose.AsposeTextExtraction.testWordDocumentExtraction(AsposeTextExtraction.java:29)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:478)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:344)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

Would appreciate any advice thanks.

miklovan · September 29, 2006, 11:33am

I think there is a limitation on string size in Java. Try saving the document to a stream instead, using the Document.save method with SaveFormat.FORMAT_TEXT as a parameter.
Best regards,

thomasthomas1999 · October 5, 2006, 9:14am

Hi,

I’m now testing with the code that you suggested, thanks. I still get the StringIndexOutOfBoundsException with certain documents.
On further testing I don’t think the StringIndexOutOfBoundsException shown in the above post is due to document size. I’ve tried a bigger doc than the one which fails and it’s OK. It seems it is just certain Word docs which fail.

An example is ‘Thinking In Java 2 (word version)’, the famous Java tome by Bruce Eckel, available as a free download from:

http://www.mindviewinc.com/

Any suggestions / reasons / ideas appreciated,

Thomas

Konstantin · October 5, 2006, 9:04pm

Hi, Thomas,
I couldn’t reproduce your error. I tried a 13 MB Word file, which produced a 10 MB text file, and it did not fail. But I’m using relatively large -Xms and -Xmx values for my JVM.
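If the test is launched from Eclipse (the stack trace above goes through the Eclipse JUnit RemoteTestRunner), heap options given on a separate command line would not reach that JVM; they would need to be set in the run configuration's VM arguments. A minimal sketch for checking which limit is actually in effect is below. The class name HeapCheck is only illustrative and is not part of Aspose.Words.

```java
final class HeapCheck {
    /** Maximum heap the current JVM will attempt to use, in whole megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run this with the same launcher and options as the failing test,
        // for example: java -Xms256m -Xmx1024m HeapCheck
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Calling maxHeapMegabytes() from inside the failing test itself shows whether the -Xmx value was picked up by the JVM that runs the extraction.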
In addition to doc.getText() you can use doc.save("filename.txt", SaveFormat.FORMAT_TEXT);.
A Java port of TxtWriter is available at https://forum.aspose.com/t/123619 . By extending TxtWriter you can get a more flexible tool, for instance one that filters out all page headers and footers or preserves only Normal-style text.
Can you attach the file that produces the error?
Best Regards,

thomasthomas1999 · October 6, 2006, 1:29am

Hi Konstantin,

I’ll try your suggestions, thanks. I think I may have tried something similar already following the first reply. I’ll try again with the source link. One document that fails for me with StringIndexOutOfBoundsException is the zipped Word doc found here:

http://www.mindviewinc.com/downloads/TIJ2-Word.zip

Thanks for your help,
Thomas Gascoigne (Axxia Systems)

thomasthomas1999 · October 6, 2006, 1:45am

Yep, just tried the TxtWriter code from the link with same results. A 5MB doc worked fine but then TIJ2.doc failed with the following:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt(String.java:558)
at com.aspose.words.Field.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.cv.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.axxia.aspose.AsposeTextExtraction.testTxtWriter(AsposeTextExtraction.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:478)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:344)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

Any ideas?

Thanks, Thomas

Konstantin · October 6, 2006, 6:19am

Hi, Thomas,
Thanks for your file. It contains invalid field codes that are dead and empty. Perhaps this file was created outside of Word or was somehow modified later. In the next release we will correct this behavior so that Aspose.Words just ignores such field codes.
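Until a fixed build is available, one way to keep a single malformed file from halting an indexing run is to isolate failures per document. The following is only a sketch under stated assumptions: TextExtractor and BatchExtraction are hypothetical names, and in real use the extractor would wrap the new Document(path).getText() call shown at the start of this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical wrapper around whatever library call performs the extraction. */
interface TextExtractor {
    String extract(String path) throws Exception;
}

final class BatchExtraction {
    /** Text of each file that parsed, plus the paths of the files that did not. */
    static final class Result {
        final Map<String, String> texts = new LinkedHashMap<>();
        final List<String> failed = new ArrayList<>();
    }

    static Result extractAll(List<String> paths, TextExtractor extractor) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.texts.put(path, extractor.extract(path));
            } catch (Exception e) {
                // Covers a StringIndexOutOfBoundsException from a malformed file:
                // record the path and carry on with the next document.
                result.failed.add(path);
            }
        }
        return result;
    }
}
```

The failed list can then be logged or retried once a build containing the fix is installed. Note that an OutOfMemoryError is an Error rather than an Exception, so it is deliberately not caught here.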
Regards,

thomasthomas1999 · October 6, 2006, 7:46am

Cheers, that was good to know. That document has been cursing me for several months now!

What would the timing be of this fix reaching a release?

Thanks, Thomas

Konstantin · October 6, 2006, 8:06am

Our current estimate for the next Java release is the end of October. I think the fix will be included.
Regards,

thomasthomas1999 · November 15, 2006, 3:04am

Hi Chaps,

I’ve just downloaded the latest evaluation version of Aspose Words for Java. The problem outlined in this thread still remains. I don’t know if this evaluation release has changed from the last one I tried; the jars have the same names but two are larger. The tool does what we want apart from this one known problem, and we are looking to purchase it straight away if we are confident that the fix will be in. It may be that the fix is in the version for purchase. Any estimates on timescales would be appreciated.

Thanks,
Thomas Gascoigne (Axxia Systems)

miklovan · November 15, 2006, 9:37am

Hi Thomas,
As far as I know, the fix is not in the latest version yet. I will check when a version with the fix can be delivered and let you know as soon as possible.
Also, please note that after you purchase you will have the right to free upgrades for one year. That means that your license will be valid for all versions published within one year from the date of purchase.
Best regards,

romank · November 15, 2006, 11:33am

We will produce a fix for you today. Sorry the issue did not make it into the 2.0 release.

romank · November 15, 2006, 5:09pm

The document contains invalid XE fields that have no field codes. Microsoft Word seems to just ignore them. We changed our code to do the same.
Try using this build attached here.
I managed to open, save and get text from this document successfully.
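As an illustration of the failure mode only (this is not the Aspose.Words source, and FieldCodeGuard is a made-up name), the trace earlier in the thread ends in String.charAt with index 0, which is what happens when code reads the first character of an empty string. A sketch of the unguarded read next to a guarded one that treats an empty field code as ignorable:

```java
final class FieldCodeGuard {
    /** Returns the first character of a field code, or -1 when the code is null or empty. */
    static int firstCharOrMinusOne(String fieldCode) {
        if (fieldCode == null || fieldCode.isEmpty()) {
            return -1; // nothing to parse: the caller can skip this field
        }
        return fieldCode.charAt(0);
    }

    /** Unguarded version, shown only to reproduce the exception type from the trace. */
    static char firstCharUnguarded(String fieldCode) {
        return fieldCode.charAt(0); // throws StringIndexOutOfBoundsException on ""
    }
}
```

The exact wording of the exception message may differ between Java versions, but the exception type is the same.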

thomasthomas1999 · November 16, 2006, 3:17am

Thanks a lot.

Yes, swapping the jar gets it working. Will this fix make it into a release that we can purchase within the next two weeks? That would be very helpful.

Thanks,
Thomas Gascoigne (Axxia Systems)

miklovan · November 16, 2006, 4:03am

Yes, of course. This fix will be included in the next version, which will be published within the next two weeks.
Best regards,

thomasthomas1999 · December 1, 2006, 3:17am

Hi Chaps,

I took a look at the Aspose Words download page and the Java version is still 2.0.0.0. I guess this means that the fix you sent has not yet made it into a release. The fix works fine but the project manager here would like it to be contained as part of a proper release. It would be good if we could purchase this product now if the fix is contained, is this possible?

Thanks again,
Thomas Gascoigne (Axxia Systems)

Konstantin · December 1, 2006, 4:00am

Hi, Thomas,
Your project manager worries for nothing; the next Aspose.Words for Java release (containing the fix along with other things) will be released within a few days.
Best Regards,

thomasthomas1999 · December 8, 2006, 5:00am

Hi Guys,

Still waiting for an updated Java version of Aspose Words to appear on the downloads site. I assume that new releases will be posted here?

Another thing, I just noticed that the current evaluation version that I’m using contains this jar:
jaxen-1.1-beta-8.jar

Is it necessary to use this beta jar if all we want to do is text extraction from a PDF? I don’t know if this jar is included in a release of the product, but we need to avoid using beta tools in our product.

Thanks again,
Thomas Gascoigne (Axxia Systems)

DmitryV · December 8, 2006, 6:05am

Hi Thomas,
The hotfix for Aspose.Words Java is almost out; please keep an eye on our blogs. Also, we will notify you personally here in this thread.
As to jaxen-1.1-beta-8.jar: you can get rid of it unless you select nodes using XPath expressions in your project.

DmitryV · December 8, 2006, 6:25am

A little addition to my previous post regarding the “beta”. Here’s a quotation from the developer’s site:
The current version is 1.1 beta 11. 1.1 is a major upgrade that significantly improves jaxen’s conformance to the underlying XPath specs. Even though it’s still officially a beta, this release is a vast improvement over 1.0, and all users are strongly encouraged to upgrade.
http://www.jaxen.org/releases.html