I have recently returned to the Document Management System part of our product, which requires text extraction from MS Word documents to enable their indexing for search. I was trying to evaluate some products that do this as an alternative to the open source POI project, which seems to work but is in alpha status. Aspose was one that we were considering. The API for what we want seems very simple, which is nice. I’m calling the following code to extract text:
Document doc = new Document(INPUT_FILE);
String text = doc.getText();
This seems to work fine until the Word documents get larger (4MB is a problem). Is there a known limit to the size of documents that can be handled in this way, or is this a feature of the evaluation version? The error is an out-of-memory error; I have tried increasing the memory allocation on the command line with no effect.
Some large documents don’t run out of memory but fail with:
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt(String.java:558)
at com.aspose.words.Field.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.cv.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.axxia.aspose.AsposeTextExtraction.testWordDocumentExtraction(AsposeTextExtraction.java:29)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:478)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:344)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
I think there is a limitation on string size in Java. Try saving the document to stream instead, using Document.save method with SaveFormat.FORMAT_TEXT as a parameter.
Best regards,
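A minimal sketch of the save-to-stream idea above, for anyone who wants to hand the text to an indexer without ever building one large String. The thread does not show the exact signature of the stream-based save call, so it is left abstract here: TextSaver and ChunkSink are hypothetical interfaces invented for this sketch, and the UTF-8 encoding of the saved text is an assumption. Note that this only avoids holding the extracted text in memory as a single String; the loaded document itself still occupies heap.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamedExtraction {
    /** Hypothetical stand-in for the library call that writes plain text to a stream. */
    interface TextSaver {
        void saveAsText(OutputStream out) throws IOException;
    }

    /** Hypothetical indexer callback that receives the text one chunk at a time. */
    interface ChunkSink {
        void accept(char[] buffer, int length);
    }

    /**
     * Saves the text to a temporary file, then reads it back in fixed-size
     * chunks. Returns the total number of characters passed to the sink.
     */
    static long extractInChunks(TextSaver saver, ChunkSink sink, int chunkSize)
            throws IOException {
        Path temp = Files.createTempFile("extracted", ".txt");
        try {
            // Step 1: let the library write straight to disk.
            try (OutputStream out = Files.newOutputStream(temp)) {
                saver.saveAsText(out);
            }
            // Step 2: read the text back a chunk at a time (assumes UTF-8 output).
            long total = 0;
            char[] buffer = new char[chunkSize];
            try (Reader reader = Files.newBufferedReader(temp, StandardCharsets.UTF_8)) {
                int read;
                while ((read = reader.read(buffer)) != -1) {
                    sink.accept(buffer, read);
                    total += read;
                }
            }
            return total;
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

In real use the TextSaver would wrap the Document.save call with SaveFormat.FORMAT_TEXT, and the ChunkSink would feed the indexer.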
I’m now testing with the code that you suggested, thanks. I still get the StringIndexOutOfBoundsException with certain documents.
On further testing I don’t think the StringIndexOutOfBoundsException shown in the above post is due to document size. I’ve tried a bigger doc than the one which fails and it’s OK. It seems it is just certain Word docs which fail.
An example is ‘Thinking In Java 2 (word version)’, the famous Java tome by Bruce Eckel, available as a free download from:
Hi, Thomas,
I couldn’t reproduce your error. I tried a 13MB Word file, which produced a 10MB text file without failing. But I’m using relatively large -Xms and -Xmx values for my JVM.
In addition to doc.getText() you can use doc.save("filename.txt", SaveFormat.FORMAT_TEXT);
A Java port of TxtWriter is available at https://forum.aspose.com/t/123619. By extending TxtWriter you can get a more flexible tool, for instance one that filters out all page headers and footers or preserves only Normal-style text.
Can you attach the file that produces the error?
Best Regards,
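On the -Xms/-Xmx point: one quick way to confirm that a heap setting actually reached the JVM running the test is to print what the runtime reports. This is only a sketch using the standard Runtime API; the class name HeapCheck is made up for illustration. If the test is launched from Eclipse (the stack trace above shows RemoteTestRunner), flags given on a separate command line may not apply to that JVM, and the value would instead need to go into the launch configuration’s VM arguments.

```java
// Prints the heap limits the running JVM actually received.
class HeapCheck {
    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx). */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently reserved by the JVM, in megabytes (starts near -Xms). */
    static long committedHeapMegabytes() {
        return Runtime.getRuntime().totalMemory() / MB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB):       " + maxHeapMegabytes());
        System.out.println("committed heap (MB): " + committedHeapMegabytes());
    }
}
```

Calling HeapCheck.maxHeapMegabytes() at the start of the failing JUnit test and comparing it with the intended -Xmx value shows whether the increase took effect.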
I’ll try your suggestions, thanks. I think I may have tried something similar already following the first reply. I’ll try again with the source link. One document that fails for me with StringIndexOutOfBoundsException is the zipped Word doc found here:
Yep, just tried the TxtWriter code from the link with same results. A 5MB doc worked fine but then TIJ2.doc failed with the following:
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt(String.java:558)
at com.aspose.words.Field.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.b(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.ez.a(Unknown Source)
at com.aspose.words.cv.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.a(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.axxia.aspose.AsposeTextExtraction.testTxtWriter(AsposeTextExtraction.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:478)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:344)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Hi, Thomas,
Thanks for your file. It contains invalid dead and empty field codes. Perhaps this file was created outside Word or was somehow edited later. In the next release we will correct this behavior so that Aspose.Words will just ignore such field codes.
Regards,
I’ve just downloaded the latest evaluation version of Aspose.Words for Java. The problem outlined in this thread still remains. I don’t know if this evaluation release has changed from the last one I tried; the jars have the same names but two are larger. The tool does what we want apart from this one known problem, and we are looking to purchase it straight away if we are confident that the fix will be in. It may be that the fix is in the version for purchase. Any estimates on timescales would be appreciated.
Hi Thomas,
As far as I know the fix is not in the latest version yet. I will check when a version with the fix can be delivered and let you know as soon as possible.
Also, please note that after you purchase you will have the right to free upgrades for one year. That means that your license will be valid for all versions published within one year from the date of purchase.
Best regards,
The document contains invalid XE fields that have no field codes. Microsoft Word seems to just ignore them. We changed our code to do the same.
Try using this build attached here.
I managed to open, save and get text from this document successfully.
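For readers hitting the same trace, here is an illustration of the failure mode only; Aspose’s internal code is not shown in this thread, so FieldCodeGuard and fieldType are hypothetical names. Calling charAt(0) on an empty field code throws StringIndexOutOfBoundsException, the exception type seen in the trace, and a guard that treats an empty code as "no field" skips it instead.

```java
import java.util.Optional;

class FieldCodeGuard {
    /**
     * Returns the first token of a field code (for example "XE"),
     * or an empty Optional when the code is null, empty or blank.
     */
    static Optional<String> fieldType(String fieldCode) {
        if (fieldCode == null) {
            return Optional.empty();
        }
        String trimmed = fieldCode.trim();
        if (trimmed.isEmpty()) {
            // An unguarded trimmed.charAt(0) here would throw
            // StringIndexOutOfBoundsException; skip the field instead.
            return Optional.empty();
        }
        // The field type is everything up to the first whitespace character.
        int end = 0;
        while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
            end++;
        }
        return Optional.of(trimmed.substring(0, end));
    }
}
```

A caller would simply ignore any field for which fieldType returns an empty Optional, which matches the behaviour described above of ignoring such fields.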
I took a look at the Aspose.Words download page and the Java version is still 2.0.0.0. I guess this means that the fix you sent has not yet made it into a release. The fix works fine, but the project manager here would like it to be part of a proper release. We would like to purchase the product now if the fix is included; is this possible?
Hi, Thomas,
Your project manager need not worry: the next Aspose.Words for Java release (containing the fix along with other things) will be out within a few days.
Best Regards,
Still waiting for an updated Java version of Aspose.Words to appear on the downloads site. I assume that new releases will be posted here?
Another thing, I just noticed that the current evaluation version that I’m using contains this jar:
jaxen-1.1-beta-8.jar
Is it necessary to use this beta jar if all we want to do is text extraction from a PDF? I don’t know if this jar is included in a release of the product, but we need to avoid using beta tools in our product.
Hi Thomas,
The hotfix for Aspose.Words for Java is almost out; please keep an eye on our blogs. We will also notify you personally here in this thread.
As for jaxen-1.1-beta-8.jar, you can get rid of it unless you select nodes using XPath expressions in your project.
A little addition to my previous post regarding the “beta”. Here’s a quotation from the developer’s site: “The current version is 1.1 beta 11. 1.1 is a major upgrade that significantly improves jaxen’s conformance to the underlying XPath specs. Even though it’s still officially a beta, this release is a vast improvement over 1.0, and all users are strongly encouraged to upgrade.” http://www.jaxen.org/releases.html