Free Support Forum - aspose.com

Extracting from Word/doc files

I’m having trouble finding sample code to do the following. The code I
did find in this forum did not work: some of the methods and properties
it referred to did not exist, so I wonder whether it was written for
a different version of the library.


  1. Save embedded attachments (OLE objects) as separate files
  2. Create a TIFF (or PNG) image of a document
  3. Save (semi-formatted) text of a document



My project is in Java, so I’d rather use the Java version, but from
looking at the documentation, the Java libraries are lagging behind, so
I may have to use the .NET versions. Please advise…

Q1. What is the timeframe before the Java libraries catch up?

Q2. Are the .NET versions more active? Are they always more up to date, with the latest bug fixes?


Hi

Thanks for your request.

1. Could you please attach the document from which you need to extract OLE objects? I will check it on my side and provide more information.

2. Aspose.Words for Java does not support converting documents to images yet. This feature will be available at the very end of this year or at the beginning of next year. I will notify you as soon as it is available.

3. It is not quite clear what you mean. Could you please be more specific? What document format are you interested in?

We are currently working on synchronizing the .NET and Java versions of Aspose.Words. We expect to finish this work sometime at the beginning of next year.

Best regards.

Hi Alexey

1. You can use any Word document in which a file or an Excel workbook has been inserted and is shown as an icon. (I have attached a document as well.)

2. Would most of the functionality be added to the Java version by, say, March 2010?

3. This is just for “extract the text of the document” purposes, so we can index the text. It doesn’t have to be heavily formatted, but the paragraphs should be in order. Headers and footers could be added whenever a new header or footer is set in the document (not on each page, since the extracted text has no concept of a page), and the text of WordArt and text box items should be extracted too.

I’m looking for sample code to do these things, so I can evaluate the library. If you have code, or have references in the manual where I can find it, please send it to me.

Regards

Hi

Thank you for the additional information. Here is simple code that shows how to extract OLE objects from a document:

// Requires: import com.aspose.words.*;

// Open document.
Document doc = new Document("C:\\Temp\\in.doc");

// Get all shapes, including those nested deep in the document tree.
NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);

// Loop through all shapes.
for (int i = 0; i < shapes.getCount(); i++)
{
    Shape shape = (Shape)shapes.get(i);

    // Skip shapes that do not contain an OLE object.
    if (shape.getOleFormat() == null)
        continue;

    // Determine the extension of the object from its ProgId.
    // Use the "bin" extension by default.
    String extension = "bin";
    String progId = shape.getOleFormat().getProgId();
    if ("Word.Document.8".equals(progId))
        extension = "doc";
    if ("Excel.Sheet.8".equals(progId))
        extension = "xls";

    // Save OLE object.
    shape.getOleFormat().save(String.format("C:\\Temp\\out_%d.%s", i, extension));
}
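
If more embedded object types need to be recognized, the ProgId checks in the loop above could be collected into a small lookup table. The following is only a sketch in plain Java with no Aspose dependency. The class name OleExtensions and the method extensionFor are hypothetical, and every ProgId beyond the two used above ("Word.Document.8" and "Excel.Sheet.8") is an assumption about what your documents may contain.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: maps an OLE ProgId to a file extension. */
public final class OleExtensions {

    // Fallback extension, same default as in the loop above.
    private static final String DEFAULT_EXTENSION = "bin";

    private static final Map<String, String> EXTENSIONS = new HashMap<>();

    static {
        // The two ProgIds checked in the original sample.
        EXTENSIONS.put("Word.Document.8", "doc");
        EXTENSIONS.put("Excel.Sheet.8", "xls");
        // Assumed additions for other Office formats; adjust to your documents.
        EXTENSIONS.put("Word.Document.12", "docx");
        EXTENSIONS.put("Excel.Sheet.12", "xlsx");
        EXTENSIONS.put("PowerPoint.Show.8", "ppt");
        EXTENSIONS.put("PowerPoint.Show.12", "pptx");
    }

    private OleExtensions() {
        // Static helper, not meant to be instantiated.
    }

    /** Returns the extension for a ProgId, or "bin" if it is null or unknown. */
    public static String extensionFor(String progId) {
        if (progId == null) {
            return DEFAULT_EXTENSION;
        }
        return EXTENSIONS.getOrDefault(progId, DEFAULT_EXTENSION);
    }
}
```

Inside the loop, the extension could then be obtained with OleExtensions.extensionFor(shape.getOleFormat().getProgId()) before calling save.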

Regarding Excel objects, I managed to reproduce the problem (output Excel files cannot be opened in Excel). Your request has been linked to the appropriate issue. You will be notified as soon as it is resolved.

2. I think most of the functionality will be added before February 2010.

3. Please see the following link to learn how to extract text from documents:

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/howto-extract-text-only.html

Also, you can create your own converter using Aspose.Words. The technique is described here:

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/com/aspose/words/documentvisitor.html

Hope this helps.

Best regards.

The issues you have found earlier (filed as 4838) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.

The issues you have found earlier (filed as WORDSJAVA-20) have been fixed in this .NET update and in this Java update.

