Can any OLE embedded file be extracted from (.doc(x) / .rtf)?

Hey,


I have written a sample application that should extract all OLE embedded documents from any of the file types that Aspose.Words For Java supports. Once extracted, the embedded object is stored in a file on disk. I have tested this scenario up through version 14.2.0.

Some issues I’ve noticed with the files that are extracted:
  1. The extension of many types of files is lost. Is this a limitation?
  2. Some types of files can’t be opened. The application complains that they are corrupted, even if the file extension is corrected to what it should be.
Is this a limitation?


Thank you for any help!

Hi Chase,

Thanks for your inquiry. Could you please attach your input Docx/RTF along with your code here for testing? I will investigate the issue on my side and provide you with more information.

I’ve done some additional testing and swapped in version 14.7.0 of Aspose.Words for Java. It appears that some of the issues I was seeing have been resolved; however, there are still some oddities, which I list below:


  • The biggest issue: vsdx file is corrupted. It looks like the compound binary file contains an OOXML file called Package. If I extract the Package and append the “.vsdx” extension, the file will open correctly, though it is a hassle. This seems to be similar to an issue I noticed with “.xlsm” files in version 13.5.0.
  • The rtf file was converted into what appears to be the “.doc” format. It does open okay and has the correct contents. Is there a reason this is converted to a “.doc”?
  • For the dot (Microsoft Word Document Template), the suggested extension is “.doc”. The file appears to open in Word regardless of whether it is set as “.doc” or “.dot”.
  • For the vsdx file, the suggested extension is ".bin"
Thank you for any help with the issues above!
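
A quick way to check whether an extracted file is still wrapped in the compound binary container, rather than being the bare OOXML package, is to look at its first bytes. The sketch below is only an illustration and is not part of the attached demo; `ContainerSniffer` and `sniff` are hypothetical names. It relies on the well-known file signatures: OLE2 compound files start with D0 CF 11 E0 A1 B1 1A E1, and ZIP-based OOXML files start with "PK" 03 04.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Aspose.Words or the attached demo.
final class ContainerSniffer {
    // First 8 bytes of an OLE2 compound binary file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature at the start of a ZIP (and so OOXML) file.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Returns "ole2", "zip" or "unknown" based on the leading bytes only.
    static String sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return "ole2";
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return "zip";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Usage: pass the paths of the extracted files as arguments.
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            byte[] data = Files.readAllBytes(Paths.get(path));
            System.out.println(path + ": " + sniff(data));
        }
    }
}
```

If a file that should be a “.vsdx” reports ole2, it is presumably still the wrapper, and the inner Package stream has to be pulled out before the file will open.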


I have attached the following:
  • AsposeForumDemo.zip -> Contains the example application that will extract the OLE objects from every file in the input directory and save them in the output directory. The extension should be the suggested extension. You will need to fix the classpath to point to the Aspose library since I did not include it.
  • OriginalEmbedObjects -> This contains the original files that I embedded in RTF that is part of the demo.


Hi Chase,

Thanks for sharing the detail.

apatter:

The biggest issue: vsdx file is corrupted. It looks like the compound binary file contains an OOXML file called Package. If I extract the Package and append the “.vsdx” extension, the file will open correctly, though it is a hassle.

I have tested the scenario and have managed to reproduce the same issue at my side. For the sake of correction, I have logged this problem in our issue tracking system as WORDSNET-10589. I have linked this forum thread to the same issue and you will be notified via this forum thread once this issue is resolved.
apatter:

This seems to be similar to an issue I noticed with “.xlsm” files in version 13.5.0.

I have not found this issue using the shared code. I have attached the output xlsm file with this post for your kind reference.
apatter:

The rtf file was converted into what appears to be the “.doc” format. It does open okay and has the correct contents. Is there a reason this is converted to a “.doc”?

The rtf.rtf document has embedded Doc contents. Please check this embedded object using MS Word.
apatter:

For the vsdx file, the suggested extension is ".bin"

I have tested the scenario and have managed to reproduce the same issue at my side. For the sake of correction, I have logged this problem in our issue tracking system as WORDSNET-10590.
apatter:

For the dot (Microsoft Word Document Template), the suggested extension is “.doc”. The file appears to open in Word regardless of whether it is set as “.doc” or “.dot”.

I have tested the scenario and have managed to reproduce the same issue at my side. For the sake of correction, I have logged this problem in our issue tracking system as WORDSNET-10591.

We apologize for your inconvenience.

Hi Chase,

apatter:

For the dot (Microsoft Word Document Template), the suggested extension is “.doc”. The file appears to open in Word regardless of whether it is set as “.doc” or “.dot”.

Regarding this query, as per my understanding, the embedded object in dot.rtf is in Dot file format, while Aspose.Words returns Doc file format. We logged this issue as WORDSNET-10591. Please confirm if you are facing the same issue.

Hey Tahir,


Thank you for the detailed response! Unfortunately, I am unable to determine whether the actual format of the file returned by Aspose is “dot” or “doc” (or if there is any major difference between these two file formats). Regardless of what extension is set, the file seems to open fine in MS Word. What I can say for sure is that the suggested extension is “doc” when I was expecting “dot”. I see that in the output, the CLSID and ProgID are equal for OLE objects containing dot and doc files. I doubt that the Aspose library is transforming the actual file format.
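
To illustrate the ProgID point, here is a minimal sketch of an extension lookup keyed on ProgID. It is not how Aspose.Words computes its suggested extension (I do not know that), and `ProgIdExtensions` is a hypothetical name. The ProgID strings are the ones commonly registered by Office and are assumptions here, not values copied from the attached files.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table, not an Aspose.Words API.
final class ProgIdExtensions {
    // Keys are lower-cased ProgIDs; the values are assumed default extensions.
    private static final Map<String, String> BY_PROG_ID = new HashMap<>();
    static {
        BY_PROG_ID.put("word.document.8", "doc");
        BY_PROG_ID.put("word.document.12", "docx");
        BY_PROG_ID.put("word.documentmacroenabled.12", "docm");
        BY_PROG_ID.put("excel.sheet.8", "xls");
        BY_PROG_ID.put("excel.sheet.12", "xlsx");
        BY_PROG_ID.put("excel.sheetmacroenabled.12", "xlsm");
        BY_PROG_ID.put("visio.drawing.15", "vsdx");
    }

    // Returns the mapped extension, or the fallback when the ProgID is
    // missing or not in the table.
    static String extensionFor(String progId, String fallback) {
        if (progId == null) {
            return fallback;
        }
        String ext = BY_PROG_ID.get(progId.trim().toLowerCase(Locale.ROOT));
        return ext != null ? ext : fallback;
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("Word.Document.8", "bin"));
        System.out.println(extensionFor("Some.Unknown.1", "bin"));
    }
}
```

Because the dot and doc objects carry the same ProgID in my output, a lookup like this can only ever return one extension for both, which would be consistent with “doc” being suggested for the template.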

For the RTF case that I mentioned in a previous post, it seems that when being embedded, there is a transformation into the “doc” format.

Also, I decided to take a look at extracting Word macro-enabled documents (docm) from OLEs. The result was that MS Word complained when I attempted to open the extracted file. Word offered to correct the file and did this correctly, in this case at least. I suspect this is the same issue as with the vsdx file format.
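
If the docm case really is the same wrapping issue, then once the inner OOXML package has been pulled out, its type can be read from the `[Content_Types].xml` part instead of guessing the extension. The following is a rough sketch under that assumption; `OoxmlTypeGuesser` is a hypothetical name, and it does a naive substring search rather than real XML parsing. The content-type strings are the standard main-part types for each format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Aspose.Words or the attached demo.
final class OoxmlTypeGuesser {
    // Main-part content type -> extension.
    private static final Map<String, String> MAIN_PART_TYPES = new LinkedHashMap<>();
    static {
        MAIN_PART_TYPES.put("application/vnd.ms-visio.drawing.main+xml", "vsdx");
        MAIN_PART_TYPES.put("application/vnd.ms-word.document.macroEnabled.main+xml", "docm");
        MAIN_PART_TYPES.put("application/vnd.ms-excel.sheet.macroEnabled.main+xml", "xlsm");
        MAIN_PART_TYPES.put(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml", "docx");
        MAIN_PART_TYPES.put(
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml", "xlsx");
    }

    // Returns an extension for the package, or null if the content types
    // part is missing or lists none of the known main-part types.
    static String guessExtension(byte[] zipBytes) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"[Content_Types].xml".equals(entry.getName())) {
                    continue;
                }
                String xml = new String(readAll(zip), StandardCharsets.UTF_8);
                for (Map.Entry<String, String> type : MAIN_PART_TYPES.entrySet()) {
                    if (xml.contains(type.getKey())) {
                        return type.getValue();
                    }
                }
                return null;
            }
        }
        return null;
    }

    // Reads the current ZIP entry to its end (Java 8 compatible).
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Calling `OoxmlTypeGuesser.guessExtension(bytes)` on the extracted Package bytes should then give “vsdx” or “docm” for the cases above, without relying on the suggested extension.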

I am attaching more example documents that can be used with the example application. Please let me know if there are additional questions.

Hi Chase,

Thanks for sharing the detail.

apatter:

Also, I decided to take a look at extracting word macro enabled documents (docm) from OLEs. The result was that MS word complained when I attempted to open the extracted file. Word offered to correct the file and did this correctly in this case at least. I suspect this is the same issue as with the vsdx file format.

I have tested the scenario and have managed to reproduce the same issue at my side. For the sake of correction, I have logged this problem in our issue tracking system as WORDSNET-10601.

We apologize for your inconvenience.

The issues you have found earlier (filed as WORDSNET-10589, WORDSNET-10590, WORDSNET-10601) have been fixed in this .NET update and this Java update.

