This issue is related to two earlier threads on the Aspose Free Support Forum: "Can’t find embedded OLE object in DOC file" and "Reading RTF File Containing an OLE Embedded Object".
I am analyzing a DOC file that contains malware. I extracted an embedded RTF document from the DOC file using OpenXML, as described in the first link above. The RTF file contains 2 OLE objects, but I cannot find them using Aspose.Words.
I have attached the INFECTED file in a ZIP archive. The password for the ZIP is ‘infected’. Please be careful when opening this file, since it contains active malware:
INFECTED.zip (8.4 KB)
What I did:
- I used ‘rtfobj’ from the ‘oletools’ package and received the following output:
C:\>rtfobj INFECTED.RTF
rtfobj 0.60.1 on Python 3.11.4 - http://decalage.info/python/oletools
THIS IS WORK IN PROGRESS - Check updates regularly!
Please report any issue at https://github.com/decalage2/oletools/issues
===============================================================================
File: 'INFECTED.RTF' - size: 44146 bytes
---+----------+---------------------------------------------------------------
id |index |OLE Object
---+----------+---------------------------------------------------------------
0 |000026DEh |Not a well-formed OLE object
---+----------+---------------------------------------------------------------
1 |00003E68h |format_id: 2 (Embedded)
| |class name: b'OLE2LINK'
| |data size: 3584
| |MD5 = 'ed315c3b36a83206dfd1bba013b91575'
| |CLSID: 88D96A0C-F192-11D4-A65F-0040963251E5
| |SAX XML Reader 6.0 (msxml6.dll)
---+----------+---------------------------------------------------------------
- I tried using code similar to the code in "Reading RTF File Containing an OLE Embedded Object", but no ‘Shape’ objects are found in the document.
- Running the following code produces a file from which the first of the two OLE objects has been removed:
Aspose.Words.Loading.LoadOptions options = new Aspose.Words.Loading.LoadOptions();
var doc = new Document("INFECTED.RTF", options);
doc.Save("OUTPUT.RTF", FileFormatUtil.ExtensionToSaveFormat(".rtf"));
The output file is considerably smaller than the input file:
12/26/2023 11:28 AM 44,146 INFECTED.RTF
12/26/2023 11:15 AM 28,125 OUTPUT.RTF
Also, running ‘rtfobj’ on the output file shows the ‘malformed ole object’ was removed:
C:\> rtfobj "OUTPUT.RTF"
rtfobj 0.60.1 on Python 3.11.4 - http://decalage.info/python/oletools
THIS IS WORK IN PROGRESS - Check updates regularly!
Please report any issue at https://github.com/decalage2/oletools/issues
===============================================================================
File: 'OUTPUT.RTF' - size: 28125 bytes
---+----------+---------------------------------------------------------------
id |index |OLE Object
---+----------+---------------------------------------------------------------
0 |00000F84h |format_id: 2 (Embedded)
| |class name: b'xmlfile'
| |data size: 3584
| |MD5 = '49c878c5452811ae4d8113151b9b277f'
| |CLSID: 88D96A0C-F192-11D4-A65F-0040963251E5
| |SAX XML Reader 6.0 (msxml6.dll)
---+----------+---------------------------------------------------------------
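The before/after comparison above can also be roughly cross-checked without oletools by counting the standard RTF control words `\object` and `\objdata` in the raw file. This is only a minimal sketch: the class and method names are hypothetical, and it assumes the control words are not obfuscated, which malicious RTF files often violate, so it is not a substitute for rtfobj.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class RtfObjectScan
{
    // Counts occurrences of one RTF control word, e.g. "objdata".
    // The negative lookahead ensures only the whole control word is matched.
    // Caveat: an escaped backslash followed by the same letters would be
    // counted as well; this sketch does not tokenize the RTF properly.
    public static int CountControlWord(string rtf, string word)
    {
        string pattern = @"\\" + Regex.Escape(word) + @"(?![a-zA-Z])";
        return Regex.Matches(rtf, pattern).Count;
    }

    public static void Main(string[] args)
    {
        // Pass the input and output files as arguments, e.g. INFECTED.RTF OUTPUT.RTF.
        foreach (string path in args)
        {
            // Control words are plain ASCII, so ASCII decoding is enough for counting.
            string rtf = File.ReadAllText(path, Encoding.ASCII);
            Console.WriteLine(
                $"{path}: \\object={CountControlWord(rtf, "object")}, " +
                $"\\objdata={CountControlWord(rtf, "objdata")}");
        }
    }
}
```

If the counts differ between the input and the re-saved file, that is consistent with an object having been dropped on save, but only rtfobj (or a real RTF parser) can say which one.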
@Buffer2018 It looks like Aspose.Words simply ignores the malformed OLE objects in the document. We will further investigate the issue and check whether the current Aspose.Words behavior is intended.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): WORDSNET-26414
Thank you for your fast reply.
However, I still don’t understand whether it is possible to identify and remove the OLE object that is present in the file, is not malformed, and is actually written to the RTF file by Document.Save().
To illustrate, the following code:
var doc = new Document("infected.rtf");
var nodes = doc.GetChildNodes(NodeType.Any, true);
foreach (Node node in nodes)
{
    Console.WriteLine($"got node {node}, type={node.NodeType}");
}
Produces the following output:
got node Aspose.Words.Section, type=Section
got node Aspose.Words.Body, type=Body
got node Aspose.Words.Paragraph, type=Paragraph
got node Aspose.Words.Run, type=Run
got node Aspose.Words.Paragraph, type=Paragraph
got node Aspose.Words.Paragraph, type=Paragraph
got node Aspose.Words.Fields.FieldStart, type=FieldStart
got node Aspose.Words.Run, type=Run
got node Aspose.Words.Fields.FieldSeparator, type=FieldSeparator
got node Aspose.Words.Run, type=Run
got node Aspose.Words.Paragraph, type=Paragraph
got node Aspose.Words.Fields.FieldEnd, type=FieldEnd
got node Aspose.Words.Fields.FieldStart, type=FieldStart
got node Aspose.Words.Run, type=Run
got node Aspose.Words.Fields.FieldSeparator, type=FieldSeparator
got node Aspose.Words.Run, type=Run
got node Aspose.Words.Paragraph, type=Paragraph
got node Aspose.Words.Fields.FieldEnd, type=FieldEnd
@Buffer2018 OLE objects in Aspose.Words DOM are represented as shapes with OLE format. So you can use the following code to remove all OLE objects in the document:
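using System.Linq;
using Aspose.Words;
using Aspose.Words.Drawing;

Document doc = new Document(@"C:\Temp\in.docx");
// Get all shapes that carry OLE data and remove them.
// ToList() takes a snapshot so that removing nodes does not disturb the enumeration.
doc.GetChildNodes(NodeType.Shape, true).Cast<Shape>()
    .Where(s => s.OleFormat != null).ToList()
    .ForEach(s => s.Remove());
doc.Save(@"C:\Temp\out.docx");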
But in your particular case Aspose.Words does not read the OLE objects from your document, probably because they are invalid from the Aspose.Words point of view.
Strange, because the embedded ole object is saved by Aspose to the output file.
Will this be researched as part of WORDSNET-26414 as well?
Thanks!
Uriel@Bufferzone
@Buffer2018 Yes, we will investigate which OLE objects are not detected by Aspose.Words in your document.
@Buffer2018 We have completed analyzing the issue.
The first OLE object of the source RTF document is an OLE autolink to a Word.Document.8 document that is also embedded. The object type stored in the OLE object header is Linked.
The second OLE object is an OLE autolink to xmlfile. The object type stored in the OLE object header is Embedded.
Both OLE objects are converted to LINK fields when saving to DOCX format.
We have changed the links in the source document to refer to local files, and have replaced the embedded Word.Document.8 document with one taken from another file (attached fixed.rtf).
fixed.zip (10.1 KB)
When resaving the fixed.rtf document in MS Word to DOCX format, MS Word converts the OLE objects to LINK fields, as Aspose.Words does. But the second link is different from the one in Aspose.Words. Aspose.Words takes the link from the \x0003LinkInfo stream, but MS Word takes it from the \x0001Ole stream.
It is not clear whether these two links can legitimately differ in “usual” OLE objects, or whether this is a trick of the malware: \x0003LinkInfo contains a harmless link, while the \x0001Ole stream inside the OLE object contains the harmful link, which is the one MS Word actually uses.
It looks like Aspose.Words currently reads and parses only the header part of the \x0001Ole stream. At least, the data where the harmful link is located is not read or parsed by Aspose.Words.
It looks correct that the OLE objects are replaced with LINK fields by Aspose.Words.
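Since the OLE objects end up as fields rather than shapes, they can be inspected through the field collection instead of `NodeType.Shape`. The following is a minimal sketch, assuming the converted objects surface with `FieldType.FieldLink`; this thread does not confirm the field type, and the file names are placeholders. Whether removing the field also drops the embedded data when saving back to RTF has not been verified here.

```csharp
using System;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Fields;

class ListLinkFields
{
    static void Main()
    {
        // Placeholder path; point it at the RTF under analysis.
        Document doc = new Document("INFECTED.RTF");

        // Print every field with its type and field code so the link targets can be reviewed.
        foreach (Field field in doc.Range.Fields)
            Console.WriteLine($"{field.Type}: {field.GetFieldCode().Trim()}");

        // Assumption: the converted OLE links have the type FieldType.FieldLink.
        // Snapshot the matches first, because removing a field changes the collection.
        var linkFields = doc.Range.Fields.Cast<Field>()
            .Where(f => f.Type == FieldType.FieldLink)
            .ToList();

        foreach (Field field in linkFields)
            field.Remove();

        Console.WriteLine($"Removed {linkFields.Count} LINK field(s).");

        // Placeholder output path.
        doc.Save("OUTPUT.docx", SaveFormat.Docx);
    }
}
```

Note that the field code printed here comes from what Aspose.Words parsed, so per the analysis above it may show the link from \x0003LinkInfo rather than the one MS Word would follow.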
We can fix the discrepancy so that the link matches the one MS Word uses. Another difference from MS Word is that, when saving to RTF format, MS Word preserves the OLE objects.
Could you please describe the behavior you expect from Aspose.Words in this case?