Extracting MS Word file from OLE Object (C# .NET)

Hi Yves,

I would like to share that this applies to both MS Excel and MS Word OLE objects. You may use the same approach for MS Excel OLE objects as well.

Many Thanks,

Hello,
Maybe I did not make myself clear, as this is not my native language.

In Excel I do NOT need any third-party library, because it works there!

See example:

  public void UpdateEmbeddedWords()
  {
    if (_fieldsToUpdate.Count == 0) return;
    if (string.IsNullOrEmpty(_path)) return;
    var workbookSheets = _workbook.Worksheets;
    try
    {
      foreach (var sheet in workbookSheets)
      {
        _embeddedWords = sheet.OleObjects;
        foreach (var ole in _embeddedWords)
        {
          // Specify each file format based on the oleobject format type.
          if (ole.FileFormatType != FileFormatType.Doc && ole.FileFormatType != FileFormatType.Docx) continue;

          var ms = new MemoryStream();
          ms.Write(ole.ObjectData, 0, ole.ObjectData.Length);
          var word = new Word();
          word.LoadFromStream(ms);

          foreach (var pair in _fieldsToUpdate)
          {
            word.SetFormFieldValue(pair.Key, pair.Value);
          }

          // we have to create our own preview image
          using (var renderedImage = new Bitmap(word.WordImageStream))
          {
            var bitmap = new Bitmap(ImageTrim(renderedImage));
            using (var bitmapStream = new MemoryStream())
            {
              bitmap.Save(bitmapStream, ImageFormat.Png);

              // Set OleObject’s frame size to the image size
              // ole.Height = bitmap.Height;
              // ole.Width = bitmap.Width;
              // Set OleObject’s image data to the image stream
              ole.ImageData = bitmapStream.ToArray();
            }
          }

          ole.FileFormatType = FileFormatType.Docx;
          ole.ObjectData = word.WordStream.ToArray();
          // update the preview content of the OLE object
          // ole.AutoLoad = true;
        }
      }
    }
    catch (Exception ex)
    {
      _logger.ToLog("Error in UpdateEmbeddedWords:\n" + ex, Company, "File: " + _path, Logfile, "component");
    }
  }

EDIT
The code above shows a method that gets all embedded objects in an Excel document, loads the OLE stream and uses the Word class to handle it there.
.Load() in Word does not fail, so the OLE object is fine.

I use the same approach in PowerPoint, but there it fails.


Hi Yves Rausch,

Thank you for sharing the information. I understand your point: with Aspose.Cells you don’t need a third-party tool to extract an MS Word OLE object, whereas with Aspose.Slides a third-party tool is required to extract the OLE data. I have shared this information with our product team in the associated ticket and will get back to you as soon as they provide feedback.

Many Thanks,

Any news about a .Slides fix for this?
I have been waiting for 2 years now for a chance to handle FormFields OR OLE objects OR ActiveX fields in Slides (several posts about the 3 possibilities).

@rausch,

Our product team has investigated the extraction of embedded Word or Excel files from a PowerPoint file. This is not a limitation on the Aspose.Slides end but the implementation behavior of PowerPoint. When you embed a Word or Excel file in PowerPoint directly, PowerPoint adds the file as an OLE object. You can try adding a Word file to a PowerPoint presentation and saving the presentation. You can then extract the saved presentation using WinRar or other archiving software. There will be a file “\Presentation with embedded Word.pptx\ppt\embeddings\oleObject1.bin” in the extracted presentation. If you rename it to “oleObject1.docx” (for example) and try to open it in Word, you will get an error, because this is not a valid Word document. Aspose.Words will not be able to open this embedded object either. This is not an issue with Aspose.Slides but a limitation in PowerPoint itself. We have internally added an issue with ID SLIDESNET-39130 to investigate a workaround for extracting the embedded Word OLE object, and we will share feedback with you as soon as our product team provides it.
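
The inspection described above can also be done from code instead of an archiving tool. The following is only a minimal sketch, assuming the presentation is an ordinary .pptx (which is a ZIP package) and using nothing but System.IO.Compression from .NET. The names EmbeddingLister and ListEmbeddings are made up for illustration and are not part of Aspose.Slides.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class EmbeddingLister
{
    // Returns the names of all package parts stored under ppt/embeddings/.
    // A .pptx file is a ZIP archive, so neither PowerPoint nor Aspose is needed here.
    public static List<string> ListEmbeddings(Stream pptxStream)
    {
        var names = new List<string>();
        using (var archive = new ZipArchive(pptxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (var entry in archive.Entries)
            {
                var isEmbedding = entry.FullName.StartsWith(
                    "ppt/embeddings/", StringComparison.OrdinalIgnoreCase);
                // Skip pure directory entries, which end with a slash.
                var isDirectory = entry.FullName.EndsWith("/", StringComparison.Ordinal);
                if (isEmbedding && !isDirectory)
                {
                    names.Add(entry.FullName);
                }
            }
        }
        return names;
    }
}
```

Passing a stream from File.OpenRead on the saved presentation should list entries such as ppt/embeddings/oleObject1.bin, which can then be examined further.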

Thank you for the feedback.
As Microsoft changed its behavior towards “non-goddess-programmers” like us, is there any chance to address this on their side?

@rausch,

We can only help you with issues related to Aspose.Slides, and anything that is limited by PowerPoint is limited in Aspose.Slides as well.

So what can you offer as a solution for the following?

  • I cannot work with ActiveX FormFields, as that corrupts the file
  • I cannot work with embedded Word documents, as the documents are corrupt
  • I did not manage to set up a text area or similar that is usable as a FormField to read/write a text value

So for my purposes I can’t use PowerPoint at all at the moment.
My renewal for Total.Net is up for decision. Any suggestions?

@rausch,

I regret to share that, at present, extracting OLE data for use with public APIs like Aspose.Words or Aspose.Cells is not supported. An issue with ID SLIDESNET-39130 has already been created in our issue tracking system and shared with you. I have already shared the only possible approach at the moment at the following link. I request your patience until our product team provides the requested support.

You wrote that you posted a link with a possible solution/workaround, but I can’t find it.
You also mentioned that there are third-party libraries, but your post ends with …

Please share that information so I can take a look.

Side note: Why can I do all of this fine in Excel with Aspose.Cells but not in PowerPoint with Aspose.Slides?

@rausch,

Please click my name “mudassir” in my following post link; it will expand the post with the workaround information that I shared with you earlier.

Secondly, Aspose.Slides and Aspose.Cells are completely different APIs and cannot be compared.

Thanks for the hint, I didn’t know I could expand the quoted answer like this. I can take a look at this tomorrow then. Thanks.

About this: even if the API is different, that won’t change the way Microsoft handles OLE objects. So it seems PowerPoint has a different OLE implementation than Excel.

@rausch,

I have noted your comments and would like to mention that issue SLIDESNET-39130 has been created for this purpose, to implement an internal mechanism for accessing the OLE frame data. For now, the suggested option is the workaround sample code that I have shared with you.

The issues you found earlier (filed as SLIDESNET-37527) have been fixed in Aspose.Slides for .NET 18.1. Please try the latest release version, and if you experience any issue or have any further query, please feel free to contact us.

Hello,
what should this 37527 have fixed?
I tried Slides 18.1 with this code:

  foreach (var presentationSlide in presentationSlides)
  {
    var shapes = presentationSlide.Shapes;
    foreach (var shape in shapes)
    {
      var ole = shape as OleObjectFrame;
      if (ole == null || !ole.ObjectProgId.Contains("Word.Document")) continue;
      ole.UpdateAutomatic = true;

      var word = new Word();
      using (var ms = new MemoryStream(ole.ObjectData))
      {
        // Creates a file, but it is corrupt; Word can repair it on open and then I see the correct content
        // (for debugging purposes)
        File.WriteAllBytes(@"C:\repositories\latest\src\TQsoft\Test\TQsoft.Test.Common.Office\test_files\out\directstream.docx", ms.ToArray());
        // I now load the file into my Word class
        word.LoadFromStream(ms);
        // works fine in Excel, crashes here
        word.SetFormFieldValue(name, value);
        ole.ObjectData = word.WordStream.ToArray();
      }
    }
  }

The instance word in this code is just a wrapper class around Aspose.Words.
This code works fine with Aspose.Cells.

So ole.ObjectData = word.WordStream.ToArray(); crashes because the word stream is NULL, as the load before it fails.

So my problem persists, or I need help with how to get a valid Word document out of a PowerPoint file as an OLE object.

@rausch,

I would like to share that the issue SLIDESNET-37527 has been closed as not a bug. The details of a possible workaround were shared in my earlier response below. We have already added a new feature request, SLIDESNET-39130, to improve OLE handling and data extraction. At present, the only possible solution is the workaround shared in that response.

Hello,

sorry, I didn’t see that the workaround you offered still needs to be implemented.
And well I tried, but…

Now I get:
{“Invalid OLE structured storage file”}

Server stack trace: 
   at OpenMcdf.Header.CheckSignature() in C:\Users\Federico\Documents\Visual Studio 2015\Projects\test_openmcdf\test_openmcdf\sources\OpenMcdf\Header.cs:line 303
   at OpenMcdf.Header.Read(Stream stream) in C:\Users\Federico\Documents\Visual Studio 2015\Projects\test_openmcdf\test_openmcdf\sources\OpenMcdf\Header.cs:line 259
   at OpenMcdf.CompoundFile.Load(Stream stream) in C:\Users\Federico\Documents\Visual Studio 2015\Projects\test_openmcdf\test_openmcdf\sources\OpenMcdf\CompoundFile.cs:line 701
   at OpenMcdf.CompoundFile.LoadStream(Stream stream) in C:\Users\Federico\Documents\Visual Studio 2015\Projects\test_openmcdf\test_openmcdf\sources\OpenMcdf\CompoundFile.cs:line 747
   at OpenMcdf.CompoundFile..ctor(Stream stream) in C:\Users\Federico\Documents\Visual Studio 2015\Projects\test_openmcdf\test_openmcdf\sources\OpenMcdf\CompoundFile.cs:line 457
   at TQsoft.Common.Office.Powerpoint.SetFormFieldValue(String name, String value) in C:\repositories\latest\src\TQsoft\Common\TQsoft.Common.Office\Powerpoint.cs:line 263
   at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Object[]& outArgs)
   at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg)

Exception rethrown at [0]: 
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at TQsoft.Common.Office.Powerpoint.SetFormFieldValue(String name, String value)
   at TQsoft.Test.Common.Office.Program.Main() in C:\repositories\latest\src\TQsoft\Test\TQsoft.Test.Common.Office\TQsoft.Test.Common.Office\Program.cs:line 197
   at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
   at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
   at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()

It looks like this additional package cannot handle this object either, or what am I doing wrong?

My code as I use it:

  var presentationMasterSlides = _presentation.Masters;
  foreach (var presentationMasterSlide in presentationMasterSlides)
  {
    var shapes = presentationMasterSlide.Shapes;
    foreach (var shape in shapes)
    {
      if (!(shape is OleObjectFrame)) continue;
      var ole = shape as OleObjectFrame;
      if (!ole.ObjectProgId.Contains("Word.Document")) continue;
      ole.UpdateAutomatic = true;

      var word = new Word();
      using (var ms = new MemoryStream(ole.ObjectData))
      {
        ms.Position = 0;
        var compoundFile = new CompoundFile(ms);
        var stream = compoundFile.RootStorage.GetStream("Package");
        var packageData = stream.GetData();

        using (var packageDataStream = new MemoryStream(packageData))
        {
          // Creates a file, but it is corrupt; Word can repair it on open and then I see the correct content
          // (for debugging purposes)
          File.WriteAllBytes(@"C:\repositories\latest\src\TQsoft\Test\TQsoft.Test.Common.Office\test_files\out\directstream_master.docx", packageDataStream.ToArray());
          // I now load the file into my Word class
          word.LoadFromStream(packageDataStream);
          // works fine in Excel, crashes here
          word.SetFormFieldValue(name, value);
          ole.ObjectData = word.WordStream.ToArray();
        }
      }
    }
  }
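
The stack trace above ends in OpenMcdf’s Header.CheckSignature, which suggests that the bytes passed to CompoundFile did not start with the compound-file signature. One way to narrow this down might be to look at the first bytes of ole.ObjectData before choosing how to load it. This is only a minimal sketch based on the well-known file signatures for OLE2 compound files and ZIP archives. PayloadKind and OlePayloadSniffer are hypothetical helper names, not part of Aspose.Slides or OpenMcdf.

```csharp
using System;

public enum PayloadKind { Unknown, CompoundFile, ZipPackage }

public static class OlePayloadSniffer
{
    // Signature of an OLE2 compound file (structured storage).
    private static readonly byte[] CompoundFileMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of a ZIP local file header ("PK" 03 04), as used by .docx and .xlsx.
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies the data by its leading bytes only; the rest is not validated.
    public static PayloadKind Detect(byte[] data)
    {
        if (StartsWith(data, CompoundFileMagic)) return PayloadKind.CompoundFile;
        if (StartsWith(data, ZipMagic)) return PayloadKind.ZipPackage;
        return PayloadKind.Unknown;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (var i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

If Detect(ole.ObjectData) reports ZipPackage, the data is probably already a plain .docx and the CompoundFile step may not be needed; if it reports CompoundFile, reading the "Package" stream as in the code above would be the route to try.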

@rausch,

The solution that I shared involves a third-party API as a workaround. Please share a working sample project along with the source presentation. I will investigate it from the Aspose.Slides perspective and share whether we can offer a workaround, or ask you to wait until the feature is available in Aspose.Slides.

Hello,
I made a simple test project and there your solution seems to work. I need to investigate the cause and will come back to you if I need more help. Thank you so far!

OK, I managed to read the object, and when I save it I now have a correct Word document.
Now I run into the next issue.
I do modify the word stream, and then I want to put it back into the PowerPoint file.

so I do

ole.ObjectData = wordStream.ToArray();

But then the embedded document inside the PowerPoint file is corrupted. So somehow I need to convert it back into a correct, PowerPoint-like OLE format. I know that is third party, but as you gave the first workaround, can you help here, too?

UPDATE

Tried this

          compoundFile = new CompoundFile();
          compoundFile.RootStorage.AddStream("Package").SetData(word.WordStream.ToArray());
          ole.ObjectData = compoundFile.RootStorage.GetStream("Package").GetData();

This seems to work.
Can you verify?

New problem then:
As in Excel, the preview image is not up to date.
In Aspose.Cells I have ole.ImageData, so I render the preview and then save it.
How can I do this in Aspose.Slides?
I found this article, but in the current version PictureId is not available.

Thanks in advance. Yves