Embedded Document within it

Hello Team,

I am working with documents in several formats (docx/pptx/pdf/xlsx).
I want to know whether a document contains an embedded document.
Is there any way I can find such documents?

My requirement is that I am converting all documents to HTML, but if a document contains an embedded document, I have to reject it.
Can you suggest a way to detect embedded documents in all of these formats?

@kotharib2,

Thanks for your query.

  1. To detect whether an MS Excel file contains an OLE object, you may try the following sample code using Aspose.Cells:

      //Requires: using Aspose.Cells; using Aspose.Cells.Drawing;
      Workbook workbook = new Workbook("e:\\test2\\Book1.xlsx");
      bool hasOle = false;
      //Loop through all worksheets and check each embedded objects collection.
      foreach (Worksheet sheet in workbook.Worksheets)
      {
          OleObjectCollection oles = sheet.OleObjects;
          if (oles.Count > 0) //If the count is >0 then the sheet contains embedded OLE object(s).
          {
              hasOle = true; //So, you will write your own code here (e.g. reject the file).
              break;
          }
      }

      //Save the workbook to HTML
      workbook.Save("e:\\test2\\out1.html", SaveFormat.Html);
    
  2. The embedded documents inside presentations are in the form of OLE objects. Using Aspose.Slides, you may try the following sample code:

         public static void FindOleObjects()
         {
             Presentation pres=new Presentation("test.pptx");
             foreach (ISlide slide in pres.Slides)
             {
                 foreach (IShape shape in slide.Shapes)
                 {
                     if (shape is IOleObjectFrame)
                     {
                         //Simply hides the shapes
                         shape.Hidden = true;
                     }
                 }
    
             }
         }
    

The above sample identifies such shapes and hides them.

  3. In the Aspose.PDF API, the Document class exposes the EmbeddedFiles collection, which you can use to check for embedded files. See the following link for further information about this collection:
    https://docs.aspose.com/pdf/net/attachments/
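
A minimal sketch of how the EmbeddedFiles collection mentioned above could be used for the rejection check. The class name PdfAttachmentCheck and the method HasEmbeddedFiles are hypothetical helpers made up for illustration. The sketch assumes the Aspose.PDF for .NET assembly is referenced, and it only looks at document-level attachments:

```csharp
using System;
using Aspose.Pdf;

public static class PdfAttachmentCheck
{
    // Hypothetical helper: returns true when the PDF carries at least one
    // document-level embedded file (attachment).
    public static bool HasEmbeddedFiles(string path)
    {
        using (Document pdf = new Document(path))
        {
            // Print the name of each attachment so it is clear what triggered the rejection.
            foreach (FileSpecification file in pdf.EmbeddedFiles)
            {
                Console.WriteLine("Embedded file: " + file.Name);
            }
            return pdf.EmbeddedFiles.Count > 0;
        }
    }
}
```

A caller could then skip the HTML conversion whenever HasEmbeddedFiles returns true.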

Regarding detecting embedded objects in MS Word documents, we will share more details soon.

@kotharib2,

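The following code example finds OLE objects and removes them from the document using the Aspose.Words API:

// Requires: using Aspose.Words; using Aspose.Words.Drawing;
Document doc = new Document(MyDir + "input.docx");
// Iterate over a snapshot (ToArray) because nodes are removed inside the loop.
foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true).ToArray())
{
    if (shape.OleFormat != null)
        shape.Remove();
}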

Hope this helps a bit.

Hi, thanks for your quick reply.
This helped a lot.

I also need the names of the OLE shapes for all formats.

That is, I need to know which type of shape I am removing, for all formats.

Can you guide me on this?

@kotharib2,

You should explore the properties of the relevant APIs further. For example, for an MS Excel spreadsheet, you can easily get the values of the relevant properties using the Aspose.Cells API:

// ...
// Get the OleObject collection of the worksheet at index z.
Aspose.Cells.Drawing.OleObjectCollection oles = workbook.Worksheets[z].OleObjects;
// Loop through all the OLE objects in the worksheet and read their properties.
for (int i = 0; i < oles.Count; i++)
{
    Aspose.Cells.Drawing.OleObject ole = oles[i];
    string title = ole.Title;
    string text = ole.Text;
    string name = ole.Name;

    string fileName = ole.ObjectSourceFullName;
    // ...
}

@kotharib2,

For the Aspose.Slides API, you can cast the shape to IOleObjectFrame using the following sample code. It extends the previously shared code segment:

if (shape is IOleObjectFrame)
{
    //Simply hides the shapes
    shape.Hidden = true;
    IOleObjectFrame ole = (IOleObjectFrame)shape;
}

After the cast, the members of IOleObjectFrame are available through the ole variable.
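
As a rough sketch of what the cast gives access to, the snippet below prints a name and type for each OLE frame. It assumes the ObjectName and ObjectProgId properties of IOleObjectFrame and the SlideNumber property of ISlide. The class OleFrameInfo and the method PrintOleFrames are hypothetical names used only for illustration, and shapes nested inside group shapes are not visited:

```csharp
using System;
using Aspose.Slides;

public static class OleFrameInfo
{
    // Hypothetical helper: lists the top-level OLE frames of every slide.
    public static void PrintOleFrames(string path)
    {
        using (Presentation pres = new Presentation(path))
        {
            foreach (ISlide slide in pres.Slides)
            {
                foreach (IShape shape in slide.Shapes)
                {
                    // "as" yields null for shapes that are not OLE frames.
                    IOleObjectFrame ole = shape as IOleObjectFrame;
                    if (ole == null)
                        continue;

                    Console.WriteLine("Slide " + slide.SlideNumber + ": "
                        + ole.ObjectName + " (" + ole.ObjectProgId + ")");
                }
            }
        }
    }
}
```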

Also, using the Aspose.Words API, the following code example shows how to get the OLE file name and extension:

Document doc = new Document(MyDir + "input.docx");
                 
foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
{
    if (shape.OleFormat != null)
    {
        Console.WriteLine(shape.OleFormat.SuggestedFileName);
        Console.WriteLine(shape.OleFormat.SuggestedExtension);
    }
}

Hi @Amjad_Sahi

Thanks for your help.
For pptx, everything is perfect.

For docx, I am getting the embedded document details for the document as a whole, but I need to fetch them with page numbers.

I need to know on which page of the Word file the embedded document is located.

How can I iterate over the shape child nodes page by page?

@kotharib2,

Thanks for your inquiry.

Please use the LayoutCollector.GetStartPageIndex method to get the page number where the node begins.

Document doc = new Document(MyDir + "input.docx");
LayoutCollector collector = new LayoutCollector(doc);
foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
{
    if (shape.OleFormat != null)
    {
        Console.WriteLine(shape.OleFormat.SuggestedFileName);
        Console.WriteLine(shape.OleFormat.SuggestedExtension);
        Console.WriteLine("Page number : " + collector.GetStartPageIndex(shape));
    }
}

If you want to get the shape nodes of a specific page, you need to iterate over the shape nodes of the document and check each page number using the LayoutCollector.GetStartPageIndex method.
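
The per-page filtering described above could be sketched as follows. The class OleShapesByPage, the method GetOleShapesOnPage and the pageNumber parameter are hypothetical names introduced for illustration. The sketch assumes that GetStartPageIndex returns a 1-based page index:

```csharp
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Drawing;
using Aspose.Words.Layout;

public static class OleShapesByPage
{
    // Hypothetical helper: collects the OLE shapes that start on the given page.
    public static List<Shape> GetOleShapesOnPage(Document doc, int pageNumber)
    {
        LayoutCollector collector = new LayoutCollector(doc);
        List<Shape> result = new List<Shape>();

        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            // Keep only shapes that hold an OLE object and begin on the requested page.
            if (shape.OleFormat != null && collector.GetStartPageIndex(shape) == pageNumber)
            {
                result.Add(shape);
            }
        }

        return result;
    }
}
```

For example, GetOleShapesOnPage(doc, 2) would return the OLE shapes that begin on the second page.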

Thanks for your help :)