InsertHTML and image references being saved as files rather than being embedded

bryangrossman · June 5, 2008, 12:25pm

I have comments / questions about the InsertHtml call and image references within the html block. I am looking for a certain type of functionality when the document is saved, based on the type it is saved as. Currently, when I save the document as either Html or Doc, any referenced images are saved locally relative to the Aspose library and then a local file reference is dropped in as the image source. What I would like to do is have the images embedded in the document when saved as a “.doc”, and have the image sources left alone when saved as “.html”. How would I configure Aspose to accomplish this?
Bryan Grossman

Klepus · June 5, 2008, 4:24pm

Hello!
Thank you for your interesting question.
As I can see you’d like Aspose.Words to load the referenced files into the model automatically when you are inserting any image references with DocumentBuilder.InsertHtml. And this should be an option. Okay, it sounds like a reasonable improvement but the task is quite specific.
Since Aspose.Words.Document doesn’t know in what format you are planning to save it after construction there is really some option needed. It could be either a parameter to InsertHtml or a property of DocumentBuilder or of Document itself.
Some time ago we had a similar request on InsertHtml parameterization. It is unresolved yet but maybe you can help us to see the full picture:
Please advise how you imagine this improvement from the public API view.
Regards,

bryangrossman · June 6, 2008, 10:30am

Thank you for your swift response!
If it were up to me to implement this functionality in the Aspose.Words API, I would put it in an overloaded InsertHtml call, mostly for flexibility of use and minimal impact on other users' existing code. This way a user could have the image file referenced in one InsertHtml call, embedded in the document in another InsertHtml call, and have the image references left alone in yet another InsertHtml call, all in the same document. The issue you would then have to deal with is what to do with the embedded images when the user chooses to save in a format that is incompatible with embedded images. In that case you could reuse the code that saves the images out as file references on save.
But given my limited understanding of the internal operation of the library itself my suggestion may not be feasible. I find it fascinating that this has not been requested previously. In my mind I expected a “.doc” file to default to embed the images, but I guess you are dealing with a rather exotic case of how to process inserted html and how the document is saved and when these decisions are made.
In the meantime, I still need to accomplish this task with the current library, because I imagine that even if you started working on this solution now it might not make it into the next hot fix. So, given the current library, I am trying to insert a block of html that contains a referenced image. How would one go about getting that image inserted into the document within that block of html without having to alter the html?
Bryan Grossman

Klepus · June 8, 2008, 5:57am

Hello!
Thank you for your thoughtful feedback.
I think that an extra parameter to InsertHtml might not be the best option. First we should consider the “full picture”, not only this case. There are potentially many behavior aspects we might want to parameterize regarding InsertHtml. And of course you know what this could lead to: a combinatorial explosion of InsertHtml overloads. For instance, in that thread I gave you a link to, we were requested to support style mapping and allow passing some map right to InsertHtml. Two features require four overloads, three require eight, etc.
Adding a property to DocumentBuilder looks better at first sight. But we should not forget the import of complete documents into the Aspose.Words document model. Import can be done right in the Document constructor:
Document doc = new Document("DinnerInvitation.html");
And there are already several constructors and Load methods in the library (for stream, for file, with or without format detection). If we add an extra parameter to them we’ll have the same combinatorial explosion.
In Aspose.Words we have SaveOptions but we have no LoadOptions. Yet there are also some degrees of freedom that cannot be resolved automatically when documents are being loaded. These two requests are good examples of what could be done with LoadOptions, or whatever we end up calling it.
You are right, that’s not fast to make such changes in the library. At least we can suggest a workaround. When you are inserting an HTML fragment with an image reference you can switch to using DocumentBuilder.InsertImage. To detect this situation you probably need some regular expression parsing on the fragment. In general that fragment might contain anything else and you shouldn’t forget to process its remainders. Another idea is inserting the whole fragment and after that tuning ImageData of appropriate newly created Shape. Third idea is doing the same but after all insertions and before further processing. You can query all shapes in the document and suggest on each of them whether to load data from references or leave them as they are.
See these articles in our documentation:
https://reference.aspose.com/words/net/aspose.words.drawing/shape/
https://reference.aspose.com/words/net/aspose.words.drawing/imagedata/
Regarding further document saving, everything is straightforward. As I wrote, a Document instance doesn’t “know” in what format it will be saved. The Document class is the document model. When we save in a format without embedded images, we put the images separately on the file system. You can see that when exporting to HTML or Aspose.Pdf XML (the intermediate format for PDF export).
I have created a new issue for your request in our defect database:
#5286 – Embed referenced resources optionally in HTML import
We’ll investigate this further but I cannot promise you any release date. At least this case will be addressed and some update provided.
Regards,

bryangrossman · June 19, 2008, 10:38am

Thank you for your response! But I am having trouble understanding how to implement one of your solutions. The solution you mentioned that I chose to implement was this one:
“… Third idea is doing the same but after all insertions and before further processing. You can query all shapes in the document and suggest on each of them whether to load data from references or leave them as they are.”
Here is a snippet of code that I came up with based on the links you sent me:

NodeCollection shapes = doc.GetChildNodes(NodeType.Shape, true, true);
ArrayList shapesToDel = new ArrayList();
foreach (Shape s in shapes)
{
    // if the shape has an image and the image is linked...
    if (s.HasImage)
    {
        // add the shape to a delete list...
        shapesToDel.Add(s);
        // move the builder to the location of the shape...
        builder.MoveTo(s);
        // insert the image at that location....
        builder.InsertImage(s.ImageData.ImageBytes);
        // delete the image reference....
        // new FileInfo(s.ImageData.SourceFullName).Delete();
    }
}
// now delete the shapes that contained the image references....
foreach (Shape shape in shapesToDel)
{
    shape.Remove();
}

The problem is that it operates just as if I had done nothing… the images are not loaded into the document and there is an X where the image should be. I am sure I misunderstood what you are suggesting in this one line: “…suggest on each of them whether to load data from references…”. I could not find a property on the shape or the imagedata that was obvious to me for “suggesting” that the image save its data in the document. I am a little lost here, obviously… Any suggestions would be greatly appreciated!
Bryan Grossman

Klepus · June 19, 2008, 5:36pm

Hello!
Maybe I missed something in the explanation. In the current implementation Aspose.Words always embeds images referenced by HTML input. So we cannot suggest what links we’d like to retain after fragment insertion. In the shapes that were just inserted information about image sources is lost.
If you need all images to be embedded then everything is done. What can we do if you need to link images? After insertion file names are lost but you can match them with newly inserted shapes.

Find the shapes traversing nodes from the point your DocumentBuilder initially was.
In the fragment find all src=FileName. That could be normally done with regular expressions. Be aware of different quotation types here. The images should appear in the same order in the fragment and in the document even if some of them are inaccessible.
Using existing interface of Shape class you cannot remove image data so we’ll have to re-create shapes:
Shape image = new Shape(currentDoc, ShapeType.Image);
The newly created Shape contains or refers to nothing. At this point you can add the source file name with this property:
image.ImageData.SourceFullName = fileNameFromFragment;
Insert the new Shape and remove the original. You need only one loop. Everything can be done in the loop in which you are traversing the nodes. You can use InsertAfter of the parent node and after that remove the original Shape.
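
The fragment-parsing step above could be sketched roughly as follows. This is only an illustration using the standard .NET Regex class; the class name ImgSrcExtractor and the method name Extract are made up for this example and are not part of Aspose.Words. It assumes reasonably well-formed img tags and handles double-quoted, single-quoted, and unquoted src values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not an Aspose.Words API): collects the src value of
// every <img> tag in an HTML fragment, in document order.
public static class ImgSrcExtractor
{
    // Requires whitespace before "src" so attributes such as data-src are skipped.
    // The three alternatives cover "double-quoted", 'single-quoted' and unquoted values.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*(?:""(?<u>[^""]*)""|'(?<u>[^']*)'|(?<u>[^\s>]+))",
        RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        var sources = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            // Whichever alternative matched, its value lands in the group named "u".
            sources.Add(m.Groups["u"].Value);
        }
        return sources;
    }
}
```

The resulting list is in the same order as the images in the fragment, so the n-th entry can be paired with the n-th Shape found after the builder's starting position and assigned to SourceFullName on the re-created Shape. An unquoted value immediately followed by "/>" would pick up the trailing slash, so real input may need extra care.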

Yes, this doesn’t look so easy. Please let me know if you experience any difficulties with this approach.
Regards,

bryangrossman · June 20, 2008, 11:22am

Ok… First I must ask forgiveness because I was passing image virtual paths to Aspose and there was no server context for Aspose to find the images.
But I seem to have uncovered another problem and I am sure it’s an IIS permissions issue, but I am completely at a loss as to why I am having this problem and was hoping you might be able to help me figure this out.
It has to do with these image references and the ASP.NET Development Server vs. deploying on an IIS 6.0 server. I am seeing a difference in the way Aspose.Words operates when accessing the image references in my local project.
Aspose.Words has no problem accessing the image URL references when running in a debug session on an ASP.NET Development Server: the images are loaded and the document is built with no issues. But when the very same code is deployed to an IIS 6.0 server, it can’t seem to locate the local images. Now here is the kicker: if I open a browser on the server the application is deployed to and manually type in the image URL, it comes up with no problem. And if I include a reference to an image URL that is NOT located in the local project when inserting a block of HTML while building a document with Aspose.Words, it loads with no problem. I checked the images and they were deployed to the server. But Aspose can’t see them when running under IIS 6.0. There is obviously some sort of setting in IIS that I need to set for either the directory or the images… but I am oblivious to what it might be… Any help would be greatly appreciated….

Bryan Grossman

Klepus · June 20, 2008, 6:09pm

Hello!
According to the current logic Aspose.Words always tries to load referenced images to the document model. If it fails then “no image” image is added instead, namely a cross. But I repeat, even if images are not accessible (or URLs are fictive in your case) you might be able to match URLs to Shape objects. The workaround is applicable.
Aspose.Words doesn’t bring anything special to file handling. Files are opened with ordinary APIs. If they are accessible, everything works fine. If they aren’t, you see a red cross in place of the image. There is no magic. By default a web server is restricted to its home directory, without access to the host's entire hard disk. You can extend its permissions but this is not recommended.
Regards,

alexey.noskov · September 30, 2011, 9:23am

The issues you have found earlier (filed as WORDSNET-1701) have been fixed in this .NET update and in this Java update.