Saving as PDF or HTML - file names and locations for image files


#1

Rather than just customers requesting or proposing features in this forum, it would be nice if it could work the other way around too.

For example, I’m wondering what would be a good solution for naming and locating image files when the document is saved in Aspose.Pdf.Xml or HTML format.

You need to understand that images are not embedded in the XML or HTML file; they must be placed in separate files that the document file references. This is true even when you save XML or HTML into a memory stream: the images must still go into files.

Current approach
The current approach to creating Aspose.Pdf.Xml files can be described as follows:
1. The image file name is a GUID found in the Word file (all images have GUIDs) plus an appropriate extension such as .jpg.
2. If you save into a file, the images are created in the same directory as the destination file.
3. If you save into a stream (the name of the destination file is not known to Aspose.Word), the images are saved into the Windows temporary folder.
4. Because Aspose.Pdf.Xml is usually not the “end” format, and the document is converted into PDF using Aspose.Pdf as a next step, Aspose.Pdf provides a special option, IsImagesInXmlDeleteNeeded, which tells it to delete all image files referenced by the XML file after the file has been processed.

Potential problem
We have not had reports of this problem yet, but in a web server scenario where multiple users request the same document at the same time, multiple threads could be saving the document as Aspose.Pdf.Xml and therefore trying to save the same images into files with the same names. This could result in file access violations. Also, when Aspose.Pdf deletes the image files after it finishes processing one document, processing of the other documents may not be finished yet, so those documents could find their images missing.

Possible solutions
There are a number of alternative solutions we are considering, including making sure all image file names are globally unique even if the same document is generated at the same time for several users.
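
To make the "globally unique names" idea concrete, here is a minimal sketch in Python (chosen only for illustration, it is not Aspose code). It assumes the name is generated fresh for every save operation instead of being taken from the GUID stored in the Word file, so two threads saving the same document never write to the same file. The function names `unique_image_path` and `save_images` are hypothetical.

```python
import uuid
from pathlib import Path
from typing import List


def unique_image_path(folder: Path, extension: str) -> Path:
    """Return a path whose name is unique per call, not per source image."""
    # uuid4() is random, so two concurrent saves of the same document
    # get different file names for the same picture.
    return folder / f"{uuid.uuid4().hex}{extension}"


def save_images(images: List[bytes], folder: Path, extension: str = ".jpg") -> List[Path]:
    """Write each image to its own uniquely named file and return the paths."""
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for data in images:
        path = unique_image_path(folder, extension)
        # Mode "xb" raises FileExistsError instead of overwriting an existing file.
        with open(path, "xb") as f:
            f.write(data)
        paths.append(path)
    return paths
```

The trade-off is that identical images are written once per request rather than shared, which costs disk space but also means one request can delete its own images without affecting another.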

The HTML writer we are working on now requires the same solution.

It would be nice to hear your ideas about where to put image files and how to name them when saving in HTML or XML format. Keep scalability in mind.




#2

Starting with Aspose 1.7.1, the behaviour for image file names when saving as PDF or HTML is as follows:

When saving to a file (filename is available):
1. The images are created in the same folder as the document.
2. Image file names are .xxxx., where xxxx is just an incremented number.

When saving to a stream (no document filename is available):
1. The images are stored in a temporary folder obtained from System.IO.Path.GetTempPath().
2. Image file names are Aspose.Word.dddd.xxxx., where dddd is a date-time stamp (same for all images of the document) and xxxx is an incremented number.
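
As a rough illustration of the stream case, the naming scheme could be sketched like this in Python (not Aspose code). The time stamp format, the four-digit zero padding, and the trailing file extension are all assumptions made for the example; the post above does not specify them. `stream_image_names` is a hypothetical helper.

```python
from datetime import datetime
from itertools import count
from typing import Iterator, Optional


def stream_image_names(extension: str, now: Optional[datetime] = None) -> Iterator[str]:
    """Yield names of the form Aspose.Word.<stamp>.<counter>.<extension>."""
    # The stamp is computed once per document, so all of its images share it.
    # The strftime pattern below is an assumed format, not the real one.
    stamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S%f")
    for n in count(1):
        # Counter starts at 1 and is zero-padded to four digits (assumed).
        yield f"Aspose.Word.{stamp}.{n:04d}.{extension}"
```

In this sketch, two documents saved in the same instant would still get the same stamp, so a time stamp on its own narrows the chance of a name clash but does not remove it.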

Please note that Aspose.Word does not delete saved files, so take care that you don’t end up accumulating thousands of image files in the temporary folder. The easiest approach is to periodically delete all Aspose.Word.xxx files in the temporary folder.
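
A periodic cleanup along those lines might look like the following Python sketch. It assumes the leftover files all start with "Aspose.Word." and that anything older than a chosen age is safe to remove; both the `delete_old_images` name and the age threshold are illustrative choices, not part of Aspose.Word.

```python
import time
from pathlib import Path


def delete_old_images(folder: Path, max_age_seconds: float,
                      pattern: str = "Aspose.Word.*") -> int:
    """Delete files matching pattern that are older than max_age_seconds.

    Returns the number of files deleted.
    """
    cutoff = time.time() - max_age_seconds
    deleted = 0
    for path in folder.glob(pattern):
        try:
            # Only touch regular files last modified before the cutoff.
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                deleted += 1
        except OSError:
            # The file vanished or is still in use by another process; skip it.
            continue
    return deleted
```

For example, calling `delete_old_images(Path(tempfile.gettempdir()), 3600)` from a scheduled job (with `import tempfile`) would remove matching files older than one hour. Using an age threshold rather than deleting everything avoids removing images that a request in progress still needs.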