Serious issue with Aspose.Words creating crazily bloated RTF files

Hello,



I’m having a serious problem with Aspose.Words generating ridiculously large RTF files when converting .doc or .docx files that contain images. For instance:



- A 14.2MB .doc file is converted to an .rtf file that is 458.8MB.

- An 8.6MB .docx file is converted to an .rtf file that is 436.8MB.



In other words, the resulting file can be roughly 30 to 50 times larger than the original! A file that started off at only a few megabytes ends up at nearly half a gigabyte!



I have tried:



RtfSaveOptions saveOptions = new RtfSaveOptions();
saveOptions.setExportImagesForOldReaders(false);
saveOptions.setExportCompactSize(true);
doc.save(newPath, saveOptions);



However, this has made no difference whatsoever - the files are coming out just as large as before. I have attached two sample files that exhibit this problem. Please try converting them both to RTF format using the latest version of Aspose.Words and compare the original file size with the resulting RTF file size.



Note that if you look inside the .docx archive at the “media” directory and convert all of the image data there to hexadecimal, the total size of the hexadecimal data for all images, encoded as an ASCII string, comes to 9.6MB. If I take the images out, though, the RTF still swells to nearly 150MB, so it seems that Aspose is creating a lot of extra RTF codes that are inflating the file size.
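To make the hexadecimal estimate above concrete, here is a minimal sketch of how such a figure can be computed. It assumes the usual hex encoding of two ASCII characters per raw byte; the class and method names (HexSizeSketch, toHex, hexSize) are made up for illustration and are not part of Aspose.Words.

```java
class HexSizeSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Encode raw bytes as lowercase hex, two ASCII characters per byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(DIGITS[(b >> 4) & 0x0f]); // high nibble
            sb.append(DIGITS[b & 0x0f]);        // low nibble
        }
        return sb.toString();
    }

    // Size in bytes of the hex text for a given number of raw bytes
    // (assumption: plain hex, no line breaks or other padding).
    static long hexSize(long rawBytes) {
        return rawBytes * 2;
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x7f, (byte) 0xff};
        System.out.println(toHex(sample));          // prints "007fff"
        System.out.println(hexSize(sample.length)); // prints "6"
    }
}
```

By this measure, 9.6MB of hexadecimal text corresponds to roughly 4.8MB of raw image bytes, so the images alone cannot account for an RTF file of several hundred megabytes.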



Many thanks,

Keith

Hi there,

Thanks for your inquiry. Please note that Aspose.Words mimics the same behavior as MS Word does. If you convert your document to RTF using MS Word, you will get the same output.

Please let us know if you have any more queries.

Hi,



Thanks for the reply. You’re right, testing the export from Word does result in the same ballooned file size (and OpenOffice has an even worse result, with an even bigger file). But shouldn’t setExportCompactSize() fix this? According to the documentation for ExportCompactSize(), it reduces the RTF size at the expense of losing right-to-left support. But the enormous RTF file size is caused by inserting a lot of extra \lang… codes - several before each group of letters. Shouldn’t ExportCompactSize() avoid this behaviour to result in a truly compact size? I would argue that ExportCompactSize() is either not working as it should be or is not named correctly if it still results in a 9MB .docx file being saved as a 450MB RTF file.



For instance, if you have a Mac, try opening the .docx file I provided in Nisus Writer Pro and saving as an RTF file from there - it results in a 12MB RTF file. (I have attached the RTF file generated from Nisus so that you can see for yourself - it retains all the necessary features of the .docx file but keeps a reasonable file size.) Looking at the RTF code, it seems to achieve this by not adding in thousands and thousands of \lang codes.



Would it be possible in a future version to enhance ExportCompactSize(), or add another RtfSaveOptions setting, so that we can achieve similar reasonably-sized RTF conversions to Nisus?



Many thanks,

Keith

Hi Keith,

Thanks for your inquiry. I have tested the scenario and managed to reproduce the same issue on my side. I have logged this problem in our issue tracking system as WORDSNET-12126 and linked this forum thread to it, so you will be notified via this thread once the issue is resolved.

We apologize for the inconvenience.

Great, thank you for adding it to the list!
All the best,
Keith

Hi Keith,

Thanks for your patience. The issue you are facing is actually not a bug in Aspose.Words, so we have closed this issue (WORDSNET-12126) as ‘Not a Bug’. I am quoting the product team’s comments here for your reference.

The main reason for such file growth is the nature of the RTF format. It is an uncompressed text format.

A text property that can be described by 1 byte in DOC format takes many more bytes in RTF, because each property is described by a control word, for example ‘\insrsid10776229’.

Another drawback of the RTF format is that fully resolved properties are written for each element. That’s why an RTF document is much bigger than a DOCX. In DOCX we can have a paragraph style, and all formatting is described in one place. But for RTF we must write this formatting for EVERY paragraph and EVERY run.

The main reason your file is so large is that it has too many small runs. In this case, we suggest calling the Document.joinRunsWithSameFormatting method before saving the document. It helps make the output almost half the size.
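To illustrate why merging runs shrinks the output, here is a minimal, self-contained sketch of the idea. It is not the Aspose.Words implementation: the Run class, joinRuns and serializedSize are hypothetical names, and the size model (formatting written out in full before every run) is a simplification of what is described above.

```java
import java.util.ArrayList;
import java.util.List;

class RunJoinSketch {
    // Hypothetical model: a run is a stretch of text with one set of formatting.
    static final class Run {
        final String format; // RTF-like control words, e.g. "\f0\fs24\lang1033"
        final String text;
        Run(String format, String text) {
            this.format = format;
            this.text = text;
        }
    }

    // Merge adjacent runs whose formatting strings are identical.
    static List<Run> joinRuns(List<Run> runs) {
        List<Run> out = new ArrayList<>();
        for (Run r : runs) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).format.equals(r.format)) {
                out.set(last, new Run(r.format, out.get(last).text + r.text));
            } else {
                out.add(r);
            }
        }
        return out;
    }

    // Characters needed when the formatting is repeated in full for every run.
    static int serializedSize(List<Run> runs) {
        int size = 0;
        for (Run r : runs) {
            size += r.format.length() + 1 + r.text.length(); // format, space, text
        }
        return size;
    }

    public static void main(String[] args) {
        String fmt = "\\f0\\fs24\\lang1033";
        List<Run> runs = new ArrayList<>();
        for (String piece : new String[] {"Hello", ", ", "world", "!"}) {
            runs.add(new Run(fmt, piece));
        }
        List<Run> joined = joinRuns(runs);
        System.out.println(runs.size() + " runs, " + serializedSize(runs) + " chars");
        System.out.println(joined.size() + " runs, " + serializedSize(joined) + " chars");
    }
}
```

In the code posted earlier in this thread, the suggested call would presumably go right before the save, i.e. doc.joinRunsWithSameFormatting(); followed by doc.save(newPath, saveOptions);.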

Many thanks for your reply. That .joinRunsWithSameFormatting() method makes a huge difference! I’ve just added that and that seems to solve my problems - I can now import the 400,000-word file with multiple images that was previously choking. The main problem for me is that I have to post-process the RTF files to get them working nicely with the Apple text system, and all those small runs were causing the post-processing to take forever. So this is a great solution, many thanks for pointing me towards it!

All the best,
Keith

Hi Keith,

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.