Extraneous font information in generated RTF

Hi,

I’m evaluating Aspose.Words for .NET. I need to extract formatted snippets from Word documents and store them in a database. Later, these snippets are edited in a text editor, for example as RTF.

I have the following methods that create the snippets and save them as RTF:

using System.Collections.Generic;
using System.IO;
using Aspose.Words;

// Builds a new single-paragraph document containing copies of the given runs.
private Document CreateDocument(List<Run> runs)
{
    Document result = new Document();

    if (runs.Count > 0)
    {
        // Import with the source formatting so each run keeps its font and size.
        NodeImporter nodeImporter = new NodeImporter(runs[0].Document, result, ImportFormatMode.KeepSourceFormatting);

        foreach (Run run in runs)
        {
            Node importedNode = nodeImporter.ImportNode(run, true);
            result.Sections[0].Body.Paragraphs[0].AppendChild(importedNode);
        }
    }

    return result;
}

private void OnTransUnitFound(object sender, TransUnitFoundEventArgs e)
{
    Document sourceSegmentsDoc = CreateDocument(e.SourceSegmentRuns);
    MemoryStream ms = new MemoryStream();
    sourceSegmentsDoc.Save(ms, SaveFormat.Rtf);

    // Rewind the stream and read the generated RTF back as text.
    ms.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(ms);

    string rtf = sr.ReadToEnd();
}

However, the RTF seems to include a lot of unused styles and is very big:

{\rtf1\ansi\ansicpg1252\uc0\stshfdbch0\stshfloch0\stshfhich0\stshfbi0\deff0\adeff0{\fonttbl{\f0\froman\fcharset0\fprq2{*\panose 02020603050405020304}Times New Roman;}{\f1\froman\fcharset0\fprq2{*\panose 05050102010706020507}Symbol;}{\f2\fswiss\fcharset0\fprq2{*\panose 020b0604020202020204}Arial;}}{\colortbl;\red255\green0\blue0;}{\stylesheet{\s0\snext0\sqformat\spriority0\ltrpar\li0\lin0\ri0\rin0\ql\faauto\rtlch\afs24\ltrch\fs24 Normal;}{*\cs10\additive\ssemihidden\spriority0 Default Paragraph Font;}}{*\rsidtbl\rsid10976062}{*\generator Aspose.Words for .NET 9.5.0.0;}{\info\version1\edmins0\nofpages1\nofwords0\nofchars0\nofcharsws0}\deflang1033\deflangfe2052\adeflang1025\jexpand\showxmlerrors1\validatexml1{*\wgrffmtfilter 013f}\viewkind1\viewscale100\fet0\ftnbj\aenddoc\ftnrstcont\aftnrstcont\ftnnar\aftnnrlc\widowctrl\nospaceforul\nolnhtadjtbl\alntblind\lyttblrtgr\dntblnsbdb\noxlattoyen\wrppunct\nobrkwrptbl\expshrtn\snaptogridincell\asianbrkrule\htmautsp\noultrlspc\useltbaln\splytwnine\ftnlytwnine\lytcalctblwd\allowfieldendsel\lnbrkrule\formshade\nojkernpunct\dghspace180\dgvspace180\dghorigin1800\dgvorigin1440\dghshow1\dgvshow1\dgmargin\pgbrdrhead\pgbrdrfoot\sectd\sectlinegrid360\pgwsxn12240\pghsxn15840\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440\guttersxn0\headery708\footery708\colsx708\ltrsect\sectdefaultcl\pard\plain\itap0\s0\ltrpar\li0\lin0\ri0\rin0\ql\faauto\rtlch\afs24\ltrch\fs24{\rtlch\afs24\ltrch\b\fs24\cf1 Evaluation Only. Created with Aspose.Words. 
Copyright 2003-2010 Aspose Pty Ltd.}{\rtlch\afs24\ltrch\fs24\par}\pard\plain\itap0\s0\ltrpar\li0\lin0\ri0\rin0\ql\faauto\rtlch\afs24\ltrch\fs24{\rtlch\afs24\ltrch\fs32\f2\lang1024\langnp1024\langfe1024\langfenp1024\cs10\v\noproof B A U T E C H N I K}{\rtlch\afs24\ltrch\fs24\insrsid10976062\par}{*\latentstyles\lsdstimax267\lsdlockeddef0\lsdsemihiddendef0\lsdunhideuseddef0\lsdqformatdef0\lsdprioritydef0{\lsdlockedexcept\lsdqformat1 Normal;\lsdqformat1 heading 1;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 2;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 3;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 4;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 5;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 6;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 7;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 8;\lsdsemihidden1\lsdunhideused1\lsdqformat1 heading 9;\lsdsemihidden1\lsdunhideused1\lsdqformat1 caption;\lsdqformat1 Title;\lsdqformat1 Subtitle;\lsdqformat1 Strong;\lsdqformat1 Emphasis;\lsdsemihidden1\lsdpriority99 Placeholder Text;\lsdqformat1\lsdpriority1 No Spacing;\lsdpriority60 Light Shading;\lsdpriority61 Light List;\lsdpriority62 Light Grid;\lsdpriority63 Medium Shading 1;\lsdpriority64 Medium Shading 2;\lsdpriority65 Medium List 1;\lsdpriority66 Medium List 2;\lsdpriority67 Medium Grid 1;\lsdpriority68 Medium Grid 2;\lsdpriority69 Medium Grid 3;\lsdpriority70 Dark List;\lsdpriority71 Colorful Shading;\lsdpriority72 Colorful List;\lsdpriority73 Colorful Grid;\lsdpriority60 Light Shading Accent 1;\lsdpriority61 Light List Accent 1;\lsdpriority62 Light Grid Accent 1;\lsdpriority63 Medium Shading 1 Accent 1;\lsdpriority64 Medium Shading 2 Accent 1;\lsdpriority65 Medium List 1 Accent 1;\lsdsemihidden1\lsdpriority99 Revision;\lsdqformat1\lsdpriority34 List Paragraph;\lsdqformat1\lsdpriority29 Quote;\lsdqformat1\lsdpriority30 Intense Quote;\lsdpriority66 Medium List 2 Accent 1;\lsdpriority67 Medium Grid 1 Accent 1;\lsdpriority68 Medium Grid 2 
Accent 1;\lsdpriority69 Medium Grid 3 Accent 1;\lsdpriority70 Dark List Accent 1;\lsdpriority71 Colorful Shading Accent 1;\lsdpriority72 Colorful List Accent 1;\lsdpriority73 Colorful Grid Accent 1;\lsdpriority60 Light Shading Accent 2;\lsdpriority61 Light List Accent 2;\lsdpriority62 Light Grid Accent 2;\lsdpriority63 Medium Shading 1 Accent 2;\lsdpriority64 Medium Shading 2 Accent 2;\lsdpriority65 Medium List 1 Accent 2;\lsdpriority66 Medium List 2 Accent 2;\lsdpriority67 Medium Grid 1 Accent 2;\lsdpriority68 Medium Grid 2 Accent 2;\lsdpriority69 Medium Grid 3 Accent 2;\lsdpriority70 Dark List Accent 2;\lsdpriority71 Colorful Shading Accent 2;\lsdpriority72 Colorful List Accent 2;\lsdpriority73 Colorful Grid Accent 2;\lsdpriority60 Light Shading Accent 3;\lsdpriority61 Light List Accent 3;\lsdpriority62 Light Grid Accent 3;\lsdpriority63 Medium Shading 1 Accent 3;\lsdpriority64 Medium Shading 2 Accent 3;\lsdpriority65 Medium List 1 Accent 3;\lsdpriority66 Medium List 2 Accent 3;\lsdpriority67 Medium Grid 1 Accent 3;\lsdpriority68 Medium Grid 2 Accent 3;\lsdpriority69 Medium Grid 3 Accent 3;\lsdpriority70 Dark List Accent 3;\lsdpriority71 Colorful Shading Accent 3;\lsdpriority72 Colorful List Accent 3;\lsdpriority73 Colorful Grid Accent 3;\lsdpriority60 Light Shading Accent 4;\lsdpriority61 Light List Accent 4;\lsdpriority62 Light Grid Accent 4;\lsdpriority63 Medium Shading 1 Accent 4;\lsdpriority64 Medium Shading 2 Accent 4;\lsdpriority65 Medium List 1 Accent 4;\lsdpriority66 Medium List 2 Accent 4;\lsdpriority67 Medium Grid 1 Accent 4;\lsdpriority68 Medium Grid 2 Accent 4;\lsdpriority69 Medium Grid 3 Accent 4;\lsdpriority70 Dark List Accent 4;\lsdpriority71 Colorful Shading Accent 4;\lsdpriority72 Colorful List Accent 4;\lsdpriority73 Colorful Grid Accent 4;\lsdpriority60 Light Shading Accent 5;\lsdpriority61 Light List Accent 5;\lsdpriority62 Light Grid Accent 5;\lsdpriority63 Medium Shading 1 Accent 5;\lsdpriority64 Medium Shading 2 Accent 5;\lsdpriority65 
Medium List 1 Accent 5;\lsdpriority66 Medium List 2 Accent 5;\lsdpriority67 Medium Grid 1 Accent 5;\lsdpriority68 Medium Grid 2 Accent 5;\lsdpriority69 Medium Grid 3 Accent 5;\lsdpriority70 Dark List Accent 5;\lsdpriority71 Colorful Shading Accent 5;\lsdpriority72 Colorful List Accent 5;\lsdpriority73 Colorful Grid Accent 5;\lsdpriority60 Light Shading Accent 6;\lsdpriority61 Light List Accent 6;\lsdpriority62 Light Grid Accent 6;\lsdpriority63 Medium Shading 1 Accent 6;\lsdpriority64 Medium Shading 2 Accent 6;\lsdpriority65 Medium List 1 Accent 6;\lsdpriority66 Medium List 2 Accent 6;\lsdpriority67 Medium Grid 1 Accent 6;\lsdpriority68 Medium Grid 2 Accent 6;\lsdpriority69 Medium Grid 3 Accent 6;\lsdpriority70 Dark List Accent 6;\lsdpriority71 Colorful Shading Accent 6;\lsdpriority72 Colorful List Accent 6;\lsdpriority73 Colorful Grid Accent 6;\lsdqformat1\lsdpriority19 Subtle Emphasis;\lsdqformat1\lsdpriority21 Intense Emphasis;\lsdqformat1\lsdpriority31 Subtle Reference;\lsdqformat1\lsdpriority32 Intense Reference;\lsdqformat1\lsdpriority33 Book Title;\lsdsemihidden1\lsdunhideused1\lsdpriority37 Bibliography;\lsdsemihidden1\lsdunhideused1\lsdqformat1\lsdpriority39 TOC Heading;}}}

The only interesting information here is that the snippet contains the string B A U T E C H N I K in 16-point Arial. How can I get rid of all the extraneous style information?

BTW, there is also text inserted by Aspose because I’m using the evaluation version; this is not a problem.

Thanks

Hi

Thanks for your request. Aspose.Words creates a full RTF document with a style list and so on, just like MS Word does. The only way to reduce the size of the RTF document a little is to set the ExportCompactSize property; please see the following link:
https://reference.aspose.com/words/net/aspose.words.saving/rtfsaveoptions/exportcompactsize/
Best regards,

I did a test with Winword and the generated RTF is even bigger :(. I also tried the ExportCompactSize option, but the RTF still contains a lot of unneeded information.

The thing is that I need to store formatted text segments in a database. Storing 7 KB for a segment that contains 20 characters of useful information is not feasible; the database would grow too big much too fast.

I also had a look at Document.Styles, but it seems it isn’t possible to remove items from that collection. It seems I’m stuck at the moment. Do you see another way to remove unwanted styles?

If this isn’t possible I’ll need to search for another library (we’re currently evaluating Aspose).

Compress the segments before saving them maybe?

Hi
Thanks for your request. Unfortunately, currently there is no way to remove styles from the document. Your request has been linked to the appropriate issue. You will be notified as soon as this feature is available.
Best regards,

Frebben:
Compress the segments before saving them maybe?

Frebben, yes, I also thought about that, but the reduction in size won’t be dramatic. As a test I saved the whole RTF segment as given to me by Aspose to a file and compressed it with 7-Zip. I think the compressed size was about a quarter of the original size. This is better, but it still wastes too much space.
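
For reference, compressing a segment in code before writing it to the database could look roughly like the sketch below. It is only an illustration: it uses GZipStream from the .NET base class library rather than 7-Zip, and the class and method names (SnippetCompression, Compress, Decompress) are made up for this example.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class SnippetCompression
{
    // Compresses the RTF text into a byte array that can be stored in a BLOB column.
    public static byte[] Compress(string rtf)
    {
        byte[] raw = Encoding.UTF8.GetBytes(rtf);
        using (MemoryStream output = new MemoryStream())
        {
            // The GZipStream has to be closed before the buffer is read,
            // otherwise the final block and the trailer are missing.
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return output.ToArray();
        }
    }

    // Restores the original RTF text from the stored bytes.
    public static string Decompress(byte[] compressed)
    {
        using (MemoryStream input = new MemoryStream(compressed))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling SnippetCompression.Compress(rtf) gives the bytes to store, and SnippetCompression.Decompress(bytes) gives the RTF back before it is shown in the editor. How much is saved depends on how repetitive the RTF is.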

AndreyN:
Hi
Thanks for your request. Unfortunately, currently there is no way to remove styles from the document. Your request has been linked to the appropriate issue. You will be notified as soon as this feature is available.
Best regards,

Andrey, thanks in advance for notifying me when the feature is available. Meanwhile, I’ll see if I can find a workaround.

Best regards

Hi Christophe,
Thanks for your inquiry.
You could also try to use a custom template with only a few styles defined in order to reduce the output RTF size. Please see the attached document which only defines a few standard styles. I have done a quick removal of most styles from the document directly by opening the document using WinZip and editing the content in styles.xml. If you load this template as your base document instead then the output RTF should not export any unused styles. I also suggest clearing any built-in and custom document properties to reduce size.
Combined with the techniques suggested by Andrey and Magnus, a quick test yielded a zipped output file under 1 KB in size.
Hopefully this is useful.
Thanks,

Hi all,

I was able to further decrease the size of the generated RTF by combining everyone’s suggestions:

  1. I improved on Adam’s suggestion by creating an RTF template in a MemoryStream that contains only the following string: {\rtf }. I then load this template into an Aspose.Words Document object and insert my RTF segment.
  2. I set RtfSaveOptions.ExportCompactSize as suggested by Andrey.
  3. I compress the RTF as suggested by Magnus.

I was able to obtain the following results:

  • Test segment as text: 175 bytes
  • Test segment as RTF: 1239 bytes
  • Test segment as zipped RTF: 765 bytes

The ratio between zipped RTF and pure text is now 765 / 175 = 4.37, which is starting to get into an acceptable range. Thanks all for your suggestions!

Any idea on how to further reduce the RTF size? For example, I see a {\*\generator Aspose.Words for .NET 9.5.0.0;} group in the RTF generated by Aspose (see below). I guess that this is a comment that I can safely remove?

{\rtf1\ansi\ansicpg1252\uc0\stshfdbch1\stshfloch1\stshfhich1\stshfbi1\deff1\adeff0{\fonttbl{\f0\fnil\fcharset0 Arial;}{\f1\fnil\fcharset0 Times New Roman;}}{\colortbl;\red0\green0\blue0;}{\stylesheet{\s0\snext0\styrsid8412110\sqformat\spriority0\ltrpar\li0\lin0\ri0\rin0\ql\faauto\rtlch\afs24\ltrch\fs24 Normal;}{\*\cs10\additive\ssemihidden\spriority0 Default Paragraph Font;}}{\*\rsidtbl\rsid3679196}{\*\generator Aspose.Words for .NET 9.5.0.0;}{\info\version0\edmins0\nofpages0\nofwords0\nofchars0\nofcharsws0}\deflang1033\deflangfe2052\adeflang1025\jexpand\showxmlerrors1\validatexml1\viewscale100\fet0\dghspace180\dgvspace180\dghorigin1800\dgvorigin1440\dghshow1\dgvshow1\dgmargin\sectd\ltrsect\sectdefaultcl\pard\plain\itap0\s0\ltrpar\li0\lin0\ri0\rin0\ql\faauto\rtlch\afs24\ltrch\fs24{\rtlch\af0\afs17\ltrch\fs17\f0\lang2070\langnp2070\langfe1033\langfenp1033\insrsid3679196\cs10\v0\cf1 Aparafusar, firmemente, o encosto de amarra\'e7\'e3o com o anel de compensa\'e7\'e3o de metal encaixado (5) e arruela amortecedora (6) na \'e2ncora de amarra\'e7\'e3o de saneamento WEC}{\rtlch\afs24\ltrch\fs24\par}{\*\latentstyles\lsdstimax267\lsdlockeddef0\lsdsemihiddendef1\lsdunhideuseddef1\lsdqformatdef0\lsdprioritydef99{\lsdlockedexcept}}}

Hi Christophe,

Thanks for your request. Yes, this is a comment and you can safely remove it.
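
A minimal sketch of stripping that group from the saved RTF string is shown below. It assumes the group is written as {\*\generator ...;} and that its body is plain text without nested braces; the class name RtfGenerator and the method Strip are hypothetical, not part of Aspose.Words.

```csharp
using System.Text.RegularExpressions;

public static class RtfGenerator
{
    // Matches the ignorable destination {\*\generator ...;}.
    // [^{}]* stops at the first closing brace, so nothing outside the group is touched.
    private static readonly Regex GeneratorGroup =
        new Regex(@"\{\\\*\\generator[^{}]*\}", RegexOptions.Compiled);

    // Returns the RTF with every generator group removed.
    public static string Strip(string rtf)
    {
        return GeneratorGroup.Replace(rtf, string.Empty);
    }
}
```

For the RTF above this saves a few dozen bytes per snippet before compression.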
Your document may contain adjacent runs with the same formatting; you can join them to slightly reduce the size of the document:
https://reference.aspose.com/words/net/aspose.words/document/joinrunswithsameformatting/
In addition, there may be empty paragraphs at the end of your document. You can try removing them, which can also slightly reduce the size of the RTF document:

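// Stop at the last remaining paragraph so that LastParagraph is never null.
Body body = doc.LastSection.Body;
while (body.Paragraphs.Count > 1 && !body.LastParagraph.HasChildNodes)
    body.LastParagraph.Remove();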

Also, I have one question. Will all the snippets you need to store in the DB be small, or can there be quite large snippets? RTF is a good option for small snippets of text. However, if your documents are larger, the DOCX or DOC format will be smaller than RTF.
Best regards,

Hi Alexey,

I tried Document.JoinRunsWithSameFormatting(), but no runs were joined in my test documents. As for removing empty paragraphs, there aren’t any, as the documents I create contain only one paragraph. :-)

In fact the snippets I create will typically be small, as a snippet is typically a phrase or a title I copy out of the original Word document.

So for each text phrase I create a new Aspose.Words Document, save its content as RTF, compress it and store it in the database.

Today I changed the compression algorithm from zip to bzip2 and was able to reduce the size of the snippets by another 10%. There is still room for improvement here as there are more powerful compression algorithms available (LZMA, …).
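
To compare codecs on real snippets, a small measuring helper along the following lines may be useful. This is a sketch under some assumptions: bzip2 and LZMA are not part of the .NET base class library, so it uses the built-in DeflateStream and GZipStream as stand-ins, and the names CodecComparison, CompressedSize, DeflateSize and GZipSize are invented for the example. A third-party bzip2 or LZMA stream could be measured through the same CompressedSize method.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class CodecComparison
{
    // Writes data through the compressing stream created by wrap
    // and returns the number of compressed bytes produced.
    public static int CompressedSize(byte[] data, Func<Stream, Stream> wrap)
    {
        using (MemoryStream output = new MemoryStream())
        {
            // The codec must be closed before the size is read so that it flushes.
            using (Stream codec = wrap(output))
            {
                codec.Write(data, 0, data.Length);
            }
            return output.ToArray().Length;
        }
    }

    // Raw Deflate: compressed data without a container.
    public static int DeflateSize(byte[] data)
    {
        return CompressedSize(data, s => new DeflateStream(s, CompressionLevel.Optimal));
    }

    // GZip: the same Deflate data plus a header and a checksum trailer.
    public static int GZipSize(byte[] data)
    {
        return CompressedSize(data, s => new GZipStream(s, CompressionLevel.Optimal));
    }
}
```

For snippets of a few hundred bytes the container matters as well as the algorithm: a format with a header and trailer adds a fixed cost to every stored row, which a raw compressed stream avoids.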

Also, interestingly, I display the RTF with DevExpress’ XtraRichEdit control. Today I noticed that this control strips out the RTF tags that I don’t need for my snippets. The size of the RTF is then halved.

However, I can’t filter each snippet by channeling it through an XtraRichEdit; this wouldn’t be very efficient. But it shows that there is still a lot of room for improvement in reducing the size of the RTF.
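
A lightweight alternative to a full editor control would be to drop selected groups from the RTF string directly. The sketch below is not what XtraRichEdit does; it simply removes every group that starts with a given prefix, such as {\*\latentstyles, by counting braces. The names RtfGroupStripper and RemoveGroups are hypothetical. It assumes the RTF contains no \bin data, and whether dropping a particular group is acceptable when the snippet is later merged back into Word would need to be checked case by case.

```csharp
using System;
using System.Text;

public static class RtfGroupStripper
{
    // Removes every group that starts with groupPrefix, e.g. @"{\*\latentstyles",
    // including any groups nested inside it.
    public static string RemoveGroups(string rtf, string groupPrefix)
    {
        StringBuilder result = new StringBuilder(rtf.Length);
        int i = 0;
        while (i < rtf.Length)
        {
            if (string.CompareOrdinal(rtf, i, groupPrefix, 0, groupPrefix.Length) == 0)
            {
                // Jump over the whole group without copying it.
                i = SkipGroup(rtf, i);
            }
            else if (rtf[i] == '\\' && i + 1 < rtf.Length)
            {
                // Copy a backslash together with the next character so that
                // escaped braces (\{ and \}) are never mistaken for group starts.
                result.Append(rtf, i, 2);
                i += 2;
            }
            else
            {
                result.Append(rtf[i]);
                i++;
            }
        }
        return result.ToString();
    }

    // Returns the index just past the closing brace of the group that opens at start.
    private static int SkipGroup(string rtf, int start)
    {
        int depth = 0;
        int i = start;
        while (i < rtf.Length)
        {
            char c = rtf[i];
            if (c == '\\')
            {
                i += 2; // skip the escaped character or the first letter of a control word
                continue;
            }
            if (c == '{')
            {
                depth++;
            }
            else if (c == '}')
            {
                depth--;
                if (depth == 0)
                {
                    return i + 1;
                }
            }
            i++;
        }
        return rtf.Length; // unbalanced input: treat the rest as part of the group
    }
}
```

For example, RtfGroupStripper.RemoveGroups(rtf, @"{\*\latentstyles") removes the latent styles group that ends the RTF generated above.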

So I think I can leave this area of our prototype as is and revisit the size issue later. However, if in the future it becomes possible to get RTF from Aspose.Words that doesn’t contain all the page-related formatting tags by default, I’m interested.

Thanks a lot for your help.

Hi Christophe,

Thank you for the additional information. It is great that you managed to reduce the size using another compression algorithm.
If your document has only simple formatting and text, you can consider using HTML instead of RTF. HTML seems to me to be more compact than RTF.
Best regards,

Hi Alexey,

I already tested conversion to HTML. It is more compact, but during conversion style names are changed or new styles are created. For example, I had a style in the Word document called Red Text (notice the space). After conversion to HTML the style name became RedText and a new style RedTextChar was created. I guess this is due to constraints on how CSS styles can be named.

This means that when converting the HTML back to RTF or DOC, the style names no longer match (in my application I need to merge the edited text snippets back into the original Word document). So this isn’t an option.

This isn’t a trivial application, but at least with Aspose.Words it will be doable.

Thanks for your suggestion and best regards

Hi Christophe,

Thank you for the additional information. You are right, HTML differs greatly from MS Word formats. That is why it is difficult and sometimes impossible to preserve all of an MS Word document’s features in HTML.
Best regards,

The issues you have found earlier (filed as WORDSNET-3291) have been fixed in this .NET update and this Java update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.