Parse RTF code in a Word document

Hi,

We have a Word document that has a block of RTF code in between normal Word content. We need a way to read this RTF code using Aspose.Words and convert it to normal Word content.

Please find the attached document that has the block of rtf code.

Thanks,

Hari

Hi Hari,

Thanks for your inquiry. Please use the following code snippet to achieve your requirements. Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "Test13.rtf");
String strRTF = "";
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.ToString(SaveFormat.Text).Trim().StartsWith(@"{\rtf1"))
        strRTF += para.ToString(SaveFormat.Text).Trim();
}
if (strRTF != "")
{
    Document rtfDoc = RtfStringToDocument(strRTF);
    rtfDoc.Save(MyDir + "Out.docx", SaveFormat.Docx);
}
private static Document RtfStringToDocument(string rtf)
{
    Document doc = null;
    // Convert RTF string to byte array.
    byte[] rtfBytes = Encoding.UTF8.GetBytes(rtf);
    // Create stream.
    using (MemoryStream rtfStream = new MemoryStream(rtfBytes))
    {
        // Open document from stream.
        doc = new Document(rtfStream);
    }
    return doc;
}

Thank you. This works for me. But I have a problem with this solution.

This solution removes all the existing content from my document, so the final document contains only the RTF string content. That’s not what I wanted.

My document will have a lot of default content, and the RTF string sits in between that default content.

Please refer to the attached document.

Hi Hari,

Thanks for your inquiry. Please use the following code snippet to achieve your requirements. Please check the detail of InsertDocument from here:

https://docs.aspose.com/words/java/insert-and-append-documents/

The following code example converts each paragraph that starts with {\rtf1 into a document and inserts that document in place of the paragraph.

Document doc = new Document(MyDir + "New+Text+Document.rtf");
// Iterate over a snapshot (ToArray) because paragraphs are removed inside the loop.
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true).ToArray())
{
    if (para.ToString(SaveFormat.Text).Trim().StartsWith(@"{\rtf1"))
    {
        Document rtfDoc = RtfStringToDocument(para.ToString(SaveFormat.Text));
        InsertDocument(para, rtfDoc);
        para.Remove();
    }
}
doc.Save(MyDir + "Out.docx", SaveFormat.Docx);
private static Document RtfStringToDocument(string rtf)
{
    Document doc = null;
    // Convert RTF string to byte array.
    byte[] rtfBytes = Encoding.UTF8.GetBytes(rtf);
    // Create stream.
    using (MemoryStream rtfStream = new MemoryStream(rtfBytes))
    {
        // Open document from stream.
        doc = new Document(rtfStream);
    }
    return doc;
}

The above code only works when the RTF contents are in one paragraph. I suggest enclosing the RTF contents in a bookmark. Once the contents are inside a bookmark, you can easily extract them and convert them into a document. Please see the following documentation link for reference.

https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

Once you have rtf string, you can convert it to Rtf document and insert it at any location in the final output document.
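The snippets in this thread call InsertDocument without showing it. Below is a minimal sketch of such a helper, modeled on the sample from the documentation page linked above. The class name DocumentInserter and the optional importMode parameter are assumptions added here so that the caller can choose the ImportFormatMode; the code requires a reference to Aspose.Words.

```csharp
using System;
using Aspose.Words;

public static class DocumentInserter
{
    // Inserts the body content of srcDoc after insertAfterNode.
    // importMode is an illustrative parameter, not part of the original replies.
    public static void InsertDocument(Node insertAfterNode, Document srcDoc,
        ImportFormatMode importMode = ImportFormatMode.KeepSourceFormatting)
    {
        // The reference node must be a block-level node.
        if (insertAfterNode.NodeType != NodeType.Paragraph &&
            insertAfterNode.NodeType != NodeType.Table)
            throw new ArgumentException("The destination node should be either a paragraph or a table.");

        // New nodes are inserted into the parent of the reference node.
        CompositeNode dstStory = insertAfterNode.ParentNode;

        // The importer translates styles and lists between the two documents.
        NodeImporter importer = new NodeImporter(srcDoc, insertAfterNode.Document, importMode);

        foreach (Section srcSection in srcDoc.Sections)
        {
            foreach (Node srcNode in srcSection.Body)
            {
                // Skip the empty paragraph that terminates a section.
                if (srcNode.NodeType == NodeType.Paragraph)
                {
                    Paragraph para = (Paragraph)srcNode;
                    if (para.IsEndOfSection && !para.HasChildNodes)
                        continue;
                }

                // Clone the node into the destination document and place it
                // after the previously inserted node to preserve the order.
                Node newNode = importer.ImportNode(srcNode, true);
                dstStory.InsertAfter(newNode, insertAfterNode);
                insertAfterNode = newNode;
            }
        }
    }
}
```

With this sketch, a two-argument call such as InsertDocument(para, rtfDoc) keeps the source formatting, and passing ImportFormatMode.UseDestinationStyles as the third argument applies the destination styles instead. Note that each imported block is inserted as a new paragraph or table after the reference node, not inline within it.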

Thank you very much. This is exactly what we wanted. We will contact you again if required.

Hi Hari,

Thanks for your feedback. Please feel free to ask if you have any question about Aspose.Words, we will be happy to help you.

Hi,

After converting the RTF string to document content and inserting the new paragraph into the source document in place of the original RTF text, the new paragraph is not aligned correctly with respect to the previous paragraph that held the RTF text.

How can I retain the same paragraph format for the new paragraph as for the old one?

If it is not possible to retain the paragraph formatting, then please let me know how I can apply the margins below to the new paragraph that is inserted as document content.

//apply the margins
if (!para.IsListItem && para.GetAncestor(NodeType.Table) == null)
{
    para.ParagraphFormat.LeftIndent = 0;
    para.ParagraphFormat.FirstLineIndent = 0;
    para.ParagraphFormat.SpaceBeforeAuto = false;
    para.ParagraphFormat.SpaceAfterAuto = false;
}

Hi Hari,

Thanks for your inquiry. Please note that Aspose.Words mimics the same behavior as MS Word. In your case, the RTF string is converted into document and then inserted into target document. If you insert the same list (rtf string) into document by using MS Word, you will get the same output. I suggest you please read the difference between ImportFormat modes and ‘Controlling How Lists are Handled’.

https://docs.aspose.com/words/net/insert-and-append-documents/

https://docs.aspose.com/words/net/working-with-lists/

Yes, you can format the list items of an MS Word document according to your requirements. In your test document, there are some empty paragraphs before the RTF string with different left indentation, and the first paragraph of the document is a list item. If your original document has the same paragraph structure, you can use the following code snippet to set the left indentation of the RTF document (RTF string).

The following code snippet inserts the bookmark ‘rtf’ before and after the RTF. After inserting the RTF document, the code iterates through all nodes inside the bookmark ‘rtf’ and sets the LeftIndent of each paragraph that is not a list item. Hope this helps you.

If you still face problems, please share your original input and expected output documents. Please manually create your expected Word document using Microsoft Word and attach it here for our reference. We will investigate how you want your final Word output to be generated and will then provide more information along with code.

Document doc = new Document(MyDir + "New+Text+Document.rtf");
DocumentBuilder builder = new DocumentBuilder(doc);
Paragraph ListFormatParagraph = null;
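// Iterate over a snapshot (ToArray) because paragraphs are inserted and removed inside the loop.
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true).ToArray())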
{
    if (para.ListFormat.IsListItem)
        ListFormatParagraph = para;
    if (para.ToString(SaveFormat.Text).Trim().StartsWith(@"{\rtf1"))
    {
        Paragraph p = (Paragraph)para.PreviousSibling;
        builder.MoveTo(p);
        builder.StartBookmark("rtf");
        Document rtfDoc = RtfStringToDocument(para.ToString(SaveFormat.Text));
        InsertDocument(p, rtfDoc);
        builder.MoveTo(para);
        builder.Writeln("");
        builder.EndBookmark("rtf");
        para.Remove();
    }
}
if (ListFormatParagraph != null)
{
    Node currentNode = doc.Range.Bookmarks["rtf"].BookmarkStart;
    while (currentNode != doc.Range.Bookmarks["rtf"].BookmarkEnd)
    {
        currentNode = currentNode.NextPreOrder(doc);
        if (currentNode.NodeType == NodeType.Paragraph)
        {
            Paragraph paragraph = (Paragraph)currentNode;
            if (!paragraph.IsListItem && paragraph.GetAncestor(NodeType.Table) == null)
            {
                paragraph.ParagraphFormat.LeftIndent += ListFormatParagraph.ParagraphFormat.LeftIndent;
                paragraph.ParagraphFormat.FirstLineIndent = 0;
                paragraph.ParagraphFormat.SpaceBeforeAuto = false;
                paragraph.ParagraphFormat.SpaceAfterAuto = false;
            }
        }
    }
}
doc.Save(MyDir + "Out.docx", SaveFormat.Docx);

Hi Manzoor,

Thank you very much for your detailed response. We have the input rtf texts in many formats. So we can’t generalize the paragraph formatting for all rtf texts. So, I couldn’t use the above code with DocumentBuilder and paragraph formatting.

I am still using the solution provided by you on 08-22-2013, 10:15 AM

As suggested I worked on creating the sample input and output file. Please find attached the Input_RTF_Text.doc and Expected_DocContent_Output.doc file.

If we can just convert and display the RTF text in document format while maintaining the alignment, that’s enough. We don’t have to maintain the list numbering or highlighting. Please refer to the attached expected output document.

Please let me know how we can convert the RTF text so that it matches the expected output.

Hi Hari,

Thanks for your inquiry.

I have used the same code example shared at this post to convert RTF contents into document and have not found any issue with output file. I have used the ImportFormatMode as UseDestinationStyles in InsertDocument method. Please find the output Docx with this post for your kind reference. The output document is same as expected document which you have shared.

It would be great if you please share some more detail about your query. We will then provide you more information about your query along with code.

Hi Manzoor,

I changed the ImportFormatMode to UseDestinationStyles in the InsertDocument method and generated the output. In fact, I get the same results for both of the available ImportFormatMode options.

  1. The only issue that I am seeing is the alignment of the paragraph’s starting text, i.e. the first line. The first line is not in line with the other lines. Please see the attached AsposeIssueDescription.png and Output_Aspose.doc for details. Please let me know how I can correct this issue. I am using all the code suggested in this post.

  2. My second query is: sometimes we will have other document text before the RTF text content within the same paragraph. We need to retain this text within the paragraph as-is and replace only the RTF text content with the converted paragraph. Please find the attached input_RtfText.doc and expected_DocContent.doc.

Please let me know the best way to alter the code that you provided in this post to achieve this requirement.

Hi Hari,

Thanks for your inquiry.

fmr patel:

1. The only issue that I am seeing is the alignment of the paragraph’s starting text, i.e. the first line. The first line is not in line with the other lines. Please see the attached AsposeIssueDescription.png and Output_Aspose.doc for details. Please let me know how I can correct this issue. I am using all the code suggested in this post.

Please note that the RTF strings in your document start with spaces. You can check this by converting the RTF string to normal MS Word content using MS Word. However, you can remove these spaces by using the following code snippet.

The Run class represents a run of characters with the same font formatting. All text of the document is stored in runs of text.

Document doc = new Document(MyDir + "Input_RTF_Text.doc");
// Iterate over snapshots (ToArray) because nodes are removed inside the loops.
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true).ToArray())
{
    if (para.ToString(SaveFormat.Text).Trim().StartsWith(@"{\rtf1"))
    {
        Document rtfDoc = RtfStringToDocument(para.ToString(SaveFormat.Text).Trim());
        foreach (Run run in rtfDoc.FirstSection.Body.Paragraphs[0].GetChildNodes(NodeType.Run, true).ToArray())
        {
            if (run.Text.Trim() == "")
            {
                // Drop leading runs that contain only whitespace.
                run.Remove();
            }
            else
            {
                // First run with visible text: strip its leading spaces and stop.
                run.Text = run.Text.TrimStart();
                break;
            }
        }
        InsertDocument(para, rtfDoc);
        para.Remove();
    }
}
doc.Save(MyDir + "Out.docx", SaveFormat.Docx);

fmr patel:

2. My second query is: sometimes we will have other document text before the RTF text content within the same paragraph. We need to retain this text within the paragraph as-is and replace only the RTF text content with the converted paragraph. Please find the attached input_RtfText.doc and expected_DocContent.doc.

I am working over your query and will update you asap.

Hi there,

Thanks for your patience.

fmr patel:

2. My second query is: sometimes we will have other document text before the RTF text content within the same paragraph. We need to retain this text within the paragraph as-is and replace only the RTF text content with the converted paragraph. Please find the attached input_RtfText.doc and expected_DocContent.doc.

Please use the following code snippet to achieve your requirements. Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "input_RtfText.doc");
DocumentBuilder builder = new DocumentBuilder(doc);
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true).ToArray())
{
    if (para.ToString(SaveFormat.Text).Trim().Contains(@"{\rtf1"))
    {
        Run startNode = null;
        String rtfText = "";
        foreach (Run run in para.Runs.ToArray())
        {
            if (run.ToString(SaveFormat.Text).Trim().StartsWith(@"{\rtf1"))
            {
                startNode = run;
                rtfText += run.Text;
                continue;
            }
            if (run.ToString(SaveFormat.Text).Trim().EndsWith(@"\par }"))
            {
                rtfText += run.Text;
                run.Remove();
                break;
            }
            if (rtfText != "")
            {
                rtfText += run.Text;
                run.Remove();
            }
        }
        Document rtfDoc = RtfStringToDocument(rtfText.Trim());
        InsertDocument(para, rtfDoc);
        // startNode stays null when no run starts with {\rtf1, so guard the call.
        if (startNode != null)
            startNode.Remove();
    }
}
doc.Save(MyDir + "Out.docx", SaveFormat.Docx);
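The snippet above recognises the end of the RTF block by testing for a run that ends with "\par }". As an alternative sketch (not part of the original replies), the RTF group can be located by counting braces in the paragraph text, which also gives the position of any plain text in front of it. The class and method names are hypothetical, only plain C# string handling is used, and raw binary data embedded with \bin is not handled.

```csharp
using System;

public static class RtfSpanFinder
{
    // Returns the balanced RTF group that starts at the first "{\rtf" in text,
    // or null when no complete group is found. start receives the index of the
    // opening brace (or -1 when "{\rtf" does not occur).
    public static string ExtractRtfGroup(string text, out int start)
    {
        start = text.IndexOf(@"{\rtf", StringComparison.Ordinal);
        if (start < 0)
            return null;

        int depth = 0;
        for (int i = start; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '\\')
            {
                // Skip the character after a backslash so that the escaped
                // braces \{ and \} are not counted as group delimiters.
                i++;
            }
            else if (c == '{')
            {
                depth++;
            }
            else if (c == '}')
            {
                depth--;
                if (depth == 0)
                    return text.Substring(start, i - start + 1);
            }
        }

        // Unbalanced: the group is not closed within this text.
        return null;
    }
}
```

The returned string can be passed to RtfStringToDocument, and text.Substring(0, start) is the plain text that precedes the RTF block in the same paragraph.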

Hi Manzoor,

Thank you very much for the code for my second query. The only issue that I see here is that the converted RTF text is always added on a new line as a new paragraph. It needs to be added in the same place as the original RTF text. Please see the expected_DocContent.doc that I sent in my previous post.

I have also attached the current output Output_Aspose1.doc and the paragraph_Issue.png file with issue description.

Please let me know how can we insert the converted rtf text in the same place as it was before.

Hi there,

Thanks for your inquiry. I have modified the code according to your requirements. In this case, you need to join current Paragraph with newly inserted Paragraph (rtf contents). Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "input_RtfText.doc");
DocumentBuilder builder = new DocumentBuilder(doc);
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true).ToArray())
{
    if (para.ToString(SaveFormat.Text).Trim().Contains(@"{\rtf1"))
    {
        Run startNode = null;
        String rtfText = "";
        foreach (Run run in para.Runs.ToArray())
        {
            if (run.ToString(SaveFormat.Text).Trim().StartsWith(@"{\rtf1"))
            {
                startNode = run;
                rtfText += run.Text;
                continue;
            }
            if (run.ToString(SaveFormat.Text).Trim().EndsWith(@"\par }"))
            {
                rtfText += run.Text;
                run.Remove();
                break;
            }
            if (rtfText != "")
            {
                rtfText += run.Text;
                run.Remove();
            }
        }
        Document rtfDoc = RtfStringToDocument(rtfText.Trim());
        InsertDocument(para, rtfDoc);
        // startNode stays null when no run starts with {\rtf1, so guard the call.
        if (startNode != null)
            startNode.Remove();
        if (para.NextSibling != null)
        {
            Paragraph nextPara = (Paragraph)para.NextSibling;
            // Move all content from the nextPara paragraph into the first.
            while (nextPara.HasChildNodes)
                para.AppendChild(nextPara.FirstChild);
            nextPara.Remove();
        }
    }
}
doc.Save(MyDir + "Out.docx", SaveFormat.Docx);

Hi,

I am seeing the formatting differences when I convert the rtf content to document content. I am using the code suggested in the previous post (496705 in reply to 496546) only.

Please see the attached Rtf_to_WordConversion.zip file which has the following files.

Input_File.rtf - has the rtf content

Aspose_Output.doc → The current output that I am getting

Expected_Ouput.doc → As per the RTF content (which was previously converted from Word to RTF), the output should look like this…

Please analyze the issue and assist.

Hi there,

Thanks for your inquiry. Please note that Aspose.Words mimics the same behavior as MS Word does. If you insert the same RTF document into Input_File.rtf by using MS Word, you will get the same output. However, I have found the paragraph space after issue in output document. You can set it by using ParagraphFormat.SpaceAfter. Please check the following code snippet.

..............
..............
Document rtfDoc = RtfStringToDocument(rtfText.Trim());
foreach (Paragraph rtfPara in rtfDoc.GetChildNodes(NodeType.Paragraph, true))
{
    rtfPara.ParagraphFormat.SpaceAfter = 0;
}
InsertDocument(para, rtfDoc);
if (startNode != null)
    startNode.Remove();
...............

Hi,

I want to remove the newline character or line break from the paragraph that is passed to the InsertDocument(para, rtfDoc) method. With this we can insert the converted content at exactly the same place as the original RTF content. How can I achieve it?

Right now, it is always inserting the converted document text on a new line. Because of this we are seeing a lot of formatting issues. You can use the same sample files that I sent in the previous post. I want the converted document text to start at exactly the same location as {\rtf1.

I tried the commented-out code below, but it is not working because the run is not picking up the \r characters. Also, let me know if I can pass the startNode to the InsertDocument(…) method instead of the para and get the content inserted in the right place. In that case I would need to modify InsertDocument.

foreach (Paragraph para in _pAsposeDocument.GetChildNodes(NodeType.Paragraph, true).ToArray())
{
    if (para.GetText().Contains(@"{\rtf1"))
    {
        Run startNode = null;
        String rtfText = "";
        foreach (Run run in para.Runs.ToArray())
        {
            if (startNode == null)
            {
                if (run.Text.StartsWith(@"{\rtf1"))
                {
                    startNode = run;
                    rtfText += run.Text;
                    continue;
                }
            }
            else
            {
                rtfText += run.Text;
                if (run.Text.Trim().EndsWith(@"\par }"))
                {
                    run.Remove();
                    break;
                }
                else
                    run.Remove();
            }
        }
        Document rtfDoc = RtfStringToDocument(rtfText.Trim());
        //remove the new line char if exists in the para.This is not working
        //foreach (Run run in para.Runs.ToArray())
        //{
        // if (run.Text.EndsWith("\\r"))
        // run.Text = run.Text.Replace("\\r", "");
        //} 
        InsertDocument(para, rtfDoc);
        if (startNode != null)
            startNode.Remove();
    }
}

Hi there,

Thanks for your inquiry.

fmr patel:

I want to remove the newline character or line break from the paragraph that is passed to the InsertDocument(para, rtfDoc) method. With this we can insert the converted content at exactly the same place as the original RTF content. How can I achieve it?
Right now, it is always inserting the converted document text on a new line.

Please use the following code snippet to replace the “\r” control character with an empty string. I think your query is related to the paragraph break. InsertDocument inserts the contents after a specific paragraph. In this case, you need to remove the paragraph that contains the RTF contents.

…
… 
Document rtfDoc = RtfStringToDocument(rtfText.Trim());
foreach (Run run in para.Runs.ToArray())
{
    if (run.Text.Contains(ControlChar.Cr))
        run.Text = run.Text.Replace(ControlChar.Cr, "");
}
InsertDocument(para, rtfDoc);

fmr patel:

Because of this we are seeing a lot of formatting issues. You can use the same sample files that I sent in the previous post. I want the converted document text to start at exactly the same location as {\rtf1. I tried the commented-out code below, but it is not working because the run is not picking up the \r characters.

It would be great if you could share some more detail about this issue, as I have not found any problem with the output document. The extracted RTF contents are inserted at the position of the same paragraph. Please share some detail about the line break issue along with a screenshot of the problematic area in the output document.

fmr patel:

Also, let me know if I can pass the startNode to the InsertDocument(…) method instead of the para and get the content inserted in the right place. In that case I would need to modify InsertDocument.

The first parameter of InsertDocument method should be paragraph or table. However, you can use the same technique to import nodes into the document from another document.

I suggest using ImportFormatMode.KeepSourceFormatting in InsertDocument. I have used the following code snippet to generate the final document. I have attached the output document and the extracted RTF files (created using MS Word) with this post.

Note : Please try to insert the attached extracted RTF files at the position of {\rtf1 in input RTF using MS Word and check the MS Word behavior.

Document doc = new Document(MyDir + "Input_File.rtf");
DocumentBuilder builder = new DocumentBuilder(doc);
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true).ToArray())
{
    if (para.ToString(SaveFormat.Text).Trim().Contains(@"{\rtf1"))
    {
        Run startNode = null;
        String rtfText = "";
        foreach (Run run in para.Runs.ToArray())
        {
            if (run.ToString(SaveFormat.Text).Trim().StartsWith(@"{\"))
            {
                startNode = run;
                rtfText += run.Text;
                continue;
            }
            if (run.ToString(SaveFormat.Text).Trim().EndsWith(@"\par }"))
            {
                rtfText += run.Text;
                run.Remove();
                break;
            }
            if (rtfText != "")
            {
                rtfText += run.Text;
                run.Remove();
            }
        }
        Document rtfDoc = RtfStringToDocument(rtfText.Trim());
        foreach (Run run in para.Runs.ToArray())
        {
            if (run.Text.Contains(ControlChar.Cr))
                run.Text = run.Text.Replace(ControlChar.Cr, "");
        }
        InsertDocument(para, rtfDoc); // use ImportFormatMode.KeepSourceFormatting in InsertDocument
        if (startNode != null)
            startNode.Remove();
        para.Remove();
    }
}
doc.Save(MyDir + "Out.docx", SaveFormat.Docx);

Hi Tahir,

Thanks for your response. First of all, the code below to remove the paragraph break is not working. run.Text never contains ControlChar.Cr for any of the runs…

foreach (Run run in para.Runs.ToArray())
{
    if (run.Text.Contains(ControlChar.Cr))
        run.Text = run.Text.Replace(ControlChar.Cr, "");
}

Also, we need to remove or comment out the line below, as we are already deleting the startNode.

para.Remove();

I tested it by having non-RTF text before the RTF content in the same paragraph. The converted rtfDoc still got inserted on a new line. See the attached Rtf_to_WordConversion_v1.zip. I have updated the comments in Aspose_Output.docx. Please compare this output with Expected_Output.doc.

eg:

  1. rtf text:

Testing-2:{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0 Times New Roman;}{\f1\fnil\fcharset0 Times New Roman;}} \viewkind4\uc1\pard\sa240\b\i\f0\fs20 The following replaces Section 11.02 of the Basic Sunday Document:\par \pard\ul\i0 Late Wednesday\ulnone . \b0 If a Tuesday continues in employment as an Employee after his Normal Wednesday Age, he shall continue to have a 100 Jan vested interest in his Mar and shall continue to participate in the Sunday until his Severance Date. A Tuesday may not elect to receive a distribution of his Mar until his severance from employment in accordance with Articles 12 and 13.\f1\par }

  1. Current Output:

Testing-2:

The following replaces Section 11.02 of the Basic Sunday Document:

Late Wednesday. If a Tuesday continues in employment as an Employee after his Normal Wednesday Age, he shall continue to have a 100 Jan vested interest in his Mar and shall continue to participate in the Sunday until his Severance Date. A Tuesday may not elect to receive a distribution of his Mar until his severance from employment in accordance with Articles 12 and 13.

  1. Expected Output:

Testing-2:The following replaces Section 11.02 of the Basic Sunday Document:

Late Wednesday. If a Tuesday continues in employment as an Employee after his Normal Wednesday Age, he shall continue to have a 100 Jan vested interest in his Mar and shall continue to participate in the Sunday until his Severance Date. A Tuesday may not elect to receive a distribution of his Mar until his severance from employment in accordance with Articles 12 and 13.

I was expecting that the suggested solution of removing the paragraph break would insert the converted rtfDoc exactly at the start of the “{\rtf1” text after “Testing-2:”, but it didn’t.

I am able to resolve the above issue by using the code highlighted in yellow in CodeSample.docx.

Let me know if you have any alternative solution to handle the scenarios explained above.

However, I am still having the formatting issues mentioned in the comments in Aspose_Ouput.doc.