RTF bullet list to numbered list via stream

I would like to take a text string that contains an RTF bulleted list and convert it to a numbered list, starting at any number I want (not always #1), and then insert the resulting list into a Word document after an existing numbered list.

For example…
Existing Word document:
This is a paragraph of words and sentences.

  1. This is numbered list starting at 1 somewhere, maybe even on page 4
  2. This is number 2

RTF text string would look like this (in RTF with {\rtf1\ansi… etc…):

  • Bullet item
  • Another bullet item
  • Third bullet item

Resulting Word document will look like this:
This is a paragraph of words and sentences.

  1. This is numbered list starting at 1 somewhere, maybe even on page 4
  2. This is number 2
  3. Bullet item
  4. Another bullet item
  5. Third bullet item

I will know at runtime where to start numbering (in the case above it starts at #3).
The RTF data itself is not stored in a .RTF file. It’s stored in a database, so I can retrieve it as a text string (or binary).
The Word document is an actual Word .DOCX file and I will know where to start inserting the list.

@deisenberg You can easily achieve this using ImportFormatOptions.MergePastedLists and DocumentBuilder.InsertDocument method. For example see the following simple code:

Document doc = new Document(@"C:\Temp\in.docx");
Document src = new Document(@"C:\Temp\src.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToDocumentEnd();
builder.Writeln();

ImportFormatOptions options = new ImportFormatOptions();
options.MergePastedLists = true;

builder.InsertDocument(src, ImportFormatMode.UseDestinationStyles, options);

builder.ListFormat.RemoveNumbers();

doc.Save(@"C:\Temp\out.docx");

in.docx (13.2 KB) src.docx (13.6 KB) out.docx (11.1 KB)

In your case, you simply load your source document from the RTF string. You can use code like this:

private static Document RtfStringToDocument(string rtf)
{
    byte[] rtfBytes = Encoding.UTF8.GetBytes(rtf);
    using (MemoryStream rtfStream = new MemoryStream(rtfBytes))
        return new Document(rtfStream);
}
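
Putting the two snippets above together for an insertion in the middle of a document, a minimal sketch might look like the following. The class name RtfListInserter and the method InsertRtfAtBookmark are hypothetical names used only for illustration; the Aspose.Words calls are the ones already shown in this thread, plus MoveToBookmark. It assumes the destination document contains a bookmark at the insertion point.

```csharp
using System;
using System.IO;
using System.Text;
using Aspose.Words;

public static class RtfListInserter
{
    // Hypothetical helper: loads an RTF string into a Document and inserts it
    // at the given bookmark, asking Aspose.Words to merge the pasted list
    // with the list at the insertion point.
    public static void InsertRtfAtBookmark(Document dst, string bookmarkName, string rtf)
    {
        Document src;
        using (MemoryStream rtfStream = new MemoryStream(Encoding.UTF8.GetBytes(rtf)))
            src = new Document(rtfStream);

        DocumentBuilder builder = new DocumentBuilder(dst);

        // MoveToBookmark returns false when the bookmark does not exist.
        if (!builder.MoveToBookmark(bookmarkName))
            throw new ArgumentException("Bookmark not found: " + bookmarkName);

        ImportFormatOptions options = new ImportFormatOptions();
        options.MergePastedLists = true;

        builder.InsertDocument(src, ImportFormatMode.UseDestinationStyles, options);
    }
}
```

Whether the numbering continues as expected depends on where the bookmark sits relative to the existing list, so it is worth checking the result against your own template.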

This does not appear to remove the bullets and begin numbering. It would need to remove the bullets and begin numbering starting at #3

@deisenberg Yes, the code merges the pasted lists, so the numbering is continued from the list in the destination document. Please see the output document I attached in my previous answer.
Could you please attach your input document, RTF string and expected output? We will check the issue and provide you more information.

I now see the sample docs you provided. I will give this a try, thanks. It won’t be located at the end of in.docx, it would be in the middle, and there may be several instances… each requiring another list found in RTF format.

The sample code does not number the bulleted list. It simply appends the bulleted list to the end of the numbered list.
I have a Word doc with a numbered list. There is a bookmark at the very end immediately following the numbered list.
The code extract below works by adding the RTF bulleted list to the end of the numbered list, but does not number it.

This is the code:

DocumentBuilder builder = new DocumentBuilder(MastDocument);
builder.MoveToBookmark(BookmarkName, false, true);
Document rtfDoc = new Aspose.Words.Document(RTFStream);
builder.InsertDocument(rtfDoc, ImportFormatMode.UseDestinationStyles, RTFNumberedOptions);
builder.ListFormat.RemoveNumbers();

saveDocHere

I can get it to work by placing the Bookmark at the next numbered item in the source document:

  1. This is item 1
  2. Item 2
  3. [Bookmark here]

Inserting using the code above will remove the bullets and begin numbering at #3, #4, #5, etc. correctly.

However, this does not work:

  1. Item 1
  2. Item 2
    a. [Bookmark 1 here]
  3. [Bookmark 2 here]

I want to insert bulleted items starting with (a.) at location [Bookmark 1] and starting with (3.) at [Bookmark 2]. But the command RemoveNumbers() does not behave properly. It changes (a.) into (3.) and inserts everything at that sublevel.

Basically, inserting a bulleted list at a bookmark position should continue the numbering from the parent object. If it is inserted at number 3 then each bullet inserted should start at 3, then 4, 5, etc. If it is inserted at 3.a then each bullet should be 3.a, 3.b, 3.c, etc.
All inserted items should also keep the parent’s font and paragraph formatting: the number color and font color, the spacing before/after, the font name, size, etc. I am finding that the inserted items keep the formatting of the bulleted source items, not the formatting of the destination/parent where they end up.

@deisenberg If your list items are simple text and you need to use the destination document formatting, maybe you should insert the list items as simple text. For example see the following code:

Document doc = new Document(@"C:\Temp\in.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

InsertListItems(builder, new Document(@"C:\Temp\src.docx"), "level1");
InsertListItems(builder, new Document(@"C:\Temp\src.docx"), "level2");
InsertListItems(builder, new Document(@"C:\Temp\src.docx"), "level3");

doc.Save(@"C:\Temp\out.docx");

private static void InsertListItems(DocumentBuilder builder, Document src, string bookmark)
{
    builder.MoveToBookmark(bookmark);

    // Insert only text content from the source document paragraphs.
    // In this case formatting will be inherited from the formatting applied to the current builder position.
    NodeCollection paragraphs = src.GetChildNodes(NodeType.Paragraph, true);
    foreach (Paragraph p in paragraphs)
    {
        string pContent = p.ToString(SaveFormat.Text).Trim();

        if (paragraphs.IndexOf(p) == (paragraphs.Count - 1))
            builder.Write(pContent);
        else
            builder.Writeln(pContent);
    }
}

in.docx (13.4 KB) src.docx (13.6 KB) out.docx (10.8 KB)

When you use DocumentBuilder.Write or DocumentBuilder.Writeln, the formatting applied to the current DocumentBuilder position will be applied to the inserted content.

Thank you, this is a MUCH better way to solve inserting text at a location from a source document. It keeps destination formatting, it keeps numbering, it’s perfect and does not require RemoveNumbers() or any other strange calls to fix formatting.
Now, if I can only get our users to stop adding blank lines it would look perfect!

What is the difference between:
paragraph.ToString(SaveFormat.Text).Trim()
and
paragraph.GetText().Trim()

I need to know two things:

  1. Is the line blank and should be removed? This will prevent blank numbered lines. Users have a bad habit of pressing Enter a few times at the end of paragraphs.
  2. Is the src paragraph a bulleted item or not? If it is a bulleted item then it should use builder.Write to insert it as a numbered item. Otherwise it should be a standalone un-numbered paragraph at the end of the numbers. Users create a list of bulleted items which should be numbered, but the last non-bullet paragraph should be inserted at the end of the list as a standard paragraph, not numbered. Is there a paragraph.isBulleted or something?

@deisenberg
paragraph.ToString(SaveFormat.Text) returns the paragraph's visible text, just as if you converted the document to TXT format. paragraph.GetText() returns text that includes field codes. For example, see the attached sample document: in.docx (12.3 KB). It contains a hyperlink.

Document doc = new Document(@"C:\Temp\in.docx");
Paragraph p = doc.FirstSection.Body.FirstParagraph;
Console.WriteLine(p.GetText().Trim());
Console.WriteLine(p.ToString(SaveFormat.Text).Trim());

This code will return:

‼ HYPERLINK "https://www.aspose.com" ¶test§
test

As you can see, paragraph.GetText() returns both the field code and the field value, while paragraph.ToString(SaveFormat.Text) returns only the field value (the displayed text).

  1. You can easily skip empty paragraphs by checking the Paragraph.HasChildNodes property or by simply skipping paragraphs with empty text.

  2. You can use the Paragraph.IsListItem property to check whether a paragraph is a list item. If you encounter a paragraph that is not a list item, you can call DocumentBuilder.ListFormat.RemoveNumbers() and then use the DocumentBuilder.Write or DocumentBuilder.Writeln methods to insert it as a simple paragraph.

So, if there is HTML code like a hyperlink stored in RTF data then I should be using p.GetText() so it is displayed properly on the final document, correct?

All the other advice you have given is working correctly.
It should be noted that this code:
if (paragraphs.IndexOf(p) == (paragraphs.Count - 1))
will only work if there are no blank or trailing blank paragraphs. Otherwise, paragraphs.Count also includes the blank paragraphs you skip, so the check for the last item no longer matches the last paragraph you actually insert.
My solution was to first count all non-blank paragraphs and use that count instead.
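
A minimal sketch of that idea: collect the visible text of the non-blank paragraphs first, then decide between Write and Writeln by position in the filtered list. The class ListItemText and its two methods are hypothetical names for illustration, and "blank" is assumed here to mean that the trimmed visible text is empty.

```csharp
using System.Collections.Generic;
using System.Linq;
using Aspose.Words;

public static class ListItemText
{
    // Returns the trimmed visible text of every non-blank paragraph in src.
    public static List<string> GetNonBlankParagraphTexts(Document src)
    {
        return src.GetChildNodes(NodeType.Paragraph, true)
            .Cast<Paragraph>()
            .Select(p => p.ToString(SaveFormat.Text).Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }

    // Writes each text at the builder position. Only the last item is written
    // without a paragraph break, so no empty numbered item is left behind.
    public static void InsertTexts(DocumentBuilder builder, IList<string> texts)
    {
        for (int i = 0; i < texts.Count; i++)
        {
            if (i < texts.Count - 1)
                builder.Writeln(texts[i]);
            else
                builder.Write(texts[i]);
        }
    }
}
```

Because the count is taken after filtering, trailing blank paragraphs in the source no longer affect which item is treated as the last one.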

@deisenberg

No, to preserve fields it is not enough to use p.GetText(); you should copy the actual nodes from the source document. You can modify the InsertListItems method like this:

private static void InsertListItems(DocumentBuilder builder, Document src, string bookmark)
{
    builder.MoveToBookmark(bookmark);

    // Import the child nodes of each non-empty source paragraph (this preserves fields such as hyperlinks).
    // Cast/Where/ToList require "using System.Linq;" and List<T> requires "using System.Collections.Generic;".
    List<Paragraph> paragraphs = src.GetChildNodes(NodeType.Paragraph, true)
        .Cast<Paragraph>().Where(p => p.HasChildNodes).ToList();

    foreach (Paragraph p in paragraphs)
    {
        foreach (Node child in p.ChildNodes)
        {
            Node dstNode = builder.Document.ImportNode(child, true, ImportFormatMode.UseDestinationStyles);

            // Clear formatting if the node is inline.
            Inline inline = dstNode as Inline;
            if (inline != null)
                inline.Font.ClearFormatting();

            // Put the nodes into the current paragraph.
            builder.InsertNode(dstNode);
        }

        // Insert a paragraph break if required.
        if (paragraphs.IndexOf(p) < (paragraphs.Count-1))
            builder.Writeln();
    }
}

I have modified the code to filter out empty paragraphs.

Why clear formatting for inline nodes? That removes any Bold/Italic/Etc. within the destination document that may be used at that location. Is this necessary for HTML links or something else?

@deisenberg The formatting is cleared according to your requirement that inserted items keep the parent’s font and paragraph formatting.

If you need to keep formatting applied to inline nodes in your source document, you can remove the following lines of code:

// Clear formatting if the node is inline.
Inline inline = dstNode as Inline;
if (inline != null)
    inline.Font.ClearFormatting();

What I meant by “parent doc” is in.docx which is used as a template for inserting values from src.docx in order to create out.docx.
Managers create in.docx in the styling they choose.
Employees/users create src.docx as data they wish to insert, including hyperlinks.
System merges them and creates out.docx as a final document containing all the user’s data inserted into the template, keeping template styling created by managers with data created by employees.
See attached samples.
src.docx (14.1 KB)
out.docx (14.7 KB)
in.docx (13.8 KB)

@deisenberg You can modify the code to apply the formatting from the current position of DocumentBuilder. For example see the following code:

private static void InsertListItems(DocumentBuilder builder, Document src, string bookmark)
{
    builder.MoveToBookmark(bookmark);

    // Import the child nodes of each non-empty source paragraph.
    // Formatting of inline nodes is then taken from the current builder position (see ApplyCurrentNodeFormatting).
    List<Paragraph> paragraphs = src.GetChildNodes(NodeType.Paragraph, true)
        .Cast<Paragraph>().Where(p => p.HasChildNodes).ToList();

    foreach (Paragraph p in paragraphs)
    {
        // Disable numbering if the paragraph is not a list item.
        if (!p.IsListItem)
            builder.ListFormat.RemoveNumbers();

        foreach (Node child in p.ChildNodes)
        {
            Node dstNode = builder.Document.ImportNode(child, true, ImportFormatMode.UseDestinationStyles);

            // Apply formatting from the current DocumentBuilder position.
            Inline inline = dstNode as Inline;
            if (inline != null)
                ApplyCurrentNodeFormatting(builder.Font, inline.Font);

            // Put the nodes into the current paragraph.
            builder.InsertNode(dstNode);
        }

        // Insert a paragraph break if required.
        if (paragraphs.IndexOf(p) < (paragraphs.Count-1))
            builder.Writeln();
    }
}

private static void ApplyCurrentNodeFormatting(Aspose.Words.Font src, Aspose.Words.Font dst)
{
    dst.Italic = src.Italic;
    dst.Bold = src.Bold;
    // ......
    // Here you can apply more formatting from the source.
}

So far so good. Now I must insert numbered items even when non-numbered items appear between them. I can’t find how to remember my previous number and indent position.
See attached samples.
out.docx (14.3 KB)
src.docx (14.0 KB)
in.docx (13.4 KB)

While the example I provided above starts with number 1, the insertion point may already be at level 2 under item a., in which case the list should continue with b., just as it would continue with 2 at the top level. The number/letter and indent must be consistent with where the list left off before the non-numbered item.
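
One possible approach, offered as a sketch rather than a confirmed answer: read DocumentBuilder.ListFormat.List and ListFormat.ListLevelNumber before calling RemoveNumbers(), and assign them back when the next numbered item starts. The class ListStateSketch and the method WritePlainThenResume are hypothetical names. The sketch assumes the builder is positioned in an empty list paragraph (for example, right after a Writeln of the previous item) and that re-applying the same list object continues its numbering.

```csharp
using Aspose.Words;

public static class ListStateSketch
{
    // Writes one un-numbered paragraph, then resumes the list (same list,
    // same level) that was active at the builder position and writes the
    // next numbered item.
    public static void WritePlainThenResume(DocumentBuilder builder, string plainText, string nextItemText)
    {
        // Remember the list and level before numbering is removed.
        Aspose.Words.Lists.List savedList = builder.ListFormat.List;
        int savedLevel = builder.ListFormat.ListLevelNumber;

        // Write the interrupting paragraph without numbering.
        builder.ListFormat.RemoveNumbers();
        builder.Writeln(plainText);

        // Restore the remembered list and level, if there was a list.
        if (savedList != null)
        {
            builder.ListFormat.List = savedList;
            builder.ListFormat.ListLevelNumber = savedLevel;
        }

        builder.Writeln(nextItemText);
    }
}
```

In the InsertListItems loop, the same idea would mean saving the list and level once after MoveToBookmark, and restoring them whenever a paragraph with IsListItem set follows one without it.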