How to Extract Content Between Paragraphs with Heading 1 Style Using .NET | Table of Contents

Jugurtha_MAHDAD · April 2, 2021, 2:43pm

Hi all,
I have two problems with Aspose.Words for .NET when I try to split my Word document using its table of contents (TOC):

The first is with the file source1.docx below:
My desired output is 5 files, each containing the content that the hyperlink of one of the 5 first-level titles in the table of contents points to (the bookmarked paragraph):

Introduction
Abbreviations & Definitions
Overview of a transit system
AMEX Requirements
Configuration Guidance

But when I execute my code, it returns 39 files, with all the content of the table.

The second problem is with the file source2.docx:

Here too, my desired output is 2 files, each containing the content that the hyperlink of one of the 2 first-level titles in the table of contents points to:

Le contexte
Description générale de la solution

But my code doesn’t return any Word file; the problem is in this line:

if (tocItem != null && tocItem.Range.Replace(ControlChar.Tab, ControlChar.Tab) > 1)

This is my code:

Aspose.Words.License license = new Aspose.Words.License();

// This line attempts to set a license from several locations relative to the executable and Aspose.Words.dll.
// You can also use the additional overload to load a license from a stream, this is useful for instance when the
// license is stored as an embedded resource
try
{
    license.SetLicense("C:\\Aspose\\Aspose.Words.NET.lic");
    Console.WriteLine("License set successfully.");
}
catch (Exception e)
{
    // We do not ship any license with this example, visit the Aspose site to obtain either a temporary or permanent license.
    Console.WriteLine("\nThere was an error setting the license: " + e.Message);
}
Aspose.Words.Document doc = new Aspose.Words.Document("C:\\Aspose\\source1.docx");
//Document doc = new Document("C:\temp\Aspose Word\Source Word.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

ArrayList listOfParagraphs = new ArrayList();
foreach (Field field in doc.Range.Fields)
{
    if (field.Type.Equals(Aspose.Words.Fields.FieldType.FieldHyperlink))
    {
        FieldHyperlink hyperlink = (FieldHyperlink)field;
        if (hyperlink.SubAddress != null && hyperlink.SubAddress.StartsWith("_Toc"))
        {
            Paragraph tocItem = (Paragraph)field.Start.GetAncestor(NodeType.Paragraph);
            if (tocItem != null && tocItem.Range.Replace(ControlChar.Tab, ControlChar.Tab) > 1)
            {
                Bookmark bm = doc.Range.Bookmarks[hyperlink.SubAddress];
                // Get the location this TOC Item is pointing to
                Paragraph pointer = (Paragraph)bm.BookmarkEnd.GetAncestor(NodeType.Paragraph);
                listOfParagraphs.Add(pointer);
            }
        }
    }
}

for (int i = 0; i < listOfParagraphs.Count; i++)
{
    Paragraph startPara = (Paragraph)listOfParagraphs[i];
    Paragraph endPara = null;

    if (i + 1 == listOfParagraphs.Count)
        endPara = doc.LastSection.Body.LastParagraph;
    else
        endPara = (Paragraph)listOfParagraphs[i + 1];

    ArrayList extractedNodes = ExtractContent(startPara, endPara, true);

    // Insert the content into a new separate document and save it to disk.
    Document dstDoc = GenerateDocument(doc, extractedNodes);
    dstDoc.LastSection.Body.LastParagraph.Remove();

    dstDoc.Save("C:\\Aspose\\output" + i + ".docx");
}

Best regards.

tahir.manzoor · April 2, 2021, 6:09pm

@Jugurtha_MAHDAD

You want to extract the content between paragraphs that have the ‘Heading 1’ style. Please use the following code example to get the desired output.

Aspose.Words.Document doc = new Aspose.Words.Document(MyDir + "source1.docx");
//Document doc = new Document("C:\temp\Aspose Word\Source Word.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

ArrayList listOfParagraphs = new ArrayList();
foreach (Field field in doc.Range.Fields)
{
    if (field.Type.Equals(Aspose.Words.Fields.FieldType.FieldHyperlink))
    {
        FieldHyperlink hyperlink = (FieldHyperlink)field;
        if (hyperlink.SubAddress != null && hyperlink.SubAddress.StartsWith("_Toc"))
        {
            Paragraph tocItem = (Paragraph)field.Start.GetAncestor(NodeType.Paragraph);


            if (tocItem != null && tocItem.Range.Replace(ControlChar.Tab, ControlChar.Tab) > 1)
            {
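                Bookmark bm = doc.Range.Bookmarks[hyperlink.SubAddress];
                if (bm == null)
                    continue; // No bookmark with this name; skip this TOC entry.
                // Get the paragraph this TOC item points to.
                Paragraph pointer = (Paragraph)bm.BookmarkEnd.GetAncestor(NodeType.Paragraph);
                // Keep only targets styled as Heading 1, so the split happens at first-level titles.
                if (pointer != null && pointer.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1)
                    listOfParagraphs.Add(pointer);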
            }
        }
    }
}

for (int i = 0; i < listOfParagraphs.Count; i++)
{
    Paragraph startPara = (Paragraph)listOfParagraphs[i];
    Paragraph endPara = null;

    if (i + 1 == listOfParagraphs.Count)
        endPara = doc.LastSection.Body.LastParagraph;
    else
        endPara = (Paragraph)listOfParagraphs[i + 1];

    ArrayList extractedNodes = ExtractContent(startPara, endPara, true);

    // Insert the content into a new separate document and save it to disk.
    Document dstDoc = GenerateDocument(doc, extractedNodes);
    dstDoc.LastSection.Body.LastParagraph.Remove();

    dstDoc.Save(MyDir + "output" + i + ".docx");
}
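
The loop above calls ExtractContent and GenerateDocument, which are not defined anywhere in this thread. They appear to be the helper methods from the Aspose.Words documentation examples on extracting content between nodes. As a rough sketch only, and assuming that is where they come from, GenerateDocument could look like the following (the SplitHelpers class name is hypothetical; ExtractContent is much longer and is not reproduced here):

```csharp
using System.Collections;
using Aspose.Words;

public static class SplitHelpers
{
    // Sketch of the GenerateDocument helper used by the loop in this thread.
    // Assumption: it mirrors the helper from the Aspose.Words documentation examples.
    public static Document GenerateDocument(Document srcDoc, ArrayList nodes)
    {
        Document dstDoc = new Document();

        // A new Document starts with one empty paragraph; clear the body
        // so the output begins with the extracted content.
        dstDoc.FirstSection.Body.RemoveAllChildren();

        // NodeImporter copies nodes from one document into another and
        // brings the styles they depend on along with them.
        NodeImporter importer =
            new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);

        foreach (Node node in nodes)
        {
            // true = deep copy, so runs and other child nodes are imported too.
            Node imported = importer.ImportNode(node, true);
            dstDoc.FirstSection.Body.AppendChild(imported);
        }

        return dstDoc;
    }
}
```

With a helper like this, each pass of the loop yields a standalone document holding one Heading 1 section. The call to dstDoc.LastSection.Body.LastParagraph.Remove() then drops the final extracted paragraph, which for every section but the last is the next heading.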

Jugurtha_MAHDAD · April 4, 2021, 3:34pm

Thank you for your response, it’s working well with source1.docx.
But not with the Word file source2.docx.

Best regards.

Jugurtha_MAHDAD · April 4, 2021, 4:31pm

I found the solution, thank you @tahir.manzoor
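
The thread does not record what the fix for source2.docx was. For context, tocItem.Range.Replace(ControlChar.Tab, ControlChar.Tab) replaces each tab with itself and returns the number of replacements, so the condition in the code above only accepts TOC paragraphs that contain more than one tab. A minimal sketch of that check on plain strings follows; the two sample entries are hypothetical, and whether the entries in source2.docx really contain a single tab is an assumption, not something the thread confirms:

```csharp
using System;
using System.Linq;

public static class TocTabCheck
{
    // Counts tab characters in the text of a TOC entry. This is the number
    // the Replace(tab, tab) call reports for a paragraph.
    public static int CountTabs(string text) => text.Count(c => c == '\t');

    public static void Main()
    {
        // Hypothetical entry with a heading number: number, tab, title, tab, page.
        string numberedEntry = "1\tIntroduction\t3";
        // Hypothetical entry without a heading number: title, tab, page.
        string plainEntry = "Le contexte\t2";

        Console.WriteLine(CountTabs(numberedEntry) > 1); // True
        Console.WriteLine(CountTabs(plainEntry) > 1);    // False
    }
}
```

If the entries in a document have only one tab, the "> 1" test rejects all of them and no file is produced, which would match the symptom reported for source2.docx.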