Word TOC and content parsing

Tahir,

Thank you so much.

I will try it out.

Nitin

Hi Nitin,

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Three more questions:

  1. How can I check if the first page (after the TOC) has an image? I ask because I am trying to parse a document that is an ebook, so it may have a cover image.

  2. If the image is present (see 1 above), how can I save it to an image file?

  3. How can I enumerate the bookmarks in the Word doc?

Thanks so much…

Hi
The following code fails

String fieldText = GetFieldCode(fstart);

ArrayList extractedNodes = ExtractContent(bookmarkStart, bookmarkEnd, false);

Document doc2 = GenerateDocument(doc, extractedNodes);

because the methods GetFieldCode, ExtractContent, and GenerateDocument are undefined.

Hi Nitin,

Thanks for your query. You can find the code of the ExtractContent and GenerateDocument methods at the following documentation link.

https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

GetFieldCode method:

private static string GetFieldCode(Aspose.Words.Fields.FieldStart fieldStart)
{
    StringBuilder builder = new StringBuilder();
    for (Node node = fieldStart; node != null && node.NodeType != NodeType.FieldSeparator && node.NodeType != NodeType.FieldEnd; node = node.NextPreOrder(node.Document))
    {
        // Use the text only of Run nodes to avoid duplication.
        if (node.NodeType == NodeType.Run)
            builder.Append(node.GetText());
    }
    return builder.ToString();
}

You can get the first image from the document by using the following code snippet.

Document doc = new Document(MyDir + "tocx.docx");
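// GetChild returns null when the document contains no Shape node, so check before use.
Shape shape = (Shape)doc.GetChild(NodeType.Shape, 0, true);
if (shape != null && shape.HasImage)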
{
    string imageFileName = string.Format(
        "Image.ExportImages.{0} Out{1}", 1, FileFormatUtil.ImageTypeToExtension(shape.ImageData.ImageType));
    shape.ImageData.Save(MyDir + imageFileName);
}

3. How can I enumerate the bookmarks in the Word doc?

Please see the following code snippet to work with bookmarks. Please let us know if this does not help you.

Document doc = new Document(MyDir + "tocx.docx");
Console.WriteLine(doc.Range.Bookmarks.Count);
foreach (Aspose.Words.Bookmark bookmark in doc.Range.Bookmarks)
{
    // your code
}

Hope this answers your queries. Please let us know if you have any more queries.

Hi Nitin! Hi Tahir!

I'm sorry for writing in this thread, but I have a similar requirement and your code helped me a lot.

I have two further questions and would be very glad if you could help me.

  1. Is it possible to get information about how many pages the TOC is using?

  2. I like the extraction of the TOC contents very much, but I would like to extract each chapter into its own file.
    For example:

1.textextext
1.1.textunder1
1.1.1.textundertextunder1
2.text2

…and so on

Now I would like to get the whole content of 1 in a single document, including 1.1 and 1.1.1, and the content of 2 in another document, including all of its subchapters.

I hope you can understand my requirement and help me with it.

best regards

Hi Nitin,

Thanks for your query.

1. Is it possible to get information about how many pages the TOC is using?

A Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically there is no "page" concept in a Word document.

2. I like the extraction of the TOC contents very much, but I would like to extract each chapter into its own file.

Please check the code at the following forum link for your reference. Hope this helps you. Please let us know if you have any more queries.

https://forum.aspose.com/t/60963

Hi Tahir!

Thanks for your answer!

  1. I have already managed to get the TOC information the way I need it. Thanks for the information.

  2. I will check the code as soon as possible and let you know if I need more help with it.

Thanks for helping me, Tahir.

best regards

Hi Tahir,

Thank you for the info.

very much appreciated

Nitin

Hi Nitin,

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Hi Tahir!

I've checked the link above and it looks very interesting.

I tried to get it to work for my requirement, but I couldn't.

So it would be great if you could give me a little help with this.
What I would like to do is extract the whole content of one chapter into a new document. The problem for me is knowing where the chapter starts and where it ends. The only information I get is one bookmark somewhere in a chapter, and I want to extract the whole chapter that contains that bookmark.

I don't know if this is possible, but I think there must be a way to do it with Aspose.

best regards

Hi Nitin,

Thanks for your query. It would be great if you could share your document. I will work on your document and share the code accordingly. Have you tried the code shared at the following forum link?

https://forum.aspose.com/t/53393

Hi Tahir!

Thanks for your answer! Sure, I can upload a sample document.

About the link… it references the thread we are writing in, doesn't it? I tried to achieve my requirement with the code above, but I couldn't get it to work. Any help would be great.

best regards

Hi Nitin,

Thanks for sharing the document. In your scenario, I suggest you bookmark the contents of each chapter and use the code shared at the following documentation link. You can then extract the contents from a bookmark.

https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

Please let us know if you have any more queries.

The above code works very well for all TOC items
EXCEPT the LAST item. The last item's content is not picked up.
I think this is because of the [i + 1] in the code below.

for (int i = 0; i < tocitems.Count - 1; i++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks[tocitems[i].ToString()].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks[tocitems[i + 1].ToString()].BookmarkStart;
    // Firstly extract the content between these nodes including the bookmark.
    ArrayList extractedNodes = ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document doc2 = GenerateDocument(doc, extractedNodes);
    doc2.Save(MyDir + tocitems[i] + "AsposeOut.docx");
}

Hi Nitin,

Thanks for your inquiry. Please share the following details for investigation purposes:

  • Please attach your input Word document.
  • Please create a standalone, runnable simple application (for example, a Console Application project) that demonstrates the Aspose.Words code you used to generate your output document.
  • Please attach the output Word file that shows the undesired behavior.
  • Please attach your target Word document showing the desired behavior. You can use Microsoft Word to create it. I will investigate how you expect your final document to be generated.

Unfortunately, it is difficult to say what the problem is without the document(s) and a simplified application. We need your document(s) and a simple project to reproduce the problem. As soon as you get these pieces of information to us, we'll start our investigation into your issue.

Hi Tahir,

See attached a sample VS2013 solution.
The sample doc is in the Debug/bin folder.

You will see that I have 11 TOC items in my doc,
but the code only extracts the content for 10 of them.
The last item is never extracted.

The code basically gets all TOC items,
then extracts the content BETWEEN one item's bookmark and the next item's bookmark.
The problem is that the LAST TOC item has NO next item,
so its content is not extracted.
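
The pairing problem described here can be sketched without any Aspose.Words calls. The helper below is only an illustration: the class name TocRanges, the method Pair, and the endMarker parameter are hypothetical names, not part of the attached solution or of the library. It pairs every bookmark name with the one that follows it and falls back to a caller-supplied end marker for the final item, so no item is skipped.

```csharp
using System;
using System.Collections.Generic;

public static class TocRanges
{
    // Pairs each TOC bookmark name (Key) with the name that ends its range (Value).
    // endMarker is a hypothetical sentinel name that closes the range of the last item,
    // which otherwise has no following bookmark.
    public static List<KeyValuePair<string, string>> Pair(IList<string> tocItems, string endMarker)
    {
        var ranges = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < tocItems.Count; i++)
        {
            // Use the next item as the end of the range, or the sentinel for the last item.
            string end = (i + 1 < tocItems.Count) ? tocItems[i + 1] : endMarker;
            ranges.Add(new KeyValuePair<string, string>(tocItems[i], end));
        }
        return ranges;
    }
}
```

With three items and an end marker, this yields three ranges rather than two; a loop bounded by Count - 1 with no end marker would yield only the first two.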

Hi Nitin,

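Thanks for your inquiry. In your case, I suggest you insert a bookmark at the end of the document and add it to the TOC array list, as shown below. Because the extraction loop runs to tocitems.Count - 1, the added "_TocEnd" entry serves only as the end point for the last real TOC item. Hope this helps you. Please let us know if you have any more queries.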

Document doc = new Document(WordFilePath);
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToDocumentEnd();
builder.StartBookmark("_TocEnd");
builder.EndBookmark("_TocEnd");
//***********************************************************
// GET CONTENT
// **********************************************************
NodeCollection nodes = doc.GetChildNodes(NodeType.FieldStart, true);
// Get list of bookmarks listed in TOC
ArrayList tocitems = new ArrayList();
foreach (Aspose.Words.Fields.FieldStart fstart in nodes)
{
    if (fstart.FieldType == Aspose.Words.Fields.FieldType.FieldPageRef)
    {
        String fieldText = GetFieldCode(fstart);
        if (fieldText.Contains("_Toc"))
        {
            fieldText = fieldText.Substring(fieldText.IndexOf("_Toc"), fieldText.Length - fieldText.IndexOf("_Toc")).Replace("\\h", "").Trim();
            tocitems.Add(fieldText);
        }
    }
}
tocitems.Add("_TocEnd");
LBL_TOTAL_TOC_ITEMS.Text = tocitems.Count.ToString();
// **************************************************
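
The Substring/Replace line above pulls the bookmark name out of the PAGEREF field code. As a rough alternative, the same step can be written as a small standalone helper. This is a sketch under assumptions: it assumes the field code text looks something like " PAGEREF _Toc383525841 \h ", and the class name TocFieldCode and method ExtractBookmarkName are hypothetical, not part of Aspose.Words.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TocFieldCode
{
    // Returns the first "_Toc..." bookmark name found in the field code text,
    // or null when the text contains no such name.
    public static string ExtractBookmarkName(string fieldCode)
    {
        if (string.IsNullOrEmpty(fieldCode))
            return null;

        // "_Toc" followed by letters, digits, or underscores; this stops at the
        // whitespace before any switch such as \h.
        Match match = Regex.Match(fieldCode, @"_Toc\w+");
        return match.Success ? match.Value : null;
    }
}
```

Unlike the Replace("\\h", "") approach, this does not depend on which switches follow the bookmark name, since matching stops at the first character that cannot belong to the name.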

That fixed the issue.
THANK YOU so much!!