Word TOC and content parsing

nitin.mistry.bell.ca · August 15, 2012, 11:02am

Tahir,

Thank you so much.

I will try it out.

Nitin

tahir.manzoor · August 15, 2012, 2:40pm

Hi Nitin,

Please feel free to ask if you have any questions about Aspose.Words. We will be happy to help you.

nitin.mistry.bell.ca · August 17, 2012, 2:01pm

3 more questions:

1. How can I check if the first page (after the TOC) has an image? This is because I am trying to parse a document that is an ebook, so it may have a cover image.

2. If the image is present (see 1 above), how can I save that image to an image file?

3. How can I enumerate the bookmarks in the Word doc?

Thanks so much…

nitin.mistry.bell.ca · August 18, 2012, 5:07pm

Hi
The following code fails

String fieldText = GetFieldCode(fstart);

ArrayList extractedNodes = ExtractContent(bookmarkStart, bookmarkEnd, false);

Document doc2 = GenerateDocument(doc, extractedNodes);

because:
GetFieldCode
ExtractContent
GenerateDocument
are undefined.

nitin.mistry.bell.ca · August 20, 2012, 9:33am

Hi
The following code fails

String fieldText = GetFieldCode(fstart);

ArrayList extractedNodes = ExtractContent(bookmarkStart, bookmarkEnd, false);

Document doc2 = GenerateDocument(doc, extractedNodes);

because the methods:
GetFieldCode
ExtractContent
GenerateDocument
are undefined.

tahir.manzoor · August 23, 2012, 4:46am

Hi Nitin,

Thanks for your query. You can find the code for the ExtractContent and GenerateDocument methods at the following documentation link.

https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

GetFieldCode method:

private static string GetFieldCode(Aspose.Words.Fields.FieldStart fieldStart)
{
    StringBuilder builder = new StringBuilder();
    for (Node node = fieldStart; node != null && node.NodeType != NodeType.FieldSeparator && node.NodeType != NodeType.FieldEnd; node = node.NextPreOrder(node.Document))
    {
        // Use the text only of Run nodes to avoid duplication.
        if (node.NodeType == NodeType.Run)
            builder.Append(node.GetText());
    }
    return builder.ToString();
}
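
For a TOC entry, the text returned by GetFieldCode is the raw field code, which for a page-reference field usually looks something like ` PAGEREF _Toc332806284 \h ` (the number here is made up for illustration). The sketch below shows one possible way to pull the bookmark name out of such a string with plain string operations. The names `TocFieldCode` and `ExtractBookmarkName` are hypothetical and not part of Aspose.Words, and the sketch assumes the bookmark name starts with `_Toc` and ends at the next whitespace character.

```csharp
using System;

public static class TocFieldCode
{
    // Returns the "_Toc..." bookmark name found in a field code,
    // or null if the field code does not reference a TOC bookmark.
    public static string ExtractBookmarkName(string fieldCode)
    {
        if (string.IsNullOrEmpty(fieldCode))
            return null;

        int start = fieldCode.IndexOf("_Toc", StringComparison.Ordinal);
        if (start < 0)
            return null;

        // The name runs until the next whitespace character (or the end of
        // the string), which drops trailing switches such as \h.
        int end = start;
        while (end < fieldCode.Length && !char.IsWhiteSpace(fieldCode[end]))
            end++;

        return fieldCode.Substring(start, end - start);
    }
}
```

With this helper, `TocFieldCode.ExtractBookmarkName(" PAGEREF _Toc332806284 \\h ")` returns `"_Toc332806284"`, and a field code without a `_Toc` reference returns null.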

You can get the first image from the document by using the following code snippet.

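Document doc = new Document(MyDir + "tocx.docx");
// Get the first Shape node in the document; GetChild returns null if there is none.
Shape shape = (Shape)doc.GetChild(NodeType.Shape, 0, true);
if (shape != null && shape.HasImage)
{
    // Build a file name with the extension that matches the image type, then save the image.
    string imageFileName = string.Format(
        "Image.ExportImages.{0} Out{1}", 1, FileFormatUtil.ImageTypeToExtension(shape.ImageData.ImageType));
    shape.ImageData.Save(MyDir + imageFileName);
}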

Regarding the following question, please see the code snippet below for working with bookmarks. Please let us know if this does not help you.

3. How can I enumerate the book marks in the word doc?

Document doc = new Document(MyDir + "tocx.docx");
Console.WriteLine(doc.Range.Bookmarks.Count);
foreach (Aspose.Words.Bookmark bookmark in doc.Range.Bookmarks)
{
    // Process each bookmark here, for example read bookmark.Name or bookmark.Text.
}

Hope this answers your queries. Please let us know if you have any more queries.

MarkusSallmutter · August 23, 2012, 7:34am

Hi Nitin! Hi Tahir!

I'm sorry for writing in this thread, but I have a similar requirement and your code helped me a lot.

I have two further questions and would be very glad if you could help me.

1. Is it possible to get information about how many pages the TOC is using?
2. I like the extraction of the TOC contents very much, but I would like to extract each chapter into its own file.
For example:

1.textextext
1.1.textunder1
1.1.1.textundertextunder1
2.text2
…
…and so on

Now I would like to get the whole content of 1, including 1.1 and 1.1.1, in a single document, and the content of 2, including all its subchapters, in another document.

I hope you can understand my requirement and help me.

Best regards

tahir.manzoor · August 24, 2012, 5:25am

Hi Nitin,

Thanks for your query.

1. Is it possible to get information about how many pages the TOC is using?

A Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically there is no “Page” concept in a Word document.

2. I like the extracting of the TOC contents very much, but I would like to extract the chapter in one file.

Please check the code at the following forum link for your reference. Hope this helps you. Please let us know if you have any more queries.

https://forum.aspose.com/t/60963

MarkusSallmutter · August 24, 2012, 5:50am

Hi Tahir!

Thanks for your answer!

I have already managed to get the TOC information the way I need it. Thanks for the information.
I will check the code as soon as possible and let you know if I need more help with it.

Thanks for helping me, Tahir.

best regards

nitin.mistry.bell.ca · August 24, 2012, 7:08am

Hi Tahir,

Thank you for the info.

very much appreciated

Nitin

tahir.manzoor · August 27, 2012, 3:51am

Hi Nitin,

Please feel free to ask if you have any questions about Aspose.Words. We will be happy to help you.

MarkusSallmutter · September 18, 2012, 6:39am

Hi Tahir!

I've checked the link above and it looks very interesting.

I tried to get it to work for my requirement, but I couldn't manage it.

So it would be great if you could give me a little help with this.
What I would like to do is extract the whole content of one chapter into a new document. The problem for me is knowing where the chapter starts and where it ends. The only information I get is one bookmark somewhere in the chapter, and I want to extract the whole chapter that contains the bookmark.

I don't know if this is possible, but I think there must be a way to do this with Aspose.

Best regards

tahir.manzoor · September 19, 2012, 8:15am

Hi Nitin,

Thanks for your query. It would be great if you could share your document. I will work with your document and share the code accordingly. Have you tried the code shared at the following forum link?

https://forum.aspose.com/t/53393

MarkusSallmutter · September 19, 2012, 8:47am

Hi Tahir!

Thanks for your answer! Sure, I can upload a sample document.

About the link: it references the thread we are writing in, doesn't it? I tried to achieve my requirement with the code above, but I couldn't get it to work. Any help would be great.

Best regards

tahir.manzoor · September 20, 2012, 5:27am

Hi Nitin,

Thanks for sharing the document. In your scenario, I suggest you bookmark the contents of each chapter and use the code shared at the following documentation link. You can then extract the contents from a bookmark.

https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

Please let us know if you have any more queries.

nitin.mistry.bell.ca · April 1, 2015, 3:04pm

The above code works very well for all TOC items
EXCEPT the LAST item. The last item's content is not picked up.
I think this is because of the [i + 1] in the code below.

for (int i = 0; i < tocitems.Count - 1; i++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks[tocitems[i].ToString()].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks[tocitems[i + 1].ToString()].BookmarkStart;
    // Firstly extract the content between these nodes including the bookmark.
    ArrayList extractedNodes = ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document doc2 = GenerateDocument(doc, extractedNodes);
    doc2.Save(MyDir + tocitems[i] + "AsposeOut.docx");
}

tahir.manzoor · April 2, 2015, 7:06am

Hi Nitin,

Thanks for your inquiry. It would be great if you could share the following details for investigation purposes.

Please attach your input Word document.
Please create a standalone/runnable simple application (for example a Console Application Project) that demonstrates the code (Aspose.Words code) you used to generate your output document.
Please attach the output Word file that shows the undesired behavior.
Please attach your target Word document showing the desired behavior. You can use Microsoft Word to create your target Word document. I will investigate how you expect your final document to be generated.

Unfortunately, it is difficult to say what the problem is without the document(s) and simplified application. We need your document(s) and simple project to reproduce the problem. As soon as you get these pieces of information to us, we'll start our investigation into your issue.

nitin.mistry.bell.ca · April 2, 2015, 9:51am

Hi Tahir,

Please see the attached sample VS2013 solution.
The sample doc is in the Debug/bin folder.

You will see that I have 11 TOC items in my doc,
but the code only extracts the content for 10 TOC items.
The last item is never extracted.

The code basically gets all TOC items,
then extracts the content BETWEEN one link and the next sibling.
The problem is that for the LAST TOC item there is NO next sibling,
and therefore its content is not extracted.

tahir.manzoor · April 3, 2015, 4:06am

Hi Nitin,

Thanks for your inquiry. In your case, I suggest you insert a bookmark at the end of the document and add it to the TOC array list as shown below. Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(WordFilePath);
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToDocumentEnd();
builder.StartBookmark("_TocEnd");
builder.EndBookmark("_TocEnd");
//***********************************************************
// GET CONTENT
// **********************************************************
NodeCollection nodes = doc.GetChildNodes(NodeType.FieldStart, true);
// Get list of bookmarks listed in TOC
ArrayList tocitems = new ArrayList();
foreach (Aspose.Words.Fields.FieldStart fstart in nodes)
{
    if (fstart.FieldType == Aspose.Words.Fields.FieldType.FieldPageRef)
    {
        String fieldText = GetFieldCode(fstart);
        if (fieldText.Contains("_Toc"))
        {
            fieldText = fieldText.Substring(fieldText.IndexOf("_Toc"), fieldText.Length - fieldText.IndexOf("_Toc")).Replace("\\h", "").Trim();
            tocitems.Add(fieldText);
        }
    }
}
tocitems.Add("_TocEnd");
LBL_TOTAL_TOC_ITEMS.Text = tocitems.Count.ToString();
// **************************************************
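
To see why the extra `_TocEnd` entry helps, here is a small library-free sketch of the pairing logic used by the extraction loop earlier in the thread. It is only an illustration: the class and method names (`TocRanges`, `PairWithNext`) are hypothetical, the bookmark names in the usage note are made up, and it uses plain strings instead of Aspose.Words bookmarks.

```csharp
using System;
using System.Collections.Generic;

public static class TocRanges
{
    // Pairs each bookmark name with the one that follows it.
    // A list of n names yields n - 1 (start, end) pairs, so the last
    // name only gets its own range if an end marker follows it.
    public static List<KeyValuePair<string, string>> PairWithNext(IList<string> names)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < names.Count - 1; i++)
            pairs.Add(new KeyValuePair<string, string>(names[i], names[i + 1]));
        return pairs;
    }
}
```

For the names `_Toc1`, `_Toc2`, `_Toc3` this gives two ranges, so `_Toc3` has no range of its own. After appending `_TocEnd`, the same call gives three ranges, the last one running from `_Toc3` to `_TocEnd`. Note that the item count then includes the end marker, so a count shown to the user may need to be reduced by one.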

nitin.mistry.bell.ca · April 4, 2015, 6:59pm

that fixed the issue.
THANK YOU so much!!