Read TOC from existing Word doc

Norashlea · September 13, 2005, 11:10pm

Are there some code examples of how to use Aspose.Word and/or Aspose.PDF to read the TOC fields in an existing Word doc? My situation is:

I’m using Aspose.Word and Aspose.PDF to generate a PDF file from existing Word documents. The Word document has all the Header# fields defined, but does not contain the generated TOC.

Creating the PDF file is working well, and I was hoping to extract the Header1/Header2/Header3 etc. fields from the Aspose-generated XML file to store in a database table. However, nothing about the nodes in the XML file clearly identifies either the heading level or the sequence of entries. There is an ID field, but its numbering often appears to be out of sync with the order of entries in the Word doc; I suspect this happens when the Word doc is subsequently edited.

This is my code for creating the pdf file:

public byte[] createPdf(string input)
{
    string strFilename = System.IO.Path.GetFileName(input);
    string workPath = input.Substring(0, input.Length - strFilename.Length);
    string fileBase = System.IO.Path.GetFileNameWithoutExtension(strFilename);
    string xmlFile = workPath + fileBase + ".xml";
    string pdfFile = workPath + fileBase + ".pdf";
    try
    {
        // Convert the Word document to the intermediate Aspose.Pdf XML file.
        Aspose.Word.Document doc = new Aspose.Word.Document(input);
        doc.Save(xmlFile, Aspose.Word.SaveFormat.FormatAsposePdf);
        // Bind the intermediate XML and write the PDF.
        Aspose.Pdf.Pdf pdf = new Aspose.Pdf.Pdf();
        pdf.BindXML(xmlFile, null);
        pdf.IsImagesInXmlDeleteNeeded = true;
        pdf.Save(pdfFile);
        return pdf.GetBuffer();
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
}

Thanks,
Sharon.

DmitryV · September 14, 2005, 2:37pm

Hi,

Thank you for considering Aspose.

So, if I understand you correctly, you need to extract all paragraphs of HeadingX style from a Word document, don’t you?

Norashlea · September 14, 2005, 3:49pm

Hi Dmitry,

Yes, I think that’s what I need to do.

(Please excuse that I’m not a very experienced C# programmer!)

Sharon.

DmitryV · September 15, 2005, 5:08am

Use the following code to extract all paragraphs of HeadingX style (in this example X is 1-3) from the document:

NodeList paras = doc.SelectNodes("//Paragraph");

foreach (Paragraph para in paras)
{
    switch (para.ParagraphFormat.StyleIdentifier)
    {
        case StyleIdentifier.Heading1:
        case StyleIdentifier.Heading2:
        case StyleIdentifier.Heading3:
            // This para style is HeadingX
            break;
    }
}

Feel free to post your further questions here if something is not clear.

Norashlea · September 15, 2005, 3:37pm

Thank you Dmitry, I will give this a try.

Sharon.

Norashlea · September 15, 2005, 7:10pm

Thanks Dmitry, this got me on the right track, and I can write the TOC in an HTML string, but there’s still a minor problem:

I hadn’t noticed before, but Word’s outline numbering for the Heading styles disappears. This means that the PDF document doesn’t have the outline numbers displayed, and neither does the TOC.

In the TOC this becomes a problem because all entries are at the same “level”.

I need the TOC to correspond to the PDF document, so I’ve been trying to create the HTML string (which will be stored in a SQL Server table and read when needed) so that each row will display with the correct indentation, i.e. no numbering or bullets, just indented according to the heading level. I’ve done it ok by including &nbsp; where I want indents, but I don’t think this is a very good practice for the HTML??

So: can you tell me whether it’s possible to include the Word outline numbering with the heading, in both the PDF document and the NodeList text that I’m getting with para.GetText().ToString()?

Thanks,
Sharon.

DmitryV · September 16, 2005, 8:48am

Yes Sharon, outline numbering is unfortunately not yet supported. So currently the only way to handle level indentation for TOC is determining what Heading# style the corresponding paragraphs belong to.
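
As a rough illustration of indenting by heading level, here is a minimal sketch that builds the TOC HTML with a CSS margin per entry instead of leading spaces. It assumes the level (1-3) has already been worked out from the Heading1/Heading2/Heading3 cases in the switch shown earlier. The names TocHtml, Entry, Build and EmPerLevel are hypothetical and are not part of Aspose.Word or Aspose.PDF. The text is trimmed on the assumption that paragraph text may carry a trailing paragraph mark.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical helper, not part of any Aspose API.
public static class TocHtml
{
    // One TOC entry: the heading level (1-3) and the heading text.
    public sealed class Entry
    {
        public int Level { get; }
        public string Text { get; }

        public Entry(int level, string text)
        {
            Level = level;
            Text = text;
        }
    }

    // Assumed indent step: 2em per heading level below Heading1.
    private const int EmPerLevel = 2;

    // Emits one <div> per entry, indented with margin-left.
    public static string Build(IEnumerable<Entry> entries)
    {
        var sb = new StringBuilder();
        sb.Append("<div class=\"toc\">\n");
        foreach (Entry e in entries)
        {
            int indent = (e.Level - 1) * EmPerLevel;
            // Trim stray whitespace or paragraph marks, then escape the text for HTML.
            string text = WebUtility.HtmlEncode(e.Text.Trim());
            sb.Append("  <div style=\"margin-left:")
              .Append(indent)
              .Append("em\">")
              .Append(text)
              .Append("</div>\n");
        }
        sb.Append("</div>");
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample entries; in practice these would be collected inside the
        // foreach loop over the document's paragraphs.
        var entries = new List<TocHtml.Entry>
        {
            new TocHtml.Entry(1, "Introduction"),
            new TocHtml.Entry(2, "Scope & Goals"),
            new TocHtml.Entry(3, "Details\r"),
        };
        Console.WriteLine(TocHtml.Build(entries));
    }
}
```

The resulting string can be stored in the database as-is. Moving the inline style into a stylesheet class per level (for example one class for each of the three heading levels) would keep the stored HTML smaller, if that suits the page that displays it.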