Creating a "manual" TOC for an existing document

I was asked to build a service that adds a highly customized TOC to existing Word documents. Those documents already have an empty page (the 2nd page), and the TOC should be added to that page.
Documents already have TC entry fields as well.

Using Aspose.Words, is it possible to

  • get all the existing TC Entry fields in a document
  • get their page numbers
  • create a “manual” TOC using the selected ones (i.e. level 1 only, level 1 and 2, and so on)

By “manual” I mean not using a TOC field, but simply adding properly formatted lines with page numbers.

Thanks
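
For reference, the three steps listed above could be sketched roughly as follows. This is only a sketch, not code from this thread: the class and method names (ManualTocSketch, CollectTcEntries, WriteEntries) and the 18-point indent per level are made up for illustration, and it assumes the Aspose.Words for .NET package is referenced. The page number is taken from the paragraph containing the TC field rather than from the field itself, on the assumption that TC fields are often formatted as hidden text and may not map to a page on their own.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Fields;
using Aspose.Words.Layout;

public static class ManualTocSketch
{
    // Returns (level, text, page) for every TC field whose level is <= maxLevel.
    public static List<Tuple<int, string, int>> CollectTcEntries(Document doc, int maxLevel)
    {
        var entries = new List<Tuple<int, string, int>>();
        var collector = new LayoutCollector(doc);

        foreach (Field field in doc.Range.Fields)
        {
            var tc = field as FieldTC;
            if (tc == null)
                continue; // not a TC entry field

            // The \l switch is optional; Word assumes level 1 when it is missing.
            int level;
            if (!int.TryParse(tc.EntryLevel, out level))
                level = 1;
            if (level > maxLevel)
                continue;

            // Assumption: use the containing paragraph, since the field may be hidden.
            int page = collector.GetStartPageIndex(field.Start.ParentParagraph);
            entries.Add(Tuple.Create(level, tc.Text, page));
        }

        return entries;
    }

    // Writes one "text <tab> page" line per entry at the builder's current position.
    public static void WriteEntries(DocumentBuilder builder, List<Tuple<int, string, int>> entries)
    {
        foreach (var entry in entries)
        {
            // Hypothetical formatting choice: indent 18 points per level below 1.
            builder.ParagraphFormat.LeftIndent = (entry.Item1 - 1) * 18;
            builder.Writeln(entry.Item2 + ControlChar.Tab + entry.Item3);
        }
    }
}
```

Collecting all entries before writing any of them keeps the page lookup separate from the document changes; this relies on the TOC fitting into the page already reserved for it, so that the added lines do not shift later pages.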

@rioka68,

Please ZIP and upload here, for our reference, 1) your sample input Word document and 2) the expected document, created with MS Word, that shows the final output. We will then provide code to achieve the same using Aspose.Words.

SampleDocs.zip (258.4 KB)
@awais.hafeez

Here are 3 Word files

  • BaseDocument.docx is the document to which content from the other files should be appended
  • ExpectedResult.docx is the output we’d like to generate
  • SampleInputDocument.docx is a sample document whose content should be appended to BaseDocument, and from which a manual TOC should be created (normally there would be many such files, one for each “level 1” entry in TOC)

Thanks

@rioka68,

Please try using the following code:

Document doc = new Document(MyDir + @"SampleDocs\BaseDocument.docx");
Document inDoc = new Document(MyDir + @"SampleDocs\SampleInputDocument.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
foreach(Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.ToString(SaveFormat.Text).Trim().Equals("INDICE"))
    {
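        // In this BaseDocument the TOC lines start three nodes after the "INDICE" heading
        builder.MoveTo(para.NextSibling.NextSibling.NextSibling);

        // Right-aligned tab stop at 6 inches (72 points per inch) for the page numbers
        builder.ParagraphFormat.TabStops.Add(72 * 6, TabAlignment.Right, TabLeader.None);
        builder.Font.Name = "Futura Std Medium";
        builder.Font.Size = 14;
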
        break;
    }
}

doc.AppendDocument(inDoc, ImportFormatMode.KeepSourceFormatting);

LayoutCollector collector = new LayoutCollector(doc);
foreach (Paragraph para in doc.LastSection.Body.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.Runs.Count > 0 &&
        !string.IsNullOrEmpty(para.ToString(SaveFormat.Text).Trim()) &&
        para.Runs[0].Font.Bold &&
        para.Runs[0].Font.Size == 20)
    {
        int pageNumber = collector.GetStartPageIndex(para.Runs[0]);
        builder.Write(para.ToString(SaveFormat.Text).Trim() + ControlChar.Tab + pageNumber + ControlChar.ParagraphBreak);

    }
}

doc.Save(MyDir + @"SampleDocs\18.5.docx");

Thanks @awais.hafeez
This is quite close to the desired output, though not entirely: for example, pages coming from the appended document have no footer (which makes sense, as a new section is started).

I am trying to change the code so that the result matches expectations. I’ll keep you updated.

@rioka68,

Please see Aspose.Words generated output: 18.5.zip (114.3 KB)

Can you please explain this problem by creating a comparison screenshot that highlights (encircles) the problematic areas in this 18.5.docx, and attach it here for our reference?

Maybe you can fix this issue by using the following code: (see 18.5-new.zip (102.0 KB))

Document doc = new Document(MyDir + @"SampleDocs\18.5.docx");

for (int i = 1; i < doc.Sections.Count; i++)
{
    Section sec = doc.Sections[i];
    sec.HeadersFooters.LinkToPrevious(true);

    // Optionally copy further page setup properties from the first section, e.g. the margins
    sec.PageSetup.LeftMargin = doc.FirstSection.PageSetup.LeftMargin;
    sec.PageSetup.TopMargin = doc.FirstSection.PageSetup.TopMargin;
    sec.PageSetup.RightMargin = doc.FirstSection.PageSetup.RightMargin;
    sec.PageSetup.BottomMargin = doc.FirstSection.PageSetup.BottomMargin;
}

doc.Save(MyDir + @"SampleDocs\18.5-new.docx");

@awais.hafeez
Thanks for your support

Here is the best result I can get:
MyCode.zip (251.1 KB)
Copying paragraphs one by one keeps the headers and footers as they are in the base document.
I also chose to identify the titles to include in the TOC by their TC fields, as matching on text and fonts was not reliable for some documents. Moreover, I am only interested in top-level entries for now, but requirements might change in the future, so that is the safer solution.
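
The attached source.cs is not reproduced in this thread, so the following is only a hedged sketch of what copying content node by node into the base document's last section might look like. The names AppendSketch and AppendBodyContent are hypothetical; NodeImporter and ImportNode are the Aspose.Words calls for moving nodes between documents. Because no new section is created, the imported content falls under the headers and footers of the existing section.

```csharp
using Aspose.Words;

public static class AppendSketch
{
    // Appends all block-level content of inDoc to the last section of baseDoc.
    public static void AppendBodyContent(Document baseDoc, Document inDoc)
    {
        var importer = new NodeImporter(inDoc, baseDoc, ImportFormatMode.KeepSourceFormatting);
        Body target = baseDoc.LastSection.Body;

        foreach (Section section in inDoc.Sections)
        {
            // Direct children of the body only (paragraphs and tables), not nested nodes.
            foreach (Node node in section.Body.GetChildNodes(NodeType.Any, false))
            {
                // ImportNode creates a copy owned by baseDoc; inDoc is left unchanged.
                target.AppendChild(importer.ImportNode(node, true));
            }
        }
    }
}
```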

4 files in the zip

  • ExpectedResult.docx is the manually composed document
  • Expected_v18.5-c15221f2-675e-4ba0-ac0d-a334249840cd.docx is the result obtained with the attached code
  • source.cs is the code used
  • discrepancies.png highlights the two main issues, i.e.
    • 5 empty paragraphs before the first imported paragraph (beginning of page 5)
    • a gray rectangle (again on page 5); I cannot work out where it comes from

Any hints?

@rioka68,

Please try using the following code. Hope this helps.

Document doc = new Document(MyDir + @"SampleDocs\BaseDocument.docx");
Document inDoc = new Document(MyDir + @"SampleDocs\SampleInputDocument.docx");

DocumentBuilder builder = new DocumentBuilder(doc);

// Note: Last() requires "using System.Linq;"
var lastPara = doc.LastSection.Body.GetChildNodes(NodeType.Paragraph, true).ToArray().Last();
builder.MoveTo(lastPara);
// Force a page break before inserting the content of the other document (inDoc)
builder.InsertBreak(BreakType.PageBreak);

builder.InsertDocument(inDoc, ImportFormatMode.KeepSourceFormatting);

foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.ToString(SaveFormat.Text).Trim().Equals("INDICE"))
    {
        builder.MoveTo(para.NextSibling.NextSibling.NextSibling);

        builder.ParagraphFormat.TabStops.Add(72 * 6, TabAlignment.Right, TabLeader.None);
        builder.Font.Name = "Futura Std Medium";
        builder.Font.Size = 14;

        break;
    }
}

LayoutCollector collector = new LayoutCollector(doc);
foreach (Paragraph para in doc.LastSection.Body.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.Runs.Count > 0 &&
        !string.IsNullOrEmpty(para.ToString(SaveFormat.Text).Trim()) &&
        para.Runs[0].Font.Bold &&
        para.Runs[0].Font.Size == 20)
    {
        int pageNumber = collector.GetStartPageIndex(para.Runs[0]);
        builder.Write(para.ToString(SaveFormat.Text).Trim() + ControlChar.Tab + pageNumber + ControlChar.ParagraphBreak);

    }
}

doc.Save(MyDir + @"SampleDocs\18.5.docx");

@awais.hafeez
I finally managed to get the required output, and I’m definitely satisfied.

Next step is to get a PDF from the Word document.

I’m facing a problem, though: part of the document is missing from the PDF generated by saving the file as PDF.
The code to convert the file is as simple as:

doc.Save(saveTo + ".pdf", new PdfSaveOptions() {
  EmbedFullFonts = true,
  SaveFormat = SaveFormat.Pdf
});

Here are sample files that show this issue:
WordToPdf.zip (2.3 MB)
The attached zip file contains:

  • 20180513-000643.docx, the source Word document (generated with a modified version of your code)
  • 20180513-000643.pdf, the PDF produced by saving the file as PDF with Aspose.Words
  • 20180513-000643-ExportFromWord.pdf, the PDF generated directly from Word

The 3rd page of the Word document contains a (sort of) TOC (created manually, without a TOC field, i.e. it is simple text, not a field).
That content is missing from the PDF file (20180513-000643.pdf) generated using Aspose.Words.
The same content is present in the PDF generated with Word’s “Export” feature.

What’s wrong with that text?
BTW I’m using Aspose.Words version 18.5.0.0

@rioka68,

Please call the Document.UpdatePageLayout method before saving to PDF. Hope this helps.
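
As a minimal sketch of that suggestion, assuming the same PdfSaveOptions as in the question (the PdfExportSketch and SaveAsPdf names are made up for illustration): the thread does not state why the text was missing, but one plausible explanation is that the LayoutCollector had already built a page layout before the TOC lines were written, so the save reused a stale layout.

```csharp
using Aspose.Words;
using Aspose.Words.Saving;

public static class PdfExportSketch
{
    public static void SaveAsPdf(Document doc, string saveTo)
    {
        // Rebuild the page layout so that content added after an earlier
        // layout pass is taken into account when rendering.
        doc.UpdatePageLayout();

        doc.Save(saveTo + ".pdf", new PdfSaveOptions { EmbedFullFonts = true });
    }
}
```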

@awais.hafeez
thanks, that did the trick!