Merge and Consolidate Citations

We are looking to combine multiple documents that each may contain citations. Currently the citations are included at the original position at the end of each document we combine. Does Aspose have any support for consolidating these citations into a single list at the end of the document (or a specified location)? Is there any support to remove any duplicate references or at least a way we can access the citation and its corresponding number in the document to do this manually? Thanks for any help that can be provided.

@jferguson9018,

To ensure a timely and accurate response, please ZIP and attach the following resources here for testing:

  • Your simplified input Word documents
  • The output document generated by Aspose.Words that shows the undesired behavior
  • Your expected document showing the correct output. You can create the expected document in MS Word; please also list the steps you performed in MS Word to create it.
  • A simplified standalone application (source code without compilation errors) that reproduces the problem on our end

As soon as you have these pieces of information ready, we will start investigating your scenario and provide more information. Thanks for your cooperation.

Thank you for the response. I have put together a quick zip containing the requested information as best as possible. I do not have access to the citation plugin used, so I had to manually generate the result document. In this example the same citations are present in both documents, so the result should be a single set of references at the end of the document (duplicates removed and consolidated). If the references were different in the documents we would need to do some renumbering as well when consolidating, so each item pointing to a reference would point to the correct one if there were changes.

Nothing special was required to be done in the code, just appending documents and saving the output.

Citations.zip (92.7 KB)

@jferguson9018,

In this case, you can get the ‘expected’ output by using the following code:

// Requires: using Aspose.Words; using Aspose.Words.Fields;
Document doc = new Document("E:\\Citations\\Citations01.docx");
Document doc1 = new Document("E:\\Citations\\Citations02.docx");

doc.AppendDocument(doc1, ImportFormatMode.KeepSourceFormatting);

int i = 0;
foreach (Field field in doc.Range.Fields)
{
    if (field.Type == FieldType.FieldAddin)
    {
        FieldAddIn addIn = (FieldAddIn)field;
        if (addIn.GetFieldCode().Equals("ADDIN RW.BIB"))
        {
            if (i < 1)
            {
                Paragraph para = (Paragraph)addIn.Start.GetAncestor(NodeType.Paragraph);
                if (para != null)
                {
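                    // PreviousSibling can be null or a non-paragraph node (for example a table), so avoid a hard cast.
                    Paragraph prevPara = para.PreviousSibling as Paragraph;
                    if (prevPara != null && prevPara.ToString(SaveFormat.Text).Trim().Equals("References"))
                    {
                        prevPara.Remove();
                    }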
                }
                            
                addIn.Remove();
                i++;
            }
            else
            {
                break;
            }
        }
    }
}

foreach(Section sec in doc.Sections)
    sec.PageSetup.SectionStart = SectionStart.Continuous;

doc.Save("E:\\Citations\\19.7.docx");

Thank you for the reply. I think this will work for this exact scenario, but I am not sure it will work in the following one. Notice that "My first reference" appears in both documents' reference lists, while the other references are unique. Any citation pointing to the 3rd reference in the second document should end up pointing to the 1st reference in the merged document.

First Document References
1. My first reference
2. My second reference
3. My third reference

Second Document References
1. My Second document first reference
2. My Second document second reference
3. My first reference

Desired Merged References
1. My first reference
2. My second reference
3. My third reference
4. My Second document first reference
5. My Second document second reference

@jferguson9018,

I think you can build on the following code to get the desired output. The code removes the ADDIN RW.BIB bibliography fields and manually writes their entries at the end of the document.

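// Requires: using System.Collections; using System.Drawing; using Aspose.Words; using Aspose.Words.Fields;
Document doc = new Document("E:\\Temp\\Citations\\Citations01.docx");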
Document doc1 = new Document("E:\\Temp\\Citations\\Citations02.docx");

doc.AppendDocument(doc1, ImportFormatMode.KeepSourceFormatting);

ArrayList list = new ArrayList();
foreach (Field field in doc.Range.Fields)
{
    if (field.Type == FieldType.FieldAddin)
    {
        FieldAddIn addIn = (FieldAddIn)field;
        if (addIn.GetFieldCode().Equals("ADDIN RW.BIB"))
        {
            string[] entries = addIn.DisplayResult.Split(new char[] { '\r' });
            foreach(string entry in entries)
            {
                if (!list.Contains(entry))
                {
                    list.Add(entry);
                }
            }

            addIn.Remove();
        }
    }
}

DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToDocumentEnd();
foreach (string entry in list)
{
    // Format the entries manually, for example as a numbered list:
    // builder.ListFormat.List = doc.Lists.Add(ListTemplate.NumberArabicDot);
    builder.Font.Color = Color.Green;
    builder.Writeln(entry);
}

foreach (Section sec in doc.Sections)
    sec.PageSetup.SectionStart = SectionStart.Continuous;

doc.Save("E:\\Temp\\Citations\\19.7.docx");

Hope this helps.
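
One thing to watch with the `list.Contains(entry)` check above: it compares strings exactly, so two entries that differ only in their leading number or in whitespace are treated as different references. Below is a minimal sketch of normalizing entries before comparing them. `ReferenceNormalizer`, `Normalize`, and `Deduplicate` are hypothetical names (not part of Aspose.Words), and the leading-number pattern is an assumption about how the bibliography text is formatted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, plain .NET only (no Aspose.Words dependency).
public static class ReferenceNormalizer
{
    // Strips a leading list number such as "12.", "12)" or "[12]" and collapses
    // runs of whitespace. Assumption: an entry never legitimately starts with a
    // bare number (for example a year); adjust the pattern if yours can.
    public static string Normalize(string entry)
    {
        if (entry == null) return string.Empty;
        string text = entry.Replace('\u00A0', ' ').Trim();
        text = Regex.Replace(text, @"^\[?\d+[\].)]?\s+", "");
        text = Regex.Replace(text, @"\s+", " ");
        return text;
    }

    // Returns the normalized entries in first-seen order, without duplicates.
    // The comparison ignores case; use StringComparer.Ordinal if case matters.
    public static List<string> Deduplicate(IEnumerable<string> entries)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (string entry in entries)
        {
            string key = Normalize(entry);
            if (key.Length > 0 && seen.Add(key))
                result.Add(key);
        }
        return result;
    }
}
```

In the loop above you could pass `entries` through `Deduplicate` (or call `Normalize` before `list.Contains`) and write fresh numbers when generating the final list.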

Thanks, this was a good start towards accomplishing what we are looking to do. I am including the result from my test of the code for reference.

Merged.zip (21.3 KB)

There are still a couple of issues that we need to handle.

  1. There are still some duplicates in the references area. I suspect this is due to a slight format difference, extra space, or something that I can work to filter out. I don’t think this is a big issue.

  2. Next is renumbering of items. I think this can be accomplished for the references section by stripping out the existing numbers from the string and including a new set when generating the final list. The part I am not sure on is how to update the citations pointing to these references in the documents. If document A has a reference of 7 and document B has a reference of 14 and they are the same, then I need the merged document to point the text from both A and B to reference 7.

Any thoughts on how we can accomplish the renumbering of the citations in the text to match what we consolidate in the references section?

@jferguson9018,

In the “Citations01.docx” document, there is a citation 12 after the text ‘inhibitor treatment’. The following code changes it to point to reference 14.

// Requires: using Aspose.Words; using Aspose.Words.Fields;
Document doc = new Document("E:\\Temp\\Citations\\Citations01.docx");

foreach (Field field in doc.Range.Fields)
{
    if (field.Type == FieldType.FieldAddin)
    {
        FieldAddIn addIn = (FieldAddIn)field;
        if (addIn.GetFieldCode().StartsWith("ADDIN RW.CITE"))
        {
            if (addIn.Result.Equals("12"))
            {
                // The displayed number is stored in the run that follows the field separator.
                Run run = addIn.Separator.NextSibling as Run;
                if (run != null)
                    run.Text = "14";
                addIn.Update();
                break;
            }
        }
    }
}

doc.Save("E:\\Temp\\Citations\\19.8.docx"); 

Hope this helps in achieving what you are looking for.
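
To apply this to every citation rather than one hard-coded number, you need a map from each source document's old reference numbers to the numbers in the consolidated list. Below is a minimal sketch in plain C#. `CitationRenumbering` and `BuildMap` are hypothetical names (not part of Aspose.Words), and the sketch assumes the reference strings have already been cleaned so that identical references compare equal.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, plain .NET only (no Aspose.Words dependency).
public static class CitationRenumbering
{
    // merged: the consolidated reference list built so far (new entries are appended).
    // source: one document's references, in their original order.
    // Returns a map from the document's 1-based old number to the 1-based merged number.
    public static Dictionary<int, int> BuildMap(List<string> merged, IList<string> source)
    {
        var map = new Dictionary<int, int>();
        for (int i = 0; i < source.Count; i++)
        {
            int index = merged.IndexOf(source[i]);
            if (index < 0)
            {
                merged.Add(source[i]);
                index = merged.Count - 1;
            }
            map[i + 1] = index + 1;
        }
        return map;
    }
}
```

With the example lists from earlier in this thread, calling `BuildMap` for the first document and then for the second gives the second document the map 1 → 4, 2 → 5, 3 → 1. You could then parse each `ADDIN RW.CITE` result as a number and write the mapped value into the run after the separator, as in the code above. Two assumptions to check against your documents: each citation field holds a single number (results such as "1,3" would need to be split first), and you know which source document a citation came from, for example by renumbering each document before appending it.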

Thanks. This definitely helped a lot. I still have some cleanup to do, but I am now able to get the results I am looking for with a couple of different reference formats.

I am running into one other issue that probably has a quick fix, but I am not familiar enough with the available options to know it. With one of the reference formats (the type in the included example), the "References" heading text is not part of the add-in field that gets removed, so I end up with an instance of it after each merged document. How can I remove these and, if necessary, add the heading back in before the reference list I am generating?

@jferguson9018,

The following code retrieves the first occurrence of ‘References’, clones it, removes it from its original position, and reinserts the clone near the end of the document (before the last paragraph of the last section). Hope this helps in achieving what you are looking for.

// Requires: using Aspose.Words;
Document doc = new Document("E:\\temp\\Citations\\Citations02.docx");

Paragraph targetPara = null;
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.ToString(SaveFormat.Text).Trim().Equals("References"))
    {
        targetPara = para;
        break;
    }
}

if (targetPara != null)
{
    Paragraph clone = (Paragraph) targetPara.Clone(true);
    targetPara.Remove();

    doc.LastSection.Body.InsertBefore(clone, doc.LastSection.Body.LastParagraph);
}
        
doc.Save("E:\\Temp\\Citations\\19.8.docx");

Thank you. I was able to resolve my reference heading issue. It ended up being simpler with this implementation to just remove the existing references as described and then add a new paragraph at the correct location with the text of References and formatted appropriately.

Just when I thought I was done and everything was working, I noticed one other issue that I am looking into. This approach does not preserve the formatting of the consolidated references I add at the end. Only the text is carried over, so hyperlinks, italics, etc. are lost. I thought that switching from addIn.DisplayResult to addIn.Result might resolve the issue, as it includes the HYPERLINK field, but I am not seeing any information about italicized text in that result. Is there any way I can preserve this formatting in the reference list I output?

@jferguson9018,

Please check whether the following code helps in achieving what you are looking for.

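// Requires: using System.Collections; using System.Collections.Generic; using Aspose.Words; using Aspose.Words.Fields;
Document doc = new Document("E:\\Temp\\Citations\\Citations01.docx");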
Document doc1 = new Document("E:\\Temp\\Citations\\Citations02.docx");

doc.AppendDocument(doc1, ImportFormatMode.KeepSourceFormatting);

ArrayList list = new ArrayList();
Dictionary<string, Paragraph> data = new Dictionary<string, Paragraph>();

foreach (Field field in doc.Range.Fields)
{
    if (field.Type == FieldType.FieldAddin)
    {
        FieldAddIn addIn = (FieldAddIn)field;
        if (addIn.GetFieldCode().Equals("ADDIN RW.BIB"))
        {
            Paragraph start = (Paragraph)addIn.Separator.GetAncestor(NodeType.Paragraph);
            Paragraph end = (Paragraph)addIn.End.GetAncestor(NodeType.Paragraph);

            addIn.Unlink();

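            // Record every bibliography paragraph for removal and keep one formatted clone
            // per distinct entry text (ContainsKey replaces the empty try/catch blocks).
            list.Add(start);
            string startKey = start.ToString(SaveFormat.Text).Trim();
            if (!data.ContainsKey(startKey)) data.Add(startKey, (Paragraph)start.Clone(true));

            // Only walk forward when the bibliography spans more than one paragraph;
            // otherwise the loop would never meet 'end' and would collect the rest of the body.
            if (start != end)
            {
                Paragraph para = (Paragraph)start.NextSibling;
                while (para != null && para != end)
                {
                    list.Add(para);
                    string key = para.ToString(SaveFormat.Text).Trim();
                    if (!data.ContainsKey(key)) data.Add(key, (Paragraph)para.Clone(true));
                    para = (Paragraph)para.NextSibling;
                }

                list.Add(end);
                string endKey = end.ToString(SaveFormat.Text).Trim();
                if (!data.ContainsKey(endKey)) data.Add(endKey, (Paragraph)end.Clone(true));
            }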
        }
    }
}

for (int i = list.Count - 1; i >= 0; i--)
{
    Paragraph obj = (Paragraph)list[i];
    obj.Remove();
}

foreach (Paragraph obj in data.Values)
{
    doc.LastSection.Body.InsertBefore(obj, doc.LastSection.Body.LastParagraph);
}

foreach (Section sec in doc.Sections)
    sec.PageSetup.SectionStart = SectionStart.Continuous;

doc.Save("E:\\Temp\\Citations\\19.8.docx");

Thanks again for the help. I ran into a number of small issues with some of the test documents, but I was able to get a working implementation in place based on your suggestions.

@jferguson9018,

Thanks for your feedback. It is great that you were able to find what you were looking for. Please let us know any time you have further queries.