How can I insert a PDF document at the position of text that needs to be replaced in my main document?

panCognity · July 18, 2023, 12:11pm

Hi team.

In my main document, a Word document, I want to find text and replace it with a PDF document. Would it be better to convert the PDF file to a Word document before inserting it into the main document? Or is it possible to insert the PDF document directly at the position where I find the text that needs to be replaced? I have a console application in .NET using C# with Aspose.Words and Aspose.Pdf.

alexey.noskov · July 18, 2023, 12:44pm

@panCognity You can use IReplacingCallback to insert a document at the placeholder. Here is simple code that achieves this without any preprocessing of the source document:

using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Replacing;

Document doc = new Document("C:\\Temp\\in.docx");
Document srcDoc = new Document("C:\\Temp\\dome_pdf.pdf");

FindReplaceOptions options = new FindReplaceOptions(FindReplaceDirection.Backward);
ReplaceEvaluatorFindAndReplaceWithDocument replaceWithDocumentCallback = new ReplaceEvaluatorFindAndReplaceWithDocument();
replaceWithDocumentCallback.ReplacementDocument = srcDoc;
options.ReplacingCallback = replaceWithDocumentCallback;
doc.Range.Replace("<placehoder>", "", options);

doc.Save("C:\\Temp\\out.docx");

internal class ReplaceEvaluatorFindAndReplaceWithDocument : IReplacingCallback
{
    /// <summary>
    /// This method is called by the Aspose.Words find and replace engine for each match.
    /// </summary>
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        Document doc = (Document)e.MatchNode.Document;

        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and may be the only) run can contain text before the match, 
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // This array is used to store all nodes of the match for further deleting.
        List<Run> runs = new List<Run>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            remainingLength > 0 &&
            currentNode != null &&
            currentNode.GetText().Length <= remainingLength)
        {
            runs.Add((Run)currentNode);
            remainingLength -= currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            } while (currentNode != null && currentNode.NodeType != NodeType.Run);
        }

        // Split the last run that contains the match if there is any text left.
        if (currentNode != null && remainingLength > 0)
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add((Run)currentNode);
        }

        // Create DocumentBuilder to insert the replacement document.
        DocumentBuilder builder = new DocumentBuilder(doc);
        // Move builder to the first run.
        builder.MoveTo(runs[0]);
        // Insert document.
        builder.InsertDocument(ReplacementDocument, ImportFormatMode.UseDestinationStyles);

        // Delete matched runs
        foreach (Run run in runs)
            run.Remove();

        // Remove empty paragraph if any.
        if(string.IsNullOrEmpty(builder.CurrentParagraph.ToString(SaveFormat.Text).Trim()))
        {
            builder.CurrentParagraph.Remove();
        }

        // Signal to the replace engine to do nothing because we have already done all what we wanted.
        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        run.ParentNode.InsertAfter(afterRun, run);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        return afterRun;
    }

    public Document ReplacementDocument { get; set; }
}

FYI: Aspose.Words has a powerful LINQ Reporting Engine, which you can use instead of the find/replace approach to fill a template with data. The LINQ Reporting Engine template syntax also uses simple text placeholders in the format <<[expression]>>. To insert a document, you can use the following syntax:
<<doc [document_expression]>>
For example, the following code does the same as the code above, but delegates all the work to the LINQ Reporting Engine:

Document doc = new Document(@"C:\Temp\linq_template.docx");
Document srcDoc = new Document("C:\\Temp\\dome_pdf.pdf");
ReportingEngine engine = new ReportingEngine();
engine.BuildReport(doc, srcDoc, "srcDoc");
doc.UpdateFields();
doc.Save(@"C:\Temp\out.docx");

in.docx (14.3 KB)

panCognity · July 19, 2023, 8:42am

Thank you! It seems to work. However, I need to see if inserting many pdf documents into the main document goes smoothly. Also, the trial version doesn’t output all the pages of the main document.

alexey.noskov · July 19, 2023, 8:51am

@panCognity You can request a free 30-day temporary license to test Aspose.Words on your side without the evaluation version limitations.

panCognity · July 19, 2023, 9:48am

In addition, I see a problem with the formatting of the pdf file when I insert it into the main document. There is the same problem with another pdf file I tried. Is it possible to try different formatting options?

alexey.noskov · July 19, 2023, 12:18pm

@panCognity Could you please attach the problematic PDF document here for testing? We will check the issue and provide you more information.
Please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When loading a PDF document, Aspose.Words converts the fixed-page document structure into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity.

panCognity · July 27, 2023, 8:25am

I apologize for taking so long to respond. There was a surgery I had to have. Anyway.

While appending a pdf (or word document) we are experiencing a problem with the page format. Is there a way to insert a document in landscape or portrait while maintaining the main document format?

alexey.noskov · July 27, 2023, 12:08pm

@panCognity Page setup, as well as headers/footers, in an MS Word document is defined per section. If you use Document.AppendDocument, whole sections from the source document are copied into the destination document, so each section keeps its own page setup. So I would suggest using the Document.AppendDocument method instead of DocumentBuilder.InsertDocument to merge the documents together.
Alternatively, you can insert a section break at the placeholder and copy the sections from the source document into the destination document relative to that section break.
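
As a rough illustration of the first suggestion, here is a minimal sketch of Document.AppendDocument. The file names are hypothetical, and note that AppendDocument adds the source sections at the end of the destination document, not at a placeholder:

```csharp
using Aspose.Words;

// Hypothetical file names, used only for illustration.
Document dstDoc = new Document(@"C:\Temp\main.docx");
Document srcDoc = new Document(@"C:\Temp\landscape_part.docx");

// Start the appended content on a new page.
srcDoc.FirstSection.PageSetup.SectionStart = SectionStart.NewPage;

// Whole sections are copied, so each one keeps its own page size and orientation.
dstDoc.AppendDocument(srcDoc, ImportFormatMode.KeepSourceFormatting);

dstDoc.Save(@"C:\Temp\merged.docx");
```

ImportFormatMode.KeepSourceFormatting is chosen here on the assumption that the appended document should look as it did originally; ImportFormatMode.UseDestinationStyles is the other common choice.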

panCognity · July 27, 2023, 12:12pm

Remember, I want to replace the searched word with a Word document. (The PDF documents will eventually be converted to Word documents before being inserted into the main document, so I won’t encounter any problems.) Then I will reset the page numbers.

alexey.noskov · July 27, 2023, 5:54pm

@panCognity Could you please attach the input documents you have problems with along with current and expected output documents? We will check the issue and provide you more information.

panCognity · July 28, 2023, 12:27pm

As I mentioned when I opened this ticket, my intention is to insert the appropriate document in the keywords found in SampleMain.pdf, which I have converted to SampleMain.docx. Thus, keywords 4097376683_Kinhsh, 4097362984_Kinhsh, 4097376683_Simvasi, 4097362984_Simvasi found in the SampleMain.docx will be replaced with the same name pdf file, such as 4097376683_Kinhsh.pdf, 4097362984_Kinhsh.pdf, 4097376683_Simvasi.pdf, 4097362984_Simvasi.pdf. (I have used your code above.) 4097376683_Kinhsh.pdf (582.3 KB)
4097362984_Kinhsh.pdf (582.4 KB)
4097376683_Simvasi.pdf (2.5 MB)
4097362984_Simvasi.pdf (2.5 MB)
SampleMain.pdf (1.2 MB)
SampleMain.docx (5.5 MB)
Then I want to restart page numbering into the SampleMain.docx.

alexey.noskov · July 28, 2023, 2:08pm

@panCognity I have modified the code to preserve page size and orientation while inserting the document and to make page numbers continuous in the resulting document:

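using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Replacing;

Document doc = new Document(@"C:\Temp\SampleMain.pdf");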

ReplaceEvaluatorFindAndReplaceWithDocument2 callback = new ReplaceEvaluatorFindAndReplaceWithDocument2();
FindReplaceOptions opt = new FindReplaceOptions(FindReplaceDirection.Backward, callback);

doc.Range.Replace(new Regex(@"\[(\d+_\w+)\]"), "", opt);

// Make page numbers continuous in the whole document.
foreach (Section s in doc.Sections)
    s.PageSetup.RestartPageNumbering = false;

doc.Save(@"C:\Temp\out.docx");

internal class ReplaceEvaluatorFindAndReplaceWithDocument2 : IReplacingCallback
{
    /// <summary>
    /// This method is called by the Aspose.Words find and replace engine for each match.
    /// </summary>
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        Document doc = (Document)e.MatchNode.Document;

        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and may be the only) run can contain text before the match, 
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // This array is used to store all nodes of the match for further deleting.
        List<Run> runs = new List<Run>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            remainingLength > 0 &&
            currentNode != null &&
            currentNode.GetText().Length <= remainingLength)
        {
            runs.Add((Run)currentNode);
            remainingLength -= currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            } while (currentNode != null && currentNode.NodeType != NodeType.Run);
        }

        // Split the last run that contains the match if there is any text left.
        if (currentNode != null && remainingLength > 0)
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add((Run)currentNode);
        }

        // Resolve the replacement document first; if the file does not exist,
        // leave the placeholder in place without inserting a section break.
        string path = $@"C:\Temp\{e.Match.Groups[1].Value}.pdf";
        if (!File.Exists(path))
            return ReplaceAction.Skip;

        // Create DocumentBuilder to insert the section break.
        DocumentBuilder builder = new DocumentBuilder(doc);
        // Move builder to the first run.
        builder.MoveTo(runs[0]);
        // Insert section break.
        builder.InsertBreak(BreakType.SectionBreakNewPage);
        // Put the document before the current DocumentBuilder section.

        Document replacementDocument = new Document(path);
        foreach (Section s in replacementDocument.Sections)
        {
            Section dstSection = (Section)doc.ImportNode(s, true, ImportFormatMode.UseDestinationStyles);
            builder.CurrentSection.ParentNode.InsertBefore(dstSection, builder.CurrentSection);
        }

        // Delete matched runs
        foreach (Run run in runs)
            run.Remove();

        // Remove empty paragraph if any.
        if(string.IsNullOrEmpty(builder.CurrentParagraph.ToString(SaveFormat.Text).Trim()))
        {
            builder.CurrentParagraph.Remove();
        }

        // Signal to the replace engine to do nothing because we have already done all what we wanted.
        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        run.ParentNode.InsertAfter(afterRun, run);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        return afterRun;
    }
}

panCognity · July 28, 2023, 4:54pm

Thank you for responding to me. In terms of page numbering, I’m fine with it. I am trying to figure out how the rest of the code works. My goal is the following: after converting SampleMain.pdf to SampleMain.docx, I want to replace all keywords (e.g. [4097376683_Kinhsh], [4097376683_Simvasi]) found in SampleMain.docx with the related PDF files (e.g. 4097376683_Kinhsh.pdf, 4097376683_Simvasi.pdf). Could you please tell me how I could do that?

alexey.noskov · July 28, 2023, 5:27pm

@panCognity The provided code does exactly what you need. The find/replace mechanism searches for keywords that match the provided regular expression:

new Regex(@"\[(\d+_\w+)\]")

i.e. placeholders like [11111111_sometext], and the IReplacingCallback implementation inserts the document named 11111111_sometext.pdf at the matched placeholder:

// Put the document before the current DocumentBuilder section.
string path = $@"C:\Temp\{e.Match.Groups[1].Value}.pdf";
if (!File.Exists(path))
    return ReplaceAction.Skip;

Document replacementDocument = new Document(path);
foreach (Section s in replacementDocument.Sections)
{
    Section dstSection = (Section)doc.ImportNode(s, true, ImportFormatMode.UseDestinationStyles);
    builder.CurrentSection.ParentNode.InsertBefore(dstSection, builder.CurrentSection);
}

If there is no document with the specified name, the placeholder is skipped.
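
To see which strings the regular expression above accepts, the following small sketch may help. It uses only System.Text.RegularExpressions and does not touch Aspose.Words; the class name PlaceholderDemo and the method ExtractName are hypothetical names introduced for illustration:

```csharp
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Same pattern as in the thread: a literal '[', one or more digits,
    // an underscore, one or more word characters, and a literal ']'.
    private static readonly Regex PlaceholderRegex = new Regex(@"\[(\d+_\w+)\]");

    // Returns the captured name (group 1) of the first placeholder in the text,
    // or null when the text contains no placeholder.
    public static string ExtractName(string text)
    {
        Match match = PlaceholderRegex.Match(text);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

With this helper, ExtractName("[4097376683_Kinhsh]") yields 4097376683_Kinhsh, while a bare keyword without square brackets, such as 4097376683_Kinhsh, yields null. So the keywords in the main document need the surrounding brackets for the callback to be invoked.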

panCognity · July 28, 2023, 8:27pm

You are absolutely right! Earlier, you mentioned these points, though the results weren’t quite right. The code works fine now that I have rewritten it! Thank you so much! I really appreciate your help!

panCognity · July 28, 2023, 8:37pm

I cannot get the right result if I use the actual files rather than the samples I sent you. Please find attached the actual files and the result file, named Sample without Bookmarks_Out.docx:
4097362984_Kinhsh.pdf (862.4 KB)
4097362984_Simvasi.pdf (58.1 KB)
4097362984_Kinhsh.pdf (862.4 KB)
4097376683_Simvasi.pdf (57.9 KB)
Sample without Bookmarks.pdf (242.8 KB)
Sample without Bookmarks_Out.docx (166.7 KB)

alexey.noskov · July 29, 2023, 4:15am

@panCognity Could you please describe the problem in more detail? Unfortunately, it is not quite clear what your expected result is. If possible, please provide screenshots of the problems and specify on which page each problem is observed.

Also, please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When loading a PDF document, Aspose.Words converts the fixed-page document structure into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity.

panCognity · July 29, 2023, 7:17am

A shrunken version of the input document is displayed. Additionally, its image appears on all pages. The inserted PDF document does not show the letters; the text is unreadable. Attached are some pictures showing the problem.

Though, converting all pdf files (including main document) into docx seems to work. The result is fine! I attach it as Sample without Bookmarks_Out.docx.
Screenshot 2023-07-29 101203.png (14.6 KB)
Screenshot 2023-07-29 101130.png (36.9 KB)
Screenshot 2023-07-29 101059.png (7.8 KB)
Sample without Bookmarks_Out.docx (7.0 MB)

alexey.noskov · July 29, 2023, 2:57pm

@panCognity The problem occurs when reading the PDF document into the Aspose.Words DOM.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-25731,WORDSNET-25732

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

panCognity · July 31, 2023, 3:29pm

I look forward to hearing back from you. In the meantime, I have forwarded my email once again to my client, so that he will purchase the licenses necessary to continue developing their .NET application.