How can I replace a placeholder in a PDF document with a Word document?

Hi. I have a PDF document named “ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf” and I would like to replace the placeholder [ΠΛΗΓ_ΑΒ] with the contents of the Word document “ΠΛΗΓ_ΑΒ.docx”. The placeholder sits in the middle of the text on a page of “ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf”, as you can see from the attached document. So far I have tried converting “ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf” to a Word document and inserting “ΠΛΗΓ_ΑΒ.docx” in place of the placeholder [ΠΛΗΓ_ΑΒ], but I still get the wrong results. How can I achieve this?

ΠΛΗΓ_ΑΒ.docx (58.1 KB)

ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf (412.4 KB)

@panCognity Do you use Aspose.Words or Aspose.PDF to convert the PDF document to Word?
Generally, you can use the approach suggested in your other thread to replace a placeholder with a document using Aspose.Words:
https://forum.aspose.com/t/whenever-i-discover-text-that-needs-to-be-replaced-in-my-main-document-how-could-i-insert-a-pdf-document-here/268521/12?u=alexey.noskov

I converted the PDF to Word. Then I used exactly that code to replace the placeholders, since I already knew it from that ticket, but it does not do what it is supposed to do when the placeholder is in the middle of the text.

@panCognity Unfortunately, your problem is not clear enough. Could you please create small, simplified documents and elaborate on the problem you have encountered?

Here is the code.

ConsoleApp9.zip (214.3 KB)

Here are input files.

Input.zip (498.7 KB)

I cannot keep the formatting of “ΠΛΗΓ_ΑΒ.docx”, or even the formatting of the destination document, “ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.docx”.

@panCognity Thank you for the test project. Unfortunately, it does not explain what the actual problem is, which is why I asked for a smaller test document that shows exactly what does not work.
As far as I can see, the placeholder is properly replaced with the document. Here is the output produced on my side:
out.pdf (420.2 KB)

But since the documents are large and the final document is 110 pages, it is hard to understand what the actual problem is.

out.pdf (428.2 KB)
ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf (421.2 KB)

I have drawn a yellow rectangle and highlight in your out.pdf to show where the original text (the text that contained the placeholder [ΠΛΗΓ_ΑΒ]) was placed. It starts at that point and continues from page 22 onward. I have also drawn a green rectangle and highlight to show where “ΠΛΗΓ_ΑΒ.docx” was positioned after the replacement (the green highlight starts at that point and runs until page 22; those are the pages of “ΠΛΗΓ_ΑΒ.docx”). If you compare this with the input file “ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf” or “ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.docx”, you can see what I mean. (Because the text is in Greek, I am also uploading “ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf” with a yellow highlight showing where the text is in the input file and a green highlight showing the placeholder.) As you can see from the result, “ΠΛΗΓ_ΑΒ.docx” was not inserted with its original formatting (take a look at “ΠΛΗΓ_ΑΒ.docx” on its own); the formatting is different. So my problem is the formatting of the inserted document. I would like it placed inside the main document formatted as it was originally.

After compression and after replacing all the other placeholders, the actual output file (PDF and DOCX) is 411 pages and 24 MB, which I cannot upload here. Remember, these are a bank’s lawsuit documents. That is why I asked only about this particular placeholder and tried to upload just a small part that shows the problem I have with it.

@panCognity First of all, note that it is technically difficult, and sometimes even impossible, to preserve the original PDF document layout after converting it to a flow DOCX document. The same applies to the inserted document: because MS Word documents are flow documents, the whole document is reflowed after content is inserted, and this affects the layout of the whole document.
Unfortunately, I do not see any formatting issues with the inserted document. When a document is inserted using the DocumentBuilder.InsertDocument method, the page setup of the target section is used if the target section is not empty, so the content is reflowed because of the different page setup. The content after the placeholder is moved to the 22nd page, which is expected, again due to the flow nature of MS Word documents.
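To check whether page setup is what differs between the two documents, a minimal sketch along these lines could print the page size and margins of each. The file names are placeholders, and inspecting only the first section of each document is an assumption; documents with several sections may need a loop over all of them.

```csharp
using System;
using Aspose.Words;

class PageSetupCheck
{
    // Prints the page size and margins (in points) of one section's page setup.
    static void Print(string label, PageSetup ps)
    {
        Console.WriteLine(
            $"{label}: page {ps.PageWidth} x {ps.PageHeight}, " +
            $"margins L={ps.LeftMargin} R={ps.RightMargin} " +
            $"T={ps.TopMargin} B={ps.BottomMargin}");
    }

    static void Main()
    {
        // Hypothetical paths: the converted main document and the document to insert.
        Document target = new Document("target.docx");
        Document source = new Document("source.docx");

        // Assumption: the first section is representative of each document.
        Print("target", target.FirstSection.PageSetup);
        Print("source", source.FirstSection.PageSetup);
    }
}
```

If the two lines of output differ, the inserted content will be laid out with the target section's values and will reflow accordingly.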

I am OK with the content. What bothers me and my client is the formatting, or rather the layout (maybe that is the correct word). The margins are not the same as they are in “ΠΛΗΓ_ΑΒ.docx”. Do I have to apply a page setup to preserve the document’s original layout? What do I have to do to make it right? We go live with this one too on Monday. Today is my last day, and I need a solution for UAT in order to go live on Monday.

Also, I am uploading a small part of my final input DOCX, in which I try to replace the placeholder [ΠΛΗΓ_ΑΒ]. If you run the code, you will see the result. The layout is much worse in the output PDF that this code produces. How can I solve this?
ΠΛΗΓ_ΑΒ_ΠΔ_1_0172 1.docx (5.0 MB)

I didn’t upload it before because it is a large file. I managed to crop some pages, so there is now a smaller DOCX file that I can upload here.

@panCognity To preserve the original page setup of the inserted document, it is required to copy entire sections from the source document. For example, see the following code:

// Fragment of the IReplacingCallback.Replacing method: `e` is the ReplacingArgs,
// `runs` holds the runs that make up the placeholder, `doc` is the document to insert.
DocumentBuilder builder = new DocumentBuilder((Aspose.Words.Document)e.MatchNode.Document);
builder.MoveTo((Run)runs[0]);
// Two continuous section breaks isolate an empty section at the placeholder position.
builder.InsertBreak(BreakType.SectionBreakContinuous);
builder.InsertBreak(BreakType.SectionBreakContinuous);
builder.MoveTo(((Section)builder.CurrentSection.PreviousSibling).Body.FirstParagraph);

ImportFormatOptions formatOptions = new ImportFormatOptions();
formatOptions.SmartStyleBehavior = true;

NodeImporter importer = new NodeImporter(doc, builder.Document, ImportFormatMode.UseDestinationStyles, formatOptions);
doc.FirstSection.PageSetup.SectionStart = SectionStart.Continuous;
// Copy whole sections so that each one keeps its own page setup.
foreach (Section s in doc.Sections)
    builder.CurrentSection.ParentNode.InsertBefore(importer.ImportNode(s, true), builder.CurrentSection);

But in this case the content of the inserted document starts on a new page: the sections have different page setups, and content with different page setups cannot be placed on the same page, so the content is moved to the next page.

Could you please be more specific? The documents are huge and I do not see layout issues in the produced output. It would be much easier if you specified at least the page number with the layout problem. Here is the output produced by the code with the above modifications and your new input document:
out.pdf (3.8 MB)

The content of the inserted document should start on the page where the placeholder is. If page setup is the problem, how can we solve it?

What I see in the output PDF you sent me is that the content of the inserted document starts on page 2, but I would like it to be under the phrase "ΕΝΩΠΙΟΝ ΤΟΥ ΕΙΡΗΝΟΔΙΚΕΙΟΥ ΑΘΗΝΩΝ" on page 1. Also, the same content from “ΠΛΗΓ_ΑΒ.docx” is shown again from page 33 until page 53. Then it continues with the content of the document ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.docx. The preferred result (output file) would have the content of “ΠΛΗΓ_ΑΒ.docx” inserted under "ΕΝΩΠΙΟΝ ΤΟΥ ΕΙΡΗΝΟΔΙΚΕΙΟΥ ΑΘΗΝΩΝ" on page 1, followed by the content as it was in ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.docx.

I have highlighted the problematic sections in your output PDF (I am uploading it as output_2.pdf). The original content of ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.docx is in yellow and the inserted document “ΠΛΗΓ_ΑΒ.docx” is in green. To show you the duplicate of the inserted document “ΠΛΗΓ_ΑΒ.docx”, I have highlighted it again in light green. At that point, the inserted document does not have its layout. But I don’t need it twice. I need it only once, with its layout.
output_2.pdf (4.1 MB)

@panCognity The document you have attached has 75 pages. The output I attached has 55 pages, and I do not see any content duplication in it. Also, in my output the continuation of the target document is on the 33rd page and starts on a new page.

I have already explained why the page break appears here. It is due to the difference in page setup between the source and target documents. Changing the page setup of one of them might lead to the layout differences you would like to avoid.

Also, the input document ΠΛΗΓ_ΑΒ_ΠΔ_1_0172 1.docx you have attached has absolutely positioned elements (frames). It is impossible to properly reflow such documents when new content is inserted: the floating content will overlap the inserted flow content. Putting the content on separate pages avoids these overlapping issues.

Earlier I uploaded a small part of my actual input DOCX file (after converting the PDF to DOCX). Have you tried running the code with this one to see the output? If you try, you will see the same result I described in my previous comment. Please tell me if you see the same output.
ΠΛΗΓ_ΑΒ_ΠΔ_1_0172 1.docx (5.0 MB)

P.S.: I would also like to ask whether there is another workaround. What I am thinking is this: if I do not convert the main PDF file, ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf, and instead convert “ΠΛΗΓ_ΑΒ.docx” to PDF, how could I insert this PDF (ΠΛΗΓ_ΑΒ.pdf) into my main PDF file (ΠΛΗΓ_ΑΒ_ΠΔ_1_0172.pdf), replacing the placeholder [ΠΛΗΓ_ΑΒ]?

@panCognity Here is the output produced on my side using the above attached input document:
out.pdf (3.8 MB)

I do not think that this will work with PDF documents. A PDF document is a fixed-page document, i.e. all elements have an absolute position on the page. So with a PDF document, you will not be able to replace the placeholder and reflow the rest of the document’s content. You can insert pages into a PDF document, but it is better to consult the Aspose.PDF team regarding such an approach.

Can you send me the code that does this again? I don’t get the same result as the one you just gave me. Also, please send me the input files you used.

@panCognity Sure, here is the implementation of IReplacingCallback used on my side:

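// Namespaces used: System.Collections, Aspose.Words, Aspose.Words.Replacing.
private class FindAndReplace : IReplacingCallback
{
    // The document whose sections are inserted in place of the placeholder.
    public Aspose.Words.Document doc = null;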
    public FindAndReplace(Aspose.Words.Document _doc)
    {
        doc = _doc;
    }

    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // This array is used to store all nodes of the match for further removing.
        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }

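        // Isolate the placeholder position between section breaks; the content
        // that follows the placeholder starts on a new page.
        DocumentBuilder builder = new DocumentBuilder((Aspose.Words.Document)e.MatchNode.Document);
        builder.MoveTo((Run)runs[0]);
        builder.InsertBreak(BreakType.SectionBreakContinuous);
        builder.InsertBreak(BreakType.SectionBreakNewPage);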

        ImportFormatOptions formatOptions = new ImportFormatOptions();
        formatOptions.SmartStyleBehavior = true;

        NodeImporter importer = new NodeImporter(doc, builder.Document, ImportFormatMode.UseDestinationStyles, formatOptions);
        doc.FirstSection.PageSetup.SectionStart = SectionStart.Continuous;
        foreach (Section s in doc.Sections)
            builder.CurrentSection.ParentNode.InsertBefore(importer.ImportNode(s, true), builder.CurrentSection);

        foreach (Run run in runs)
            run.Remove();

        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}
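For completeness, a callback like this is typically wired up through FindReplaceOptions. The following is only a sketch of how that might look: the method name and the file paths are hypothetical, and the method is assumed to live in the same class that contains the private FindAndReplace class above.

```csharp
using Aspose.Words;
using Aspose.Words.Replacing;

// Hypothetical helper; place it in the class that declares FindAndReplace.
static void ReplacePlaceholder(string mainPath, string insertPath, string outPath)
{
    // The converted main document and the document to insert (paths are assumptions).
    Aspose.Words.Document mainDoc = new Aspose.Words.Document(mainPath);
    Aspose.Words.Document insertDoc = new Aspose.Words.Document(insertPath);

    // Hand the callback the document that should replace the placeholder.
    FindReplaceOptions options = new FindReplaceOptions();
    options.ReplacingCallback = new FindAndReplace(insertDoc);

    // The callback removes the placeholder runs itself and returns
    // ReplaceAction.Skip, so the replacement string is not used.
    mainDoc.Range.Replace("[ΠΛΗΓ_ΑΒ]", "", options);

    mainDoc.Save(outPath);
}
```

A call such as ReplacePlaceholder("main.docx", "insert.docx", "out.pdf") would then produce the output file; the output format follows the file extension.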

Now I get the result you sent me in your out.pdf. You made some changes to the code, as I can see. OK, thanks! However, I still have to find a way to have it all on page 1 after the phrase “ΕΝΩΠΙΟΝ ΤΟΥ ΕΙΡΗΝΟΔΙΚΕΙΟΥ ΑΘΗΝΩΝ”, which is not the same in every document. You see, lawyers want everything in the document to be continuous.

@panCognity I am afraid this is technically impossible with the ΠΛΗΓ_ΑΒ_ΠΔ_1_0172 1.docx template you have attached, as I have already explained above. You need a flow template. In this case, however, you are using Aspose.PDF to convert the PDF to DOCX with a fixed document layout, which is not suitable for editing, since all elements have absolute positions on the page.

I see your point. Totally understand that.

Regarding my client’s desired result, I have a question. If I had used ΠΛΗΓ_ΑΒ_ΠΔ_1_0172 1.pdf instead of ΠΛΗΓ_ΑΒ_ΠΔ_1_0172 1.docx, I would still have to convert the PDF to Word to be able to replace the placeholder, right?