When I find text that needs to be replaced in my main document, how can I insert a PDF document in its place?

panCognity · July 27, 2023, 8:25am

I apologize for taking so long to respond. There was a surgery I had to have. Anyway.

When appending a PDF (or Word document), we are running into a problem with the page format. Is there a way to insert a document in landscape or portrait orientation while keeping the main document's format?

alexey.noskov · July 27, 2023, 12:08pm

@panCognity Page setup, as well as headers/footers, in an MS Word document are defined per section. If you use Document.AppendDocument, whole sections from the source document are copied into the destination document. So I would suggest using the Document.AppendDocument method instead of DocumentBuilder.InsertDocument to merge documents together.
Alternatively, you can insert a section break at the placeholder and copy the sections from the source document into the destination document after the section break.
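
For illustration, the first suggestion could look roughly like the sketch below. The file names are placeholders, not files from this thread, and the choice of ImportFormatMode.KeepSourceFormatting is an assumption; adjust both to your case.

```csharp
using Aspose.Words;

// Hypothetical file names, used only for this sketch.
Document dst = new Document(@"C:\Temp\Main.docx");
Document src = new Document(@"C:\Temp\Landscape.docx");

// Start the appended content on a new page so the source section's
// own page setup (size, orientation) applies from that page on.
src.FirstSection.PageSetup.SectionStart = SectionStart.NewPage;

// AppendDocument copies whole sections, so page setup and
// headers/footers come along with the source sections.
dst.AppendDocument(src, ImportFormatMode.KeepSourceFormatting);

dst.Save(@"C:\Temp\Merged.docx");
```

AppendDocument always adds the sections at the end of the destination, so it does not by itself place content at a keyword in the middle of the document; that case needs the section-break approach.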

panCognity · July 27, 2023, 12:12pm

Remember, I want to replace the searched word with a Word document. (The PDF documents will eventually be converted to Word documents before being inserted into the main document, so I won’t encounter any problems.) Then I will reset the page numbers.

alexey.noskov · July 27, 2023, 5:54pm

@panCognity Could you please attach the input documents you have problems with along with current and expected output documents? We will check the issue and provide you more information.

panCognity · July 28, 2023, 12:27pm

As I mentioned when I opened this ticket, my intention is to insert the appropriate document at each keyword found in SampleMain.pdf, which I have converted to SampleMain.docx. Thus, the keywords 4097376683_Kinhsh, 4097362984_Kinhsh, 4097376683_Simvasi and 4097362984_Simvasi found in SampleMain.docx will be replaced with the PDF file of the same name, i.e. 4097376683_Kinhsh.pdf, 4097362984_Kinhsh.pdf, 4097376683_Simvasi.pdf and 4097362984_Simvasi.pdf. (I have used your code above.)
4097376683_Kinhsh.pdf (582.3 KB)
4097362984_Kinhsh.pdf (582.4 KB)
4097376683_Simvasi.pdf (2.5 MB)
4097362984_Simvasi.pdf (2.5 MB)
SampleMain.pdf (1.2 MB)
SampleMain.docx (5.5 MB)
Then I want to restart page numbering in SampleMain.docx.

alexey.noskov · July 28, 2023, 2:08pm

@panCognity I have modified the code to preserve page size and orientation while inserting the document and to make page numbering continuous in the resulting document:

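using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Replacing;

Document doc = new Document(@"C:\Temp\SampleMain.pdf");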

ReplaceEvaluatorFindAndReplaceWithDocument2 callback = new ReplaceEvaluatorFindAndReplaceWithDocument2();
FindReplaceOptions opt = new FindReplaceOptions(FindReplaceDirection.Backward, callback);

doc.Range.Replace(new Regex(@"\[(\d+_\w+)\]"), "", opt);

// Make page numbering continuous across the whole document.
foreach (Section s in doc.Sections)
    s.PageSetup.RestartPageNumbering = false;

doc.Save(@"C:\Temp\out.docx");

internal class ReplaceEvaluatorFindAndReplaceWithDocument2 : IReplacingCallback
{
    /// <summary>
    /// This method is called by the Aspose.Words find and replace engine for each match.
    /// </summary>
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        Document doc = (Document)e.MatchNode.Document;

        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and possibly the only) run can contain text before the match;
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // This list stores all runs of the match so they can be deleted later.
        List<Run> runs = new List<Run>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            remainingLength > 0 &&
            currentNode != null &&
            currentNode.GetText().Length <= remainingLength)
        {
            runs.Add((Run)currentNode);
            remainingLength -= currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            } while (currentNode != null && currentNode.NodeType != NodeType.Run);
        }

        // Split the last run that contains the match if there is any text left.
        if (currentNode != null && remainingLength > 0)
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add((Run)currentNode);
        }

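        // Resolve the replacement document first, so that a missing file
        // leaves the placeholder in place without a stray section break.
        string path = $@"C:\Temp\{e.Match.Groups[1].Value}.pdf";
        if (!File.Exists(path))
            return ReplaceAction.Skip;

        // Create a DocumentBuilder to insert the section break.
        DocumentBuilder builder = new DocumentBuilder(doc);
        // Move the builder to the first run of the match.
        builder.MoveTo(runs[0]);
        // Insert a section break so the imported sections keep their own page setup.
        builder.InsertBreak(BreakType.SectionBreakNewPage);

        // Put the replacement document before the current DocumentBuilder section.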

        Document replacementDocument = new Document(path);
        foreach (Section s in replacementDocument.Sections)
        {
            Section dstSection = (Section)doc.ImportNode(s, true, ImportFormatMode.UseDestinationStyles);
            builder.CurrentSection.ParentNode.InsertBefore(dstSection, builder.CurrentSection);
        }

        // Delete matched runs
        foreach (Run run in runs)
            run.Remove();

        // Remove empty paragraph if any.
        if (string.IsNullOrEmpty(builder.CurrentParagraph.ToString(SaveFormat.Text).Trim()))
        {
            builder.CurrentParagraph.Remove();
        }

        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        run.ParentNode.InsertAfter(afterRun, run);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        return afterRun;
    }
}

panCognity · July 28, 2023, 4:54pm

Thank you for responding to me. In terms of page numbering, I’m fine with it. I am still trying to figure out how the rest of the code works. My goal is to do the following: after converting SampleMain.pdf to SampleMain.docx, I want to replace all keywords (e.g. [4097376683_Kinhsh], [4097376683_Simvasi]) found in SampleMain.docx with the related PDF files (e.g. 4097376683_Kinhsh.pdf, 4097376683_Simvasi.pdf). Could you please tell me how I could do that?

alexey.noskov · July 28, 2023, 5:27pm

@panCognity The provided code does exactly what you need. The find/replace mechanism searches for keywords that match the provided regular expression:

new Regex(@"\[(\d+_\w+)\]")

i.e. placeholders like [11111111_sometext]. The IReplacingCallback implementation then inserts the document named 11111111_sometext.pdf at the matched placeholder:

// Put the document before the current DocumentBuilder section.
string path = $@"C:\Temp\{e.Match.Groups[1].Value}.pdf";
if (!File.Exists(path))
    return ReplaceAction.Skip;

Document replacementDocument = new Document(path);
foreach (Section s in replacementDocument.Sections)
{
    Section dstSection = (Section)doc.ImportNode(s, true, ImportFormatMode.UseDestinationStyles);
    builder.CurrentSection.ParentNode.InsertBefore(dstSection, builder.CurrentSection);
}

If there is no document with the specified name, the placeholder is skipped.
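
To see which keywords the pattern picks up, here is a small standalone check that uses only System.Text.RegularExpressions and no Aspose.Words calls. The class and method names (PlaceholderDemo, FileNameFor) are made up for this sketch; only the pattern comes from the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Same pattern as in the find/replace call above.
    private static readonly Regex Placeholder = new Regex(@"\[(\d+_\w+)\]");

    // Returns the file name the callback would look for,
    // or null when the text contains no bracketed placeholder.
    public static string FileNameFor(string text)
    {
        Match m = Placeholder.Match(text);
        return m.Success ? m.Groups[1].Value + ".pdf" : null;
    }

    public static void Main()
    {
        // Bracketed keyword: matches, group 1 is the name without brackets.
        Console.WriteLine(FileNameFor("See [4097376683_Kinhsh] here"));
        // prints "4097376683_Kinhsh.pdf"

        // Same keyword without brackets: no match.
        Console.WriteLine(FileNameFor("4097376683_Kinhsh") ?? "(no match)");
        // prints "(no match)"
    }
}
```

Note that the keyword must be wrapped in square brackets in the main document; a bare keyword such as 4097376683_Kinhsh is not matched by this pattern.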

panCognity · July 28, 2023, 8:27pm

You are absolutely right! Earlier, you mentioned these points, though the results weren’t quite right. The code works fine now that I have rewritten it! Thank you so much! I really appreciate your help!

panCognity · July 28, 2023, 8:37pm

I cannot get the right result if I use the actual files rather than the samples I sent you. Please find attached the actual files and the result file, named Sample without Bookmarks_Out.docx:
4097362984_Kinhsh.pdf (862.4 KB)
4097362984_Simvasi.pdf (58.1 KB)
4097362984_Kinhsh.pdf (862.4 KB)
4097376683_Simvasi.pdf (57.9 KB)
Sample without Bookmarks.pdf (242.8 KB)
Sample without Bookmarks_Out.docx (166.7 KB)

alexey.noskov · July 29, 2023, 4:15am

@panCognity Could you please describe the problem in more detail? Unfortunately, it is not quite clear what your expected result is. If possible, please provide screenshots of the problems and specify on which page each problem is observed.

Also, please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. While loading a PDF document, Aspose.Words converts the fixed-page document structure into the flow Document Object Model. Unfortunately, such conversion does not guarantee 100% fidelity.
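
As a rough illustration, converting a PDF to DOCX up front makes it possible to inspect the converted result in MS Word before using it in find/replace. The paths below are placeholders chosen for this sketch.

```csharp
using Aspose.Words;

// Hypothetical paths, used only for this sketch.
// Loading the PDF converts its fixed-page layout into the flow document model.
Document pdfAsFlow = new Document(@"C:\Temp\Input.pdf");

// Saving with a .docx extension writes the converted flow document as DOCX.
pdfAsFlow.Save(@"C:\Temp\Input.docx");
```

Any fidelity loss from the conversion will already be visible in the saved DOCX.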

panCognity · July 29, 2023, 7:17am

A shrunken version of the input document is displayed. Additionally, its image appears on all pages. The text of an inserted PDF document is not shown as readable letters; it comes out as gibberish. Attached are some pictures showing the problem.

However, converting all PDF files (including the main document) to DOCX first seems to work. The result is fine! I attach it as Sample without Bookmarks_Out.docx.
Screenshot 2023-07-29 101203.png (14.6 KB)
Screenshot 2023-07-29 101130.png (36.9 KB)
Screenshot 2023-07-29 101059.png (7.8 KB)
Sample without Bookmarks_Out.docx (7.0 MB)

alexey.noskov · July 29, 2023, 2:57pm

@panCognity The problem occurs when reading the PDF document into the Aspose.Words DOM.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-25731,WORDSNET-25732

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

panCognity · July 31, 2023, 3:29pm

I look forward to hearing back from you. In the meantime, I have forwarded my email once again to my client, so that he will purchase the licenses necessary to continue developing their .NET application.

alexey.noskov · July 31, 2023, 5:14pm

@panCognity Sure, we will keep you updated and let you know once the issues are resolved or we have more information for you.

aspose.notifier · September 5, 2023, 11:17am

The issues you have found earlier (filed as WORDSNET-25732) have been fixed in this Aspose.Words for .NET 23.9 update also available on NuGet.

panCognity · October 16, 2023, 2:25pm

Thank you! Sorry for the late reply. I had an accident. Anyway, I will try it out and get back to you if I have any questions.

panCognity · October 16, 2023, 2:57pm

The problem still remains. Do you want me to open a different ticket for the issue?

alexey.noskov · October 16, 2023, 3:03pm

@panCognity Two issues were reported in this topic:

WORDSNET-25732 - Font size is changed after converting PDF to DOCX
WORDSNET-25731 - Content is damaged after converting PDF to DOCX.

WORDSNET-25732 has been resolved. WORDSNET-25731 is not resolved yet.

panCognity · October 16, 2023, 3:28pm

OK. I will wait for it to be resolved. Thank you for your answer.