Find Keyword Tokens in PDF & Word DOCX Documents & Replace with Formatted HTML String using VB.NET

hemant_thote · February 12, 2021, 5:54am

Hi,

I am using Aspose to write PDF/Word documents based on token replacement. My document contains a token, <<Clauses>>, that I want to replace with an HTML string. The HTML string contains text-formatting tags, and I want that formatting to be preserved when the string is written into the PDF/Word document.

Can you please help me by providing sample code?

Sample HTML string:

"<p>Test Clause. <font face="Courier New">Another clause.<span style="font-weight: normal;">&nbsp;Test clause.</span></font></p><hr><p><font face="Courier New">Test CLAUSE/<span style="font-weight: normal;">CLAUSE.</span></font></p><hr><p><font face="Courier New"><span style="font-weight: normal;">Added clause, </span>made it bold.</font></p>"

Language: VB.NET

Please find the attached documents for reference.
image.png (36.5 KB)
VA Deal counter proposal YDM Word Merge All Tokens 2021-02-12_11-05-48.pdf (48.9 KB)

awais.hafeez · February 12, 2021, 1:07pm

@hemant_thote,

Regarding Aspose.Words for .NET, you can use the following code to replace a keyword/token in an MS Word document with an HTML string:

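using System;
using System.Collections;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Replacing;

string htmlToReplaceWith = "<b><font color='red'>James Bond, </font></b>";

// Build a small test document that contains the token.
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.Writeln("Hello <CustomerName>,");

// Search from the end of the document toward the beginning.
FindReplaceOptions options = new FindReplaceOptions(FindReplaceDirection.Backward);
options.ReplacingCallback = new ReplaceWithHtmlEvaluator(htmlToReplaceWith);

// The callback performs the actual replacement, so the replacement string is empty.
doc.Range.Replace(new Regex(@"<CustomerName>,"), String.Empty, options);

doc.Save(@"C:\Temp\21.2.docx");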

private class ReplaceWithHtmlEvaluator : IReplacingCallback
{
    private string _html = "";

    internal ReplaceWithHtmlEvaluator(string html)
    {
        _html = html;
    }

    //This simplistic method will only work well when the match starts at the beginning of a run. 
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and possibly the only) run can contain text before the match;
        // in that case the run must be split.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // This list stores all nodes of the match so that they can be removed later.
        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }

        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);
        builder.MoveTo((Run)runs[0]);

        // Insert the HTML at the first run of the match.
        builder.InsertHtml(_html);

        foreach (Run run in runs)
            run.Remove();

        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}
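
The sample above matches the token `<CustomerName>,`, while the original question uses `<<Clauses>>`. A minimal sketch of building the search pattern for an arbitrary token follows. It assumes the token should be matched literally, so it escapes any regex metacharacters the token might contain (brackets, braces, dots). The names `TokenPattern` and `ForToken` are hypothetical helpers for illustration, not part of Aspose.Words.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenPattern
{
    // Hypothetical helper: builds a regex that matches the token text literally.
    // Regex.Escape neutralizes characters such as [ ] { } . that would otherwise
    // be interpreted as regex syntax.
    public static Regex ForToken(string token)
    {
        if (string.IsNullOrEmpty(token))
            throw new ArgumentException("Token must not be empty.", nameof(token));

        return new Regex(Regex.Escape(token));
    }
}
```

The resulting `Regex` could then be passed to `doc.Range.Replace` in place of `new Regex(@"<CustomerName>,")`, for example `doc.Range.Replace(TokenPattern.ForToken("<<Clauses>>"), String.Empty, options);`.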

Regarding Aspose.PDF for .NET, we are investigating the scenario of replacing tokens in a PDF file with an HTML string and will get back to you soon.

hemant_thote · February 15, 2021, 4:23am

Hi,

Thanks a lot.

Please share the code for PDF as soon as possible, because PDF is our primary focus here.

Thanks,
Hemant

asad.ali · February 15, 2021, 7:35pm

@hemant_thote

HTML can be added to a PDF using the HtmlFragment class in Aspose.PDF. The class does not offer a property to specify where on the page the HTML should render. However, you can add a floating box at the desired location and then add the HTML string inside it. Please check the sample code below, which searches for the HTML text in the PDF using a regular expression and places a floating box containing the rendered HTML at the location of each match:

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to hold the path of the folder that contains the PDF.
Document doc = new Document(dataDir + "VA Deal counter proposal YDM Word Merge All Tokens 2021-02-12_11-05-48.pdf");

// Regular expression that matches the raw HTML text as it appears in the PDF.
string searchPhrase = @"<p>(.*\s+)style(.*\s+)clause(.*\s+)CLAUSE(.*\s+)normal(.*\s+)New(.*\s+)bold.(.*)";
TextFragmentAbsorber absorber = new TextFragmentAbsorber(searchPhrase, new TextSearchOptions(true));
doc.Pages.Accept(absorber);
var fragmentCount = absorber.TextFragments.Count;

foreach (TextFragment tf in absorber.TextFragments)
{
    // Position a floating box over the rectangle of the matched text.
    Rectangle rect = tf.Rectangle;
    FloatingBox box = new FloatingBox();
    box.Left = rect.LLX - tf.Page.PageInfo.Margin.Left;
    box.Top = tf.Page.GetPageRect(true).Height - rect.URY - tf.Page.PageInfo.Margin.Top;
    box.Width = rect.Width;

    string htmlString = @"<p>Test Clause. <font face='Courier New'>Another clause.<span style='font-weight: normal;'>&nbsp;Test clause.</span></font></p><hr><p><font face='Courier New'>Test CLAUSE/<span style='font-weight: normal;'>CLAUSE.</span></font></p><hr><p><font face='Courier New'><span style='font-weight: normal;'>Added clause, </span>made it bold.</font></p>";
    HtmlFragment htmlFragment = new HtmlFragment(htmlString);
    box.Paragraphs.Add(htmlFragment);

    // Add the box to the page and clear the original text.
    tf.Page.Paragraphs.Add(box);
    tf.Text = String.Empty;
}

doc.Save(dataDir + "output.pdf");
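
The position arithmetic in the snippet above can be isolated as a small helper. The sketch below assumes, as that snippet does, that the fragment rectangle is expressed in page coordinates with the origin at the bottom-left corner, and that the box offsets are measured from the top-left corner of the area inside the page margins. `BoxPlacement` and `ToBoxOffsets` are hypothetical names used only for illustration; they are not part of Aspose.PDF.

```csharp
using System;

public static class BoxPlacement
{
    // Hypothetical helper: converts the lower-left X (llx) and upper-right Y (ury)
    // of a text rectangle into Left/Top offsets for a box.
    // Assumption: rectangle coordinates start at the bottom-left of the page,
    // while the offsets are measured from the top-left corner inside the margins.
    public static (double Left, double Top) ToBoxOffsets(
        double llx, double ury, double pageHeight, double marginLeft, double marginTop)
    {
        double left = llx - marginLeft;
        double top = pageHeight - ury - marginTop;
        return (left, top);
    }
}
```

For example, on a page 842 points high with a 90-point left margin and a 72-point top margin, a fragment with LLX = 100 and URY = 700 gives `ToBoxOffsets(100, 700, 842, 90, 72)`, which yields Left = 10 and Top = 70.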