Adding text paragraph overlaps previous text if the new paragraph text is too long

calphius · May 4, 2021, 9:25pm

I’m writing a piece of code that replaces a specific set of texts with dynamic text loaded from elsewhere. Everything works correctly except that if the new text is more than two lines long, it overlaps the content above it. It looks as if the text paragraph is being rendered up the page rather than down, so that the last line of the new text lands at the same position as the old text, and the new text does not push the other content out of the way to make room.

This is the code I have written so far:

// Match bracketed tokens such as [field:...]; the literal brackets must be escaped in the pattern.
var txtAbsorber = new TextFragmentAbsorber(@"\[([A-Z|a-z|/|~].*?)\]");
txtAbsorber.TextSearchOptions = new TextSearchOptions(true); // treat the search phrase as a regular expression
txtAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;

PDFDocument.Pages.Accept(txtAbsorber);

foreach (TextFragment fragment in txtAbsorber.TextFragments)
{
    // Resolve the token to its replacement value, alignment, and optional font size.
    var processedToken = ProcessPDFToken(fragment.Text, data);
    TextFragment newTextFrag = new TextFragment(processedToken.Value);
    if (processedToken.FontSize > 0)
        newTextFrag.TextState.FontSize = processedToken.FontSize;

    // Blank out the original token, then add the replacement as a wrapped paragraph at the same position.
    fragment.Text = "";
    TextParagraph par = new TextParagraph();
    par.Position = fragment.Position;
    par.FormattingOptions.WrapMode = TextFormattingOptions.WordWrapMode.ByWords;
    par.HorizontalAlignment = processedToken.Position;
    par.AppendLine(newTextFrag);
    TextBuilder textBuilder = new TextBuilder(fragment.Page);
    textBuilder.AppendParagraph(par);
}

The ProcessPDFToken function looks at which token is currently being loaded and replaces it with the correct value.
I’ve tried setting ‘VerticalAlignment’ on the TextParagraph object, but every value except ‘None’ and ‘Bottom’ seems to end up rendering nothing. My first attempt was to do the replacement with the text paragraph using the exact same code but without the new TextFragment object; that gives me an “Object reference not set to an instance of an object.” exception later on, at the AppendParagraph() call.
The original PDF that’s being ingested by my code: Equipping-Families Before Code.pdf (235.6 KB)

The PDF after my code has run: Equipping-Families After Code.pdf (232.3 KB)

mudassir.fayyaz · May 5, 2021, 12:41pm

@gandalfwg

Your code includes a method ProcessPDFToken that is not defined. Please share its definition so that we may investigate it accordingly.

calphius · May 5, 2021, 3:12pm

Here is the ProcessPDFToken function:

public PDFToken ProcessPDFToken(string token, TemplateData tData)
{
    PDFToken processedToken = new PDFToken() { Value = token, Position = HorizontalAlignment.Left };

    TemplateObject o = new TemplateObject();
    o.ObjectType = TemplateObjectType.Token;

    var data = token.Substring(1);
    data = data.Substring(0, data.Length - 1);

    string[] attrib = data.Split(':');
    o.Token = attrib[0].ToLower(System.Globalization.CultureInfo.InvariantCulture);
    o.Attributes = new string[attrib.Length - 1];
    for (int x = 1; x < attrib.Length; x++)
    {
        o.Attributes[x - 1] = attrib[x];
        if (attrib[x].ToLower(System.Globalization.CultureInfo.InvariantCulture) == "not")
        {
            o.HasNot = true;
        }
    }
    o.FullToken = data;
    o.OrigionalToken = token;

    var fmb = new FormModuleBase();
    var control = new System.Web.UI.Control();
    int tokenIdx = 0;
    fmb.ProcessToken(o, new TemplateObjectCollection(), control, tData, ref tokenIdx);
    if (control.Controls.Count > 0)
    {
        if (control.Controls[0].GetType() == typeof(System.Web.UI.LiteralControl))
        {
            if (o.Token == "field")
            {
                if (attrib.Length > 2)
                    processedToken.Position = ProcessFieldPosition(attrib[2]);
                if (attrib.Length > 3)
                    processedToken.FontSize = ProcessFieldSize(attrib[3]);
            }
            var literalControl = (System.Web.UI.LiteralControl)control.Controls[0];
            processedToken.Value = literalControl.Text;
        }
    }

    if (processedToken.Value == token)
        processedToken.Value = "";

    return processedToken;
}
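
For readers following along, the token-splitting step at the top of ProcessPDFToken can be tried in isolation. The following is only a minimal sketch: the ParsedToken and TokenParser names are hypothetical, the square-bracket delimiters are inferred from the search pattern rather than confirmed, and the bracket check is an addition that the original function does not perform.

```csharp
using System;
using System.Linq;

// Hypothetical container for the pieces of a token such as "[field:Notes:center:12]".
public sealed class ParsedToken
{
    public string Name { get; set; }
    public string[] Attributes { get; set; }
    public bool HasNot { get; set; }
}

public static class TokenParser
{
    public static ParsedToken Parse(string token)
    {
        // Assumption: tokens are wrapped in square brackets.
        if (token == null || token.Length < 2 || token[0] != '[' || token[token.Length - 1] != ']')
            throw new ArgumentException("Expected a token of the form [name:attr:...]", nameof(token));

        // Drop the delimiters, then split into the name and its attributes.
        string inner = token.Substring(1, token.Length - 2);
        string[] parts = inner.Split(':');
        string[] attributes = parts.Skip(1).ToArray();

        return new ParsedToken
        {
            Name = parts[0].ToLowerInvariant(),
            Attributes = attributes,
            // Mirrors the case-insensitive "not" check in the original loop.
            HasNot = attributes.Any(a => string.Equals(a, "not", StringComparison.OrdinalIgnoreCase))
        };
    }
}
```

With a made-up token, TokenParser.Parse("[Field:Notes:center:12]") gives the name "field" and the attributes "Notes", "center" and "12". In the original function those last two sit at attrib[2] and attrib[3], which is where the alignment and font size are read from.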

It calls a function called ‘ProcessToken’ which is just a massive switch statement based on the words picked up by the ‘ProcessPDFToken’ function and is several hundred lines long. It takes those words and creates Literal Controls with its replacement.
Here is a code snippet of the relevant part of that ‘ProcessToken’ function:

string v = fieldInfo.GetValue(FieldViewMode.Value, part, a3, a4);
LiteralControl c = new LiteralControl(d);
ctrl.Controls.Add(c);
return c;

Which calls another function ‘GetValue’ here:

public override string GetValue(FieldViewMode mode, string a1, string a2, string a3)
{
    string val = ValueText[0];
    try
    {
        if (!val.Contains("<") || !val.Contains(">"))
        {
            if (mode == FieldViewMode.Email && Type == FieldType.Memo)
                val = HttpUtility.HtmlDecode(val.Replace("\n", "<br />"));
        }
    }
    catch
    {
        val = "";
        throw new InvalidOperationException("XSS injection was attempted.");
    }
    return HttpContext.Current.Server.HtmlDecode(val);
}

mudassir.fayyaz · May 6, 2021, 6:10am

@gandalfwg

I am afraid this code still cannot be compiled. Could you please create a sample application with narrowed-down code so that we can try to reproduce the issue on our end?

Please note that the contents of a PDF file cannot move to make room for new content, because PDF is a fixed-layout format. Overlapping text is therefore to be expected when text is replaced with content of a different length.
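
Because nothing on the page reflows, it can help to estimate in advance how far a replacement will spill past the single line the original token occupied. The sketch below is plain C# with no Aspose calls. The OverflowEstimator class, the average glyph width of 0.5 × font size and the line height of 1.2 × font size are all assumptions made for illustration, not values taken from the library, so the result is a rough guide only.

```csharp
using System;
using System.Collections.Generic;

public static class OverflowEstimator
{
    // Greedy word wrap: packs whole words into lines of at most maxChars characters.
    // A single word longer than maxChars is kept on its own line, unbroken.
    public static List<string> Wrap(string text, int maxChars)
    {
        var lines = new List<string>();
        string current = "";
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (current.Length == 0)
            {
                current = word;
            }
            else if (current.Length + 1 + word.Length <= maxChars)
            {
                current += " " + word;
            }
            else
            {
                lines.Add(current);
                current = word;
            }
        }
        if (current.Length > 0)
            lines.Add(current);
        return lines;
    }

    // Extra vertical space, in points, needed beyond the one line the token occupied.
    public static double ExtraHeight(string text, double boxWidth, double fontSize)
    {
        double avgGlyphWidth = 0.5 * fontSize; // assumption: rough average character width
        double lineHeight = 1.2 * fontSize;    // assumption: typical line spacing
        int maxChars = Math.Max(1, (int)(boxWidth / avgGlyphWidth));
        int lineCount = Math.Max(1, Wrap(text, maxChars).Count);
        return (lineCount - 1) * lineHeight;
    }
}
```

For example, OverflowEstimator.ExtraHeight("one two three four", 45, 10) wraps the text to three lines and reports 24 points of extra height under these assumptions. If that much free space is not available next to the token, the replacement will collide with neighbouring content.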

calphius · May 6, 2021, 3:38pm

I tried converting the PDF to HTML before inserting the new content, hoping that would resolve the issue (and then converting that HTML back to PDF), but I can’t seem to run the HTML conversion against a memory stream: it needs a file, which I can’t use in this project. Do you have a suggestion for a specific conversion type that could do what I need? Or, maybe a better idea: is there some way to grab all of the content below the text I’m selecting and move it down to make room for the new text? Or can I read all of the content in a PDF file and then create a new PDF from those content items, replacing the ones I need to? That way, I assume, I could position them as needed while filling out the new PDF.

calphius · May 6, 2021, 5:35pm

Update: I did the PDF to HTML conversion by creating a new file and it worked perfectly on the HTML side, but when I tried to convert the HTML file back into a PDF it lost all of its styling. I feel like I’m missing something: there has got to be a way of replacing text in a PDF with longer text.

mudassir.fayyaz · May 6, 2021, 8:41pm

@gandalfwg

You can try converting the PDF to a DOCX file, adding the text there, and then converting that Word file back to PDF as per your requirements. You may explore the Aspose.Words for .NET API to work with Word files.
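
A rough sketch of that round trip, kept entirely in memory streams since files are not an option in this project, might look like the following. It assumes both Aspose.PDF for .NET and Aspose.Words for .NET are referenced. The PdfRoundTrip class and ReplaceViaDocx method are hypothetical names, and how closely the converted document matches the original layout will depend on the source PDF.

```csharp
using System.Collections.Generic;
using System.IO;

public static class PdfRoundTrip
{
    // Hypothetical helper: replaces each key in `replacements` with its value
    // by converting PDF -> DOCX -> PDF without touching the file system.
    public static byte[] ReplaceViaDocx(byte[] pdfBytes, IDictionary<string, string> replacements)
    {
        using (var pdfIn = new MemoryStream(pdfBytes))
        using (var docxStream = new MemoryStream())
        using (var pdfOut = new MemoryStream())
        {
            // 1. PDF -> DOCX with Aspose.PDF; flow mode asks for editable, reflowable text.
            var pdf = new Aspose.Pdf.Document(pdfIn);
            var docxOptions = new Aspose.Pdf.DocSaveOptions
            {
                Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX,
                Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow
            };
            pdf.Save(docxStream, docxOptions);
            docxStream.Position = 0;

            // 2. Replace the tokens in the flow-layout document with Aspose.Words.
            var word = new Aspose.Words.Document(docxStream);
            foreach (KeyValuePair<string, string> pair in replacements)
                word.Range.Replace(pair.Key, pair.Value);

            // 3. DOCX -> PDF.
            word.Save(pdfOut, Aspose.Words.SaveFormat.Pdf);
            return pdfOut.ToArray();
        }
    }
}
```

The point of the detour is that a Word document uses flow layout, so a longer replacement pushes the following content down instead of overlapping it.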