Words split in multiple text fragments (C#)

e.vandelaar · November 7, 2022, 6:16pm

Hi,

We are working with PDF documents in which we want to extract and replace words in specific sentences.
We’ve gone through the article on extracting paragraphs (Extract Paragraph from PDF C#|Aspose.PDF for .NET) and we can get the paragraphs back without any issues.

What we now want to achieve is to replace the text inside the paragraph. What we have found so far is that this has to be done by getting the text fragments inside the paragraph. While working on this we have discovered that pretty frequently words are split across two text fragments.

For instance, the word “telefoonnummer”, which sits on a single line, spans two fragments (fragment 1: “telefo”, fragment 2: “onnummer”). This causes problems for us: we would like to replace words like this with another word, but because the word cannot be found in any single text fragment, we are unable to replace it.

I know there is also the option to search for text to replace using the TextFragmentAbsorber but ideally, we want to do this on a line or text fragment basis (for instance by replacing all the words that need replacing in the paragraph or line and then overwriting the existing line with a new one with the specific words replaced).

Is this normal behaviour, and is there an alternative option available to achieve what we are looking for?

asad.ali · November 7, 2022, 9:26pm

@e.vandelaar

Would you kindly share complete code snippet and sample file for our reference as well? We will test the scenario in our environment and address it accordingly.

e.vandelaar · November 8, 2022, 7:46am

Of course, thanks for your help!

I’ve attached a small example PDF that can be used to reproduce the behaviour.
In this case you can see that on line 3 (fragments 8, 9 and 10) the text “Zijn telefoonnummer is” is split across three text fragments.

The test code:

        // Requires: using System; using System.Collections.Generic; using System.IO; using System.Text;
        byte[] pdfFile = File.ReadAllBytes(@"test_doc_aspose.pdf");
        MemoryStream payloadStream = new MemoryStream(pdfFile);
        Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(payloadStream);

        Aspose.Pdf.Text.ParagraphAbsorber paraAbsorber = new Aspose.Pdf.Text.ParagraphAbsorber();
        paraAbsorber.Visit(pdfDocument);

        var l = 1; // running line number across the whole document
        var f = 1; // running fragment number (not reset per line)

        foreach (Aspose.Pdf.Text.PageMarkup markup in paraAbsorber.PageMarkups)
        {
            foreach (Aspose.Pdf.Text.MarkupSection section in markup.Sections)
            {
                foreach (Aspose.Pdf.Text.MarkupParagraph paragraph in section.Paragraphs)
                {
                    // Collects the paragraph text; not used further in this repro.
                    StringBuilder paragraphText = new StringBuilder();

                    foreach (List<Aspose.Pdf.Text.TextFragment> line in paragraph.Lines)
                    {
                        foreach (Aspose.Pdf.Text.TextFragment fragment in line)
                        {
                            // Print each fragment to show where words get split.
                            Console.WriteLine($"Line: {l}, fragment: {f}, text: {fragment.Text}");
                            paragraphText.Append(fragment.Text);
                            f++;
                        }
                        paragraphText.Append("\r\n");
                        l++;
                    }
                    paragraphText.Append("\r\n");
                }
            }
        }

Thanks!

Test_Doc_Aspose.pdf (9.7 KB)

asad.ali · November 8, 2022, 3:45pm

@e.vandelaar

We need to investigate this scenario further to check the feasibility of your requirements. A ticket has been logged as PDFNET-52920 in our issue tracking system for this purpose. We will look into its details and keep you posted on the status of its correction. Please be patient and give us some time.

We are sorry for the inconvenience.

e.vandelaar · November 8, 2022, 4:03pm

Hello,

Thanks for your reply!
Is there any way we can monitor the status of the ticket?

Thanks!

asad.ali · November 8, 2022, 7:07pm

@e.vandelaar

You can check the status of the issue at the bottom of this thread. However, you will not be able to open the link, as it points to our internal issue tracking system. We will also keep you updated on the ticket's resolution status via this forum thread.

e.vandelaar · November 8, 2022, 7:18pm

Great, thanks for the info!

asad.ali · June 19, 2023, 8:35pm

@e.vandelaar

Text in a PDF document can be shown with two operators, Tj and TJ. The Tj operator takes a single string, so it holds the entire text segment and the resulting TextFragment can contain more than one word. The TJ operator takes an array of strings with position adjustments between them, so each segment of text is broken into small chunks. In this document, text is represented by TJ operators. This means that each TextFragment you receive from the ParagraphAbsorber is not necessarily a whole word; it may be only a chunk of a word, and this is normal.
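To illustrate why a word cannot be found in any single fragment, here is a minimal sketch in plain C# that works only on the fragment texts of one line. It joins them, searches the joined text, and reports which fragments the match overlaps. The class and method names (FragmentWordLocator, FindCoveringFragments) are hypothetical helpers for this illustration and are not part of Aspose.PDF; the fragment texts are assumed to be non-null and already in reading order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.PDF.
public static class FragmentWordLocator
{
    // Returns the indices of the fragments that together contain the first
    // occurrence of `word` in the concatenated line text.
    // Returns an empty list when the word is not found.
    public static List<int> FindCoveringFragments(IReadOnlyList<string> fragmentTexts, string word)
    {
        var covering = new List<int>();
        if (string.IsNullOrEmpty(word)) return covering;

        // Join the fragments so that split words become whole again.
        string lineText = string.Concat(fragmentTexts);
        int start = lineText.IndexOf(word, StringComparison.Ordinal);
        if (start < 0) return covering;
        int end = start + word.Length; // exclusive end of the match

        int offset = 0;
        for (int i = 0; i < fragmentTexts.Count; i++)
        {
            int fragStart = offset;
            int fragEnd = offset + fragmentTexts[i].Length;

            // A non-empty fragment is part of the match when its character
            // range [fragStart, fragEnd) intersects [start, end).
            bool overlaps = fragStart < end && fragEnd > start;
            if (fragmentTexts[i].Length > 0 && overlaps) covering.Add(i);

            offset = fragEnd;
        }
        return covering;
    }
}
```

With illustrative fragment texts "Zijn ", "telefo", "onnummer" and " is", searching for "telefoonnummer" reports the fragments at indices 1 and 2. Those are the fragments whose text would need to be rewritten together to replace the word.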

A solution that might be useful for you is to use a TextFragmentAbsorber with a Regex search and then group the resulting TextFragments by their coordinates to find the lines they belong to.

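        // Requires: using System.Linq; using System.Text.RegularExpressions; using Aspose.Pdf.Text;
        // "input" is the path or stream of the PDF document.
        Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(input);
        var absorber = new TextFragmentAbsorber(new Regex(@"\w*\w"));
        absorber.Visit(pdfDocument.Pages[1]);
        // Fragments that share the same Y coordinate are treated as one line.
        var lines = absorber.TextFragments.GroupBy(tf => tf.Position.YIndent);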
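GroupBy on a double only groups values that are exactly equal, so if the Y coordinates of fragments on the same visual line differ slightly (an assumption; this depends on the document), a line could end up split into several groups. The following is a rough sketch of grouping with a tolerance instead. It uses a plain record, PositionedText, as a hypothetical stand-in for the Text, Position.XIndent and Position.YIndent values of a TextFragment, so it runs without Aspose.PDF; LineGrouper, JoinIntoLines and the default tolerance of 1.0 are likewise illustrative choices, not library API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a fragment's text and position.
public sealed record PositionedText(string Text, double X, double Y);

// Hypothetical helper, not part of Aspose.PDF.
public static class LineGrouper
{
    // Groups fragments whose Y coordinate lies within `tolerance` of the
    // first fragment of a line, then joins each line's texts from left to right.
    public static List<string> JoinIntoLines(IEnumerable<PositionedText> fragments, double tolerance = 1.0)
    {
        var lines = new List<List<PositionedText>>();

        // The PDF Y axis points upwards, so descending Y means top to bottom.
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= tolerance)
            {
                current.Add(fragment);
            }
            else
            {
                lines.Add(new List<PositionedText> { fragment });
            }
        }

        // Within a line, order by X so that split words are rejoined in reading order.
        return lines
            .Select(line => string.Concat(line.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}
```

With real fragments, the input could be built from absorber.TextFragments by projecting each fragment to new PositionedText(tf.Text, tf.Position.XIndent, tf.Position.YIndent). A suitable tolerance depends on the font size and line spacing of the document, so the value here is only a starting point.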