Extracting a single line of text sometimes results in multiple text fragments (C#)

e.vandelaar · November 2, 2022, 8:17am

Hi all,

We are currently trying to use Aspose.PDF to extract text from a PDF file on a line-by-line or paragraph basis.
However, we discovered that extracting the text fragments sometimes results in multiple fragments for a single line of text.

For example, the line "Op dit moment ondervind ik veel geluidsoverlast van het bedrijf Jimmie’s Pizza gevestigd aan de " results in these text fragments:

Op dit moment ondervind ik veel geluidsoverlast van het bedrijf
Jimmie
’
s Pizza
gevestigd aan de

We have tried multiple approaches to extract text on a line-by-line basis, using both the TextFragmentAbsorber and the ParagraphAbsorber, but both yield the same result.

Is there an alternative method we can use to solve this requirement?

The test code I used with the TextFragmentAbsorber:

byte[] pdfFile = File.ReadAllBytes(@"Voorbeeld brief maskeren.pdf");

// Convert the byte array to a MemoryStream so it can be processed
MemoryStream payloadStream = new MemoryStream(pdfFile);

// Import in Aspose
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(payloadStream);

// Retrieve all the text fragments inside the PDF
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
pdfDocument.Pages.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = absorber.TextFragments;

int lNr = 0;
foreach (Aspose.Pdf.Text.TextFragment textFrag in textFragmentCollection)
{
    // Get the text of the fragment and print it with its index
    var textFragment = textFrag.Text;
    Console.WriteLine("Line nr: " + lNr.ToString() + " text: " + textFragment);
    lNr++;
}

e.vandelaar · November 2, 2022, 8:18am

I have also attached the test PDF we are processing: Voorbeeld brief maskeren.pdf (70.3 KB)

tahir.manzoor · November 2, 2022, 2:46pm

@e.vandelaar

Please read the following article, which should help you achieve your requirement. Hope this helps.
Extract Paragraph from PDF C#
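
If the absorbers keep splitting a line, one possible workaround is to merge fragments that share a baseline after extraction. The following is only a minimal sketch under stated assumptions, not the approach from the linked article: `Fragment` and `LineMerger` are hypothetical names, and the coordinates would have to be filled from whatever position data your fragments expose (for example `TextFragment.Position` and `TextFragment.Rectangle`, which we assume are available in your version). The tolerance values are illustrative and given in points.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for an extracted fragment (not an Aspose type):
// its text, left edge (X), baseline (Y) and width, all in points.
public record Fragment(string Text, double X, double Y, double Width);

public static class LineMerger
{
    // Groups fragments whose baselines lie within yTolerance of the first
    // fragment of a line, then joins each group from left to right. A space is
    // inserted only where the horizontal gap between two neighbours exceeds
    // gapTolerance, so "Jimmie", "’" and "s Pizza" are rejoined without gaps.
    public static List<string> MergeIntoLines(
        IEnumerable<Fragment> fragments,
        double yTolerance = 2.0,
        double gapTolerance = 1.0)
    {
        var lines = new List<List<Fragment>>();

        // PDF coordinates grow upwards, so a larger Y is higher on the page.
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            if (lines.Count > 0 &&
                Math.Abs(lines[lines.Count - 1][0].Y - fragment.Y) <= yTolerance)
            {
                lines[lines.Count - 1].Add(fragment);
            }
            else
            {
                lines.Add(new List<Fragment> { fragment });
            }
        }

        var result = new List<string>();
        foreach (var line in lines)
        {
            var ordered = line.OrderBy(f => f.X).ToList();
            var text = new StringBuilder();
            for (int i = 0; i < ordered.Count; i++)
            {
                if (i > 0)
                {
                    var previous = ordered[i - 1];
                    double gap = ordered[i].X - (previous.X + previous.Width);
                    if (gap > gapTolerance)
                    {
                        text.Append(' ');
                    }
                }
                text.Append(ordered[i].Text);
            }
            result.Add(text.ToString());
        }
        return result;
    }
}
```

To try it, build one `Fragment` per extracted text fragment and pass the list to `LineMerger.MergeIntoLines`. Both tolerances will likely need tuning per document, since font size and line spacing determine what counts as "the same line".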