TextAbsorber adds extra space while extracting from pdf

sumitworksimpli · April 12, 2022, 9:50am

Hello,

Here’s the code we use to extract the text from the PDF.

    string regexMatch = "[,0-9*A-Za-z ].*.[.)0-9a-z ]";
    var textFragmentAbsorber = new TextFragmentAbsorber(new Regex(regexMatch), new TextSearchOptions(true));
    textFragmentAbsorber.Phrase = "any information (including any technology, know";
    pdfDocument.Pages[replaceObj[i].pageSeq].Accept(textFragmentAbsorber);
    textFragmentCollection = textFragmentAbsorber.TextFragments;

The issue is that one of the fragments we are analyzing, shown below, is returned with extra spaces:
“a. any information (including any technology, know-how, patent application, software, test”

Here is the file for your reference:
Nitrogen.pdf (591.5 KB)

Please let me know if you need any other information.

Regards,
Sumit Awasthi

tahir.manzoor · April 12, 2022, 3:52pm

@sumitworksimpli

Please create a standalone console application (source code without compilation errors) that lets us reproduce the problem on our end, and attach it here for testing. We will investigate the issue and provide you with more information.

sumitworksimpli · April 13, 2022, 4:59am

Hello,

Here is the required console app:
Nitrogen.pdf (591.5 KB)

    using System;
    using System.Text.RegularExpressions;
    using Aspose.Pdf;
    using Aspose.Pdf.Text;

    namespace TextAbsorberBug
    {
        class Program
        {
            static void Main(string[] args)
            {
                Console.WriteLine("hey bug!");
                string dataDir = @"C:\Nitrogen.pdf";
                var replaceRegex = "any information (including any technology, know";
                using (Document pdfDocument = new Document(dataDir))
                {
                    string regexMatch = "[,0-9*A-Za-z  ].*.[.)0-9a-z  ]";

                    // Create the absorber in regular-expression mode, then set the phrase to search for.
                    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(new Regex(regexMatch), new TextSearchOptions(true));
                    textFragmentAbsorber.Phrase = replaceRegex;

                    pdfDocument.Pages[1].Accept(textFragmentAbsorber);
                    var textFragmentCollection = textFragmentAbsorber.TextFragments;
                    foreach (TextFragment textFragment in textFragmentCollection)
                    {
                        if (textFragment.Text.Contains(replaceRegex) || textFragment.Text.CompareTo(replaceRegex) == 0)
                        {
                            Console.WriteLine("Working fine");
                            return;
                        }
                    }
                    Console.WriteLine("why extra spaces?");
                }
            }
        }
    }

tahir.manzoor · April 13, 2022, 4:41pm

@sumitworksimpli

We were able to reproduce the issue on our end and have logged it in our issue tracking system as PDFNET-51636. You will be notified via this forum thread once it is resolved.

We apologize for the inconvenience.

LessThan3 · February 3, 2023, 4:44pm

Has this issue been resolved? We have been experiencing the same problem on our system.

asad.ali · February 3, 2023, 7:58pm

@LessThan3

We are afraid that the earlier logged ticket has not yet been resolved. However, your concerns have been recorded, and we will inform you as soon as we have an update. Please give us some time.

We are sorry for the inconvenience.
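
Until PDFNET-51636 is resolved, one possible stopgap is to make the comparison in the console app tolerant of spacing rather than relying on the extracted text being exact. The sketch below is not an Aspose.PDF API and does not fix the extraction itself; it assumes the extra spaces show up as runs of whitespace where a single space is expected. The names `WhitespaceTolerantMatch`, `Normalize`, and `ContainsIgnoringExtraSpaces` are hypothetical helpers introduced here for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): compares text while
// ignoring differences in the amount of whitespace.
public static class WhitespaceTolerantMatch
{
    // Collapse every run of whitespace into a single space and trim the ends.
    public static string Normalize(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // True when 'phrase' occurs in 'extracted' once both are normalized.
    public static bool ContainsIgnoringExtraSpaces(string extracted, string phrase)
    {
        return Normalize(extracted).IndexOf(Normalize(phrase), StringComparison.Ordinal) >= 0;
    }
}
```

In the console app above, the check inside the loop could then read `WhitespaceTolerantMatch.ContainsIgnoringExtraSpaces(textFragment.Text, replaceRegex)`. If the extra spaces land inside a word rather than between words, this normalization alone would not be enough.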