TextFragmentAbsorber not finding text extracted

DarrenWray · June 7, 2023, 4:14pm

There is a difference in the text extracted using TextAbsorber, and the text that is searched using the TextFragmentAbsorber.

Attached is some recreation code and an example PDF. The recreation code extracts the text from the PDF, splits it into sentences, and then tries to re-find each sentence in the document so that information about the fonts used can be obtained from the matching text fragment.

As you will see the code tries to locate the first sentence and fails. I can’t see any valid reason for this.

Any help or suggestions are appreciated.

Darren

Test1.pdf (106.8 KB)

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

namespace TextExtractorRecreationPack
{
    internal class Program
    {
        private static readonly string strAsposeLicense = "Aspose.Total.NET.lic";

        static void Main(string[] args)
        {
            License license = new License();
            license.SetLicense(strAsposeLicense);

            // Get the text from the Pdf file
            string strFileName = "./Test1.pdf";
            Document pdfDocument = new Document(strFileName);

            // Extract the text and split into sentences
            string strPdfContent = ExtractText(pdfDocument);
            string[] strPdfSentences = strPdfContent.Split('\n');

            // Re-find the sentences
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(strPdfSentences[0]);
            
            pdfDocument.Pages[1].Accept(textFragmentAbsorber);

            if (textFragmentAbsorber.TextFragments.Count > 0)
            {
                // Do something with text fragment here
            }
            else
            {
                Console.WriteLine("Sentence not found!");
            }

        }

        public static string ExtractText(Document pdfDocument)
        {
            TextAbsorber textAbsorber = new TextAbsorber();

            pdfDocument.Pages.Accept(textAbsorber);
            
            return textAbsorber.Text;
        }
    }
}

asad.ali · June 7, 2023, 11:47pm

@DarrenWray

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54761

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

DarrenWray · August 7, 2023, 2:27pm

Any update on this issue?

asad.ali · August 7, 2023, 10:03pm

@DarrenWray

Unfortunately, the ticket has not been resolved yet because of other issues in the queue that were logged before it. We will inform you via this forum thread as soon as we complete our investigation and have some news about a fix ETA. Please give us some time.

We are sorry for the inconvenience.

DarrenWray · August 9, 2023, 6:26pm

Thanks for the update - Can you tell me how many issues are in front of this issue - As this issue is already two months old and is delaying the release of our product.

Thanks in advance,

Darren

asad.ali · August 10, 2023, 12:11am

@DarrenWray

There are different types of issues that we have been working on. Also, issues reported under priority support take precedence over tickets logged under the free support model. Nevertheless, we have already recorded your concerns and will consider them during the ticket investigation. We will inform you once we have some news about the ticket ETA. We highly appreciate your patience in this matter and apologize for the inconvenience.

DarrenWray · August 11, 2023, 3:42pm

I do appreciate that there is always a queue - however, I’m not looking for support, the items I’ve reported are bugs not support requests - I know I’m changing minds here but, I fully expect to wait for “free support” but a bug is a bug is a bug.

I am having to seriously consider going back to e-IceBlue, who have a similar support model but we have never had to wait for more than 5 weeks for a bug fix and would often have confirmation that it was being worked on by devs within 2 weeks of submission.

asad.ali · August 11, 2023, 7:35pm

@DarrenWray

First of all, please accept our apology for the inconvenience you have been facing due to this error. We have recorded your concerns, and the ticket is being investigated at the moment. We hope to have some information and updates for you during the coming week. Please note that we do realize the severity of this matter and will try our best to complete our investigation as soon as possible. We again apologize for the trouble caused.

asad.ali · August 29, 2023, 3:41pm

@DarrenWray

After investigating the issue, we can confirm some inconsistencies when searching for whole sentences on a document page.

The reason for this behavior is that TextFragmentAbsorber searches the text in raw mode by default, which extracts text in the order in which it is physically placed in the document.

A document’s contents can be arranged in such a way that a sentence that looks continuous on the page actually consists of several fragments that are not placed sequentially, which makes whole sentences difficult to locate.
To solve this issue, we have made changes to the library that enable you to extract text using TextFragmentAbsorber in flatten mode.

In this mode, TextFragmentAbsorber extracts sentences as they appear on the page, but omits extra white space.

To use the new functionality you need to modify your code as follows:

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

namespace TextExtractorRecreationPack
{
    internal class Program
    {
        private static readonly string strAsposeLicense = "Aspose.Total.NET.lic";

        static void Main(string[] args)
        {
            License license = new License();
            license.SetLicense(strAsposeLicense);

            // Get the text from the Pdf file
            string strFileName = "./Test1.pdf";
            Document pdfDocument = new Document(strFileName);

            // Extract the text and split into sentences
            string strPdfContent = ExtractText(pdfDocument);
            string[] strPdfSentences = strPdfContent.Split('\n');

            // Re-find the sentences
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(strPdfSentences[0]);

            // Use TextFragmentAbsorber in Flatten mode
            textFragmentAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Flatten);

            pdfDocument.Pages[1].Accept(textFragmentAbsorber);

            if (textFragmentAbsorber.TextFragments.Count > 0)
            {
                // Do something with text fragment here
            }
            else
            {
                Console.WriteLine("Sentence not found!");
            }

        }

        public static string ExtractText(Document pdfDocument)
        {
            // Use TextAbsorber in Flatten mode
            TextAbsorber textAbsorber = new TextAbsorber
            {
                ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Flatten)
            };

            pdfDocument.Pages.Accept(textAbsorber);

            return textAbsorber.Text;
        }
    }
}

You will be able to use these enhancements with the upcoming release, 23.9. We will post a notification in this forum thread once the release is published.
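
Because flatten mode omits extra white space, it may also be worth making sure the search string itself carries no stray white space before it is passed to TextFragmentAbsorber. The sketch below is only an illustration: SearchTextHelper and its methods are hypothetical names, not part of Aspose.PDF, and it is an assumption (not something confirmed in this thread) that a trailing '\r' or an empty first line left over from Split('\n') contributes to a failed match.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class SearchTextHelper
{
    // Collapse every run of white space (spaces, tabs, '\r', '\n')
    // into a single space and trim both ends.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text ?? string.Empty, @"\s+", " ").Trim();
    }

    // Split extracted text on Windows, Unix, and old Mac line endings,
    // normalise each line, and drop lines that end up empty.
    public static string[] SplitIntoLines(string text)
    {
        return (text ?? string.Empty)
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None)
            .Select(CollapseWhitespace)
            .Where(line => line.Length > 0)
            .ToArray();
    }
}
```

In the code above, this would replace the strPdfContent.Split('\n') call with SearchTextHelper.SplitIntoLines(strPdfContent), so that strPdfSentences[0] is always a non-empty line without a trailing carriage return. It does not change how TextFragmentAbsorber itself matches text.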

aspose.notifier · September 14, 2023, 10:24pm

The issues you have found earlier (filed as PDFNET-54761) have been fixed in Aspose.PDF for .NET 23.9.