Issue with TextFragmentAbsorber

Hi,

I’m implementing an application that replaces text fragments in PDFs based on regular expressions. I used the code from this forum as a starting point. As long as I worked with an older version of Aspose PDF .Net (11.7), everything worked. I now wanted to use the current version (18.8) instead, and I’m having a lot of problems.

The error appears when the current page accepts the textFragmentAbsorber:

pdfPage.Accept(textFragmentAbsorber);

I get an exception:

Exception thrown: 'System.ArgumentOutOfRangeException' in mscorlib.dll
Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: startIndex

This appears to happen on PDF/As created with Nuance.

I ask you to have a look at that issue, because I’d like to use the most recent version of the toolkit for my project.

Best regards,
Georg

Please find an example document here: https://files.fm/u/q4j737kn

Code:

(try it with Search Expression “the” and Replace Expression “XXX” for example)

    private void buttonReplace_Click(object sender, EventArgs e)
    {
        try
        {
            // Create a PDF license object and instantiate license file
            Aspose.Pdf.License license = new Aspose.Pdf.License();
            license.SetLicense("Aspose.Pdf.lic");

            // Set the value to indicate that license will be embedded in the application
            license.Embedded = true;

            // Configure file open dialog
            openFileDialog1.Filter = "PDF files (*.PDF)|*.PDF|" + "All files (*.*)|*.*";
            openFileDialog1.Multiselect = true;
            openFileDialog1.Title = "Document selection";
            openFileDialog1.InitialDirectory = Properties.Settings.Default.LastDirectory;
            openFileDialog1.FileName = "";

            // Show file open dialog
            DialogResult result = this.openFileDialog1.ShowDialog();

            // If the user confirmed the selection
            if (result == System.Windows.Forms.DialogResult.OK)
            {
                // For each selected file
                foreach (String fileName in openFileDialog1.FileNames)
                {
                    // Save most recently used directory
                    Settings.Default.LastDirectory = Path.GetDirectoryName(fileName);
                    Settings.Default.Save();

                    // Open PDF document
                    using (Document pdfDocument = new Document(fileName))
                    {
                        int replacementCounter = 0;

                        // Set text search option to specify regular expression usage
                        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(textBoxSearchExpression.Text);
                        TextSearchOptions textSearchOptions = new TextSearchOptions(true);
                        textFragmentAbsorber.TextSearchOptions = textSearchOptions;

                        // For each page in the PDF
                        foreach (Page pdfPage in pdfDocument.Pages)
                        {
                            // Accept the absorber
                            pdfPage.Accept(textFragmentAbsorber);

                            // Get the extracted text fragments
                            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
                            replacementCounter = replacementCounter + textFragmentCollection.Count;

                            // Loop through the fragments (found search expressions)
                            foreach (TextFragment textFragment in textFragmentCollection)
                            {
                                string replacementString = "";

                                // If the replacement string contains group definitions (starting with "$")
                                if (textBoxReplaceExpression.Text.Contains("$"))
                                {
                                    replacementString = Regex.Replace(textFragment.Text, textBoxSearchExpression.Text, textBoxReplaceExpression.Text);
                                }
                                // Else: No groups have been defined, use the replacement string as is
                                else
                                {
                                    replacementString = textBoxReplaceExpression.Text;
                                }

                                // Update text with the replacement expression
                                textFragment.Text = replacementString;
                            }
                        }

                        // Save PDF with replaced text
                        string fileNameStub = Path.Combine(Path.GetDirectoryName(fileName), Path.GetFileNameWithoutExtension(fileName));
                        pdfDocument.Save(fileNameStub + "_replaced.pdf");

                        // Get the text of the (modified) page/document and save it to a text file
                        TextAbsorber textAbsorber = new TextAbsorber();
                        pdfDocument.Pages.Accept(textAbsorber);
                        string outputText = textAbsorber.Text;
                        File.WriteAllText(fileNameStub + "_replaced.txt", outputText);
                    }
                }

                MessageBox.Show("Processed " + openFileDialog1.FileNames.Length + " document(s).", "Replacing text", MessageBoxButtons.OK, MessageBoxIcon.Information);
            }
        }
        catch (Exception ex)
        {
            Debug.Print(ex.Message);
            MessageBox.Show("Error in buttonReplace_Click: " + ex.Message, "AsposeTest", MessageBoxButtons.OK, MessageBoxIcon.Error);
        }
    }
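The group handling inside the loop above can be tried in isolation, without a PDF or Aspose.PDF. The following is only a sketch of that one step: `ReplacementHelper` and `BuildReplacement` are hypothetical names that do not appear in the original code, and the only library call used is .NET's `Regex.Replace`.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): mirrors the branch in the
// loop above that decides how the replacement text is built.
public static class ReplacementHelper
{
    public static string BuildReplacement(string matchedText, string searchPattern, string replaceExpression)
    {
        // If the replace expression refers to groups ("$1", "$2", ...),
        // let Regex.Replace expand them against the matched fragment text.
        if (replaceExpression.Contains("$"))
        {
            return Regex.Replace(matchedText, searchPattern, replaceExpression);
        }

        // No group references: use the replace expression as is.
        return replaceExpression;
    }
}
```

For example, `ReplacementHelper.BuildReplacement("2018-08", @"(\d{4})-(\d{2})", "$2/$1")` swaps the two groups. Note that the pattern is applied to the fragment text only, so a pattern that relies on surrounding context (a lookbehind, for instance) may not match again at this step.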

@georg.mahler

Thank you for contacting support.

We have tested the file you shared but have not been able to reproduce the issue. Your code snippet cannot be executed as is because some variables are missing, for example textBoxReplaceExpression. Please update the code snippet so that we can try to reproduce and investigate the issue in our environment. In the meantime, we tried replacing the text with the code snippet below, and the issue did not occur.

        // Open document
        Document pdfDocument = new Document(dataDir + "page_nuance_0_input.pdf");

        // Create TextFragmentAbsorber object to find all instances of the input search phrase
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("the");

        // Accept the absorber for all the pages
        pdfDocument.Pages.Accept(textFragmentAbsorber);

        // Get the extracted text fragments
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

        // Loop through the fragments
        foreach (TextFragment textFragment in textFragmentCollection)
        {
            textFragment.Text = "XXX";
        }

        pdfDocument.Save(dataDir + "Test_18.9.1.pdf");

Before sharing the updated code snippet, please make sure you are using Aspose.PDF for .NET 18.9.1 in your environment.

Hi,

I updated the Aspose DLL to the most current version as mentioned in your reply.

Now I have the following problem: I cannot reproduce the problem with my standalone test application with which I wanted to demonstrate the error to you. When I use the PDF that was causing the problem, it works without exception.

The application I need to develop is in a DLL that is used by a third party product. I use exactly the same code there as in my standalone application, but there the exception mentioned in my first post still occurs.

I experimented further and I found out that the exception is only thrown when I use regular expressions with the text absorber, like this:

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\bder\b");
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;

If I search without using regular expressions, even the code in the DLL works.
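To rule out the pattern itself as the cause, it can be run through .NET's own `Regex` class outside of Aspose.PDF. This is only a sketch under that assumption: `PatternCheck` and `CountMatches` are hypothetical names, and it does not show how the absorber evaluates the expression internally.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): counts how often a
// pattern matches in a plain string, using only System.Text.RegularExpressions.
public static class PatternCheck
{
    public static int CountMatches(string text, string pattern)
    {
        // Regex.Matches returns every non-overlapping match in the input.
        return Regex.Matches(text, pattern).Count;
    }
}
```

With the pattern above, `PatternCheck.CountMatches("Der Hund und der Ball, oder?", @"(?i)\bder\b")` counts "Der" and "der" but not the "der" inside "oder", because `\b` requires a word boundary and `(?i)` makes the match case-insensitive.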

It’s not possible to send you the code of the DLL: as you do not have the third party product, you wouldn’t be able to test it. Could I ask you nevertheless to have another look at the mechanism for regular expressions? In your example you haven’t used a regular expression.

Best regards,
Georg

OK, finally I managed to construct a scenario where I get the error with my standalone application (and without regular expressions by the way, so the assumption in my last post seems to be wrong).

I uploaded the whole Visual Studio project and a testing PDF, with which you should be able to reproduce the issue. At least it works for me:

https://files.fm/u/x7u6k6xz

Best regards,
Georg

@georg.mahler

Thank you for sharing the sample application.

We have been able to reproduce the issue in our environment. A ticket with ID PDFNET-45365 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

Thank you very much, it’s great that you were able to reproduce the problem! I’m looking forward to hearing from you.

Best regards,
Georg

@georg.mahler

We will update you as soon as the ticket has been investigated and a significant update is available.

Hello team,

my solution is still not usable due to that bug I found 2 months ago. Obviously no one on the development team has touched it so far. I need a fix in the near future, and to be honest I think that 2 months is quite a lot of time for you to look at a confirmed problem and find a solution, while I see that a lot of other issues reported later have already been fixed. I have to admit that I am starting to get disappointed. Any news for me and my customer?

Best regards,
Georg

@georg.mahler

Thank you for getting back to us.

We definitely value your concerns and realize the significance of the issue. Please note that we provide a resolution for every issue reported by our customers. However, issues are resolved on a first-come, first-served basis, which we believe is the fairest policy. There is a large number of pending issues in the queue that were reported prior to your issue, and we have been working on resolving those issues as well as introducing new features and improvements in the API.

Therefore, your ticket may take a few more months to resolve. We will let you know as soon as any updates are available. We really appreciate your patience and understanding in this regard.

Moreover, we also offer Paid Support, where issues are investigated with higher priority. Customers with a Paid Support subscription report their issues there, and those issues are investigated urgently. If your reported issue is a blocker, you may consider subscribing to Paid Support. For further information, please visit Paid Support FAQs.

Hello,

first of all many thanks for the quick answer. It’s true that you really developed a lot of new features for the API and I think that’s great. Nevertheless, as you have already written, many bugs have been reported. In the context of an agile approach, I think it’s OK that bugs occur during fast development cycles. But it is also imperative that these bugs are fixed within an acceptable time span. Especially when there are so many!

As a customer, even without premium support, I have already paid for the product. And it contains a confirmed error, so it is not flawless, but has a defect. After having waited 2 months already, I hear that it will probably take “several months” until my bug will be fixed. “Several months”, plural, and that with many more bugs waiting as long for a fix! Don’t you think that a consolidation is necessary here and all bugs have to be fixed before the development capacities can be put into further features?

And now as a customer with a faulty product I should pay even more to get a bug fixed in that product without having to wait for - how long? half a year? a whole year? That is really not good business conduct!

Only out of interest: If I decide to pay extra money to be able to use the product I have already paid for the way it was intended, how long would the fixing take?

Quite disappointed greetings,
Georg

@georg.mahler

First of all, please accept our humble apologies for the inconvenience. We realize the severity of the issue and value your concerns. However, please note that we have been working to resolve reported issues, handle feature and enhancement requests, and implement new features and functionality to keep the API up to date.

We release an API revision each month, and in each monthly cycle we deal with parallel queues of issues with different priorities (i.e. a high-priority queue in paid support and a normal/low-priority queue in free support). Paid support does not guarantee an immediate resolution, but it expedites the investigation and the provision of an ETA. Additionally, issues logged under the paid support model take precedence over issues reported under the normal/free support model.

As shared earlier, in the free support model issues are resolved on a first-come, first-served basis. Resolution time also depends on the nature of the issue: how complex it is and how many internal components of the API it involves. Sometimes an enhancement that is already being implemented resolves other issues as well, since many issues can depend on the same component of the API.

For these reasons we do not promise a reliable ETA when an issue is in the free support queue, and we recommend paid support only when an issue is a blocker and needs to be resolved urgently.

We apologize again if any of our previous responses caused a misconception about our support policies. You may also visit Free Support Policies for more details. We have recorded your concerns and will definitely consider them while investigating the issue. Your cooperation and patience in this matter are greatly appreciated. Please give us a little time.

We are sorry for the inconvenience.

The issues you have found earlier (filed as PDFNET-45365) have been fixed in Aspose.PDF for .NET 18.12.