TextFragmentAbsorber with TextSearchOptions regex is slow

mwc · March 21, 2025, 11:04am

Hi, I am trying to split a 1652-page PDF based on a rule. My current implementation works fine with PDF files that have 120 pages. With my large PDF, I get very long wait times when using the following code:

var document = new Document(inputStream);

// use regex to split by rule
var textFragmentAbsorber = new TextFragmentAbsorber(
    new Regex(splitRule, RegexOptions.IgnoreCase | RegexOptions.Multiline),
    new TextSearchOptions(true));
document.Pages.Accept(textFragmentAbsorber);

A rule looks like this, for example: ^Test

Is there any way to improve performance here?

Professionalize.Discourse · March 21, 2025, 11:04am

@mwc

To improve the performance of your PDF processing when using TextFragmentAbsorber with a regular expression, consider the following strategies:

Optimize the Regular Expression: Ensure that your regex pattern is as efficient as possible. For example, avoid using complex patterns that may lead to excessive backtracking. Simplifying the regex can significantly reduce processing time.
Limit the Scope of Search: Instead of processing the entire document at once, you can split the document into smaller sections or pages. This way, you can handle each section individually, which may help in reducing the load and improving performance.
Use TextSearchOptions Wisely: The TextSearchOptions parameter can be configured to optimize the search. For instance, if you do not need case-insensitive search or multiline matching, you can disable those options to speed up the process.
Parallel Processing: If your environment supports it, consider processing multiple pages in parallel. This can be done by using multi-threading to handle different pages simultaneously, which can significantly reduce the overall processing time.
Profile Your Code: Use profiling tools to identify bottlenecks in your code. This can help you understand which parts of the regex processing are taking the most time and allow you to focus your optimization efforts effectively.
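
As a rough illustration of the per-page, parallel, and timing ideas above, here is a minimal sketch in plain Java. It makes no Aspose calls: it assumes the text of each page has already been extracted into a list of strings (the `pageTexts` parameter and the `PageRuleScan` class are hypothetical names), and it only shows how a precompiled pattern can be applied to pages independently. Whether Aspose objects can safely be shared across threads is not established in this thread, so treat the parallel part as something to verify before relying on it.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PageRuleScan {
    // Compile the rule once and reuse it; Pattern instances are immutable and thread-safe.
    static final Pattern RULE =
        Pattern.compile("^Test", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Returns the 1-based numbers of the pages whose text contains a line matching the rule.
    // pageTexts is assumed to hold already-extracted page text (hypothetical input).
    static List<Integer> matchingPages(List<String> pageTexts) {
        return IntStream.range(0, pageTexts.size())
            .parallel()                                   // pages are checked independently
            .filter(i -> RULE.matcher(pageTexts.get(i)).find())
            .map(i -> i + 1)                              // convert index to page number
            .boxed()
            .collect(Collectors.toList());                // encounter order is preserved
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
            "Intro\nsome text",
            "header\ntest section starts here",
            "no match: a Test in the middle");

        long start = System.nanoTime();
        List<Integer> hits = matchingPages(pages);
        double elapsedMs = (System.nanoTime() - start) / 1_000_000.0;

        System.out.println("split points: " + hits + " (" + elapsedMs + " ms)");
    }
}
```

With the three sample pages, only page 2 is reported, because it is the only one where a line begins with "test". Timing the scan separately from PDF loading and text extraction helps show whether the regex or the extraction step is the slow part.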

Here is a sample code snippet that demonstrates how to set up the TextFragmentAbsorber with optimized options:

import com.aspose.pdf.Document;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.aspose.pdf.TextSearchOptions;

public class OptimizeTextSearch {
    public static void main(String[] args) {
        Document pdfDocument = new Document("largeDocument.pdf");
        
        // Optimize regex pattern
        String splitRule = "^Test"; // Example pattern
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(
            splitRule, new TextSearchOptions(false)); // Disable case-insensitive and multiline options

        // Process each page individually
        for (int i = 1; i <= pdfDocument.getPages().size(); i++) {
            pdfDocument.getPages().get_Item(i).accept(textFragmentAbsorber);
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
            // Process the found fragments as needed
        }
    }
}

By implementing these strategies, you should see an improvement in the performance of your PDF processing tasks.

mwc · March 21, 2025, 11:17am

Your suggestions do not work for me, and setting

public TextSearchOptions(bool isRegularExpressionUsed);

to false does not sound right to me, since I want to use regex for this kind of work.
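
To make the concern concrete, here is a small analogy using only java.util.regex, not the Aspose API. It assumes that passing false for isRegularExpressionUsed means the rule is searched as literal text, which the parameter name suggests but which is not confirmed in this thread. The `LiteralVsRegex` class and `found` method are hypothetical names.

```java
import java.util.regex.Pattern;

public class LiteralVsRegex {
    // Searches text for rule, either as a regular expression or as literal characters.
    static boolean found(String text, String rule, boolean isRegex) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        if (!isRegex) {
            flags |= Pattern.LITERAL; // metacharacters such as ^ lose their special meaning
        }
        return Pattern.compile(rule, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String page = "Intro\nTest chapter";

        // As a regex, ^ anchors to a line start, so the second line matches.
        System.out.println(found(page, "^Test", true));

        // As a literal, the search looks for the characters ^Test, which the page lacks.
        System.out.println(found(page, "^Test", false));
    }
}
```

Under that assumption, a rule such as ^Test would stop matching line starts once the regex flag is turned off, so the flag has to stay true for this kind of split rule.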

asad.ali · March 21, 2025, 6:34pm

@mwc

Have you tried using version 25.3 of the API? If the issue still persists, please share your sample PDF file with us, along with how long the API takes to process it and details of your environment, e.g. CPU, RAM, etc. We will then investigate further.