Hi, I am trying to split a 1652-page PDF based on a rule. My current implementation works fine with PDF files that have 120 pages. With my large PDF, I get very long wait times when using the following code:
var document = new Document(inputStream);
//use regex to split by rule
var textFragmentAbsorber = new TextFragmentAbsorber(
    new Regex(splitRule, RegexOptions.IgnoreCase | RegexOptions.Multiline),
    new TextSearchOptions(true));
document.Pages.Accept(textFragmentAbsorber);
A rule looks like this, for example: ^Test
Is there any way to improve performance here?
To improve the performance of your PDF processing when using TextFragmentAbsorber with a regular expression, consider the following strategies:
Optimize the Regular Expression: Ensure that your regex pattern is as efficient as possible. For example, avoid using complex patterns that may lead to excessive backtracking. Simplifying the regex can significantly reduce processing time.
Limit the Scope of Search: Instead of processing the entire document at once, you can split the document into smaller sections or pages. This way, you can handle each section individually, which may help in reducing the load and improving performance.
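Use TextSearchOptions Wisely: The Boolean passed to the TextSearchOptions constructor tells the absorber whether the search phrase is a regular expression, so it has to stay true for a rule such as ^Test. Case-insensitive and multiline matching are controlled by the RegexOptions flags on the Regex itself; if you do not need one of them, dropping that flag may reduce the work the regex engine has to do.
Parallel Processing: If your environment supports it, consider processing multiple pages in parallel. A single Document instance should not be assumed safe to use from several threads at once, so give each worker its own Document (for example, one per page range) rather than sharing one instance. Splitting the work this way can reduce the overall processing time.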
Profile Your Code: Use profiling tools to identify bottlenecks in your code. This can help you understand which parts of the regex processing are taking the most time and allow you to focus your optimization efforts effectively.
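Two of the suggestions above, compiling the pattern once and scanning pages independently, can be sketched without any PDF library. The example below is only an illustration, assuming the text of each page has already been extracted into a list of strings; SplitPointScanner and findSplitPages are hypothetical names and are not part of Aspose.PDF.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper, not part of Aspose.PDF. Page text is assumed to be
// extracted elsewhere and passed in as plain strings.
final class SplitPointScanner {

    // Returns the 1-based numbers of the pages whose text matches the rule.
    static List<Integer> findSplitPages(List<String> pageTexts, String splitRule) {
        // Compile once; Pattern instances are immutable and safe to share across threads.
        Pattern rule = Pattern.compile(splitRule, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

        return IntStream.rangeClosed(1, pageTexts.size())
                .parallel() // pages are scanned independently of each other
                .filter(i -> rule.matcher(pageTexts.get(i - 1)).find())
                .boxed()
                .collect(Collectors.toList()); // encounter order is preserved
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
                "Test report A\nbody text",
                "continued body",
                "intro line\ntest report B");
        System.out.println(findSplitPages(pages, "^Test")); // prints [1, 3]
    }
}
```

Each page gets its own Matcher, so only the immutable Pattern is shared between threads. Whether the parallel scan pays off depends on how expensive the per-page work is, so it is worth measuring both variants.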
Here is a sample code snippet (in Java; the .NET API is analogous) that demonstrates how to run the TextFragmentAbsorber page by page:
import com.aspose.pdf.Document;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.aspose.pdf.TextSearchOptions;

public class OptimizeTextSearch {
    public static void main(String[] args) {
        Document pdfDocument = new Document("largeDocument.pdf");
        try {
            String splitRule = "^Test"; // Example pattern
            // Process each page individually
            for (int i = 1; i <= pdfDocument.getPages().size(); i++) {
                // true = treat splitRule as a regular expression; a fresh absorber
                // per page keeps each page's results separate
                TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(
                        splitRule, new TextSearchOptions(true));
                pdfDocument.getPages().get_Item(i).accept(textFragmentAbsorber);
                TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
                // Process the found fragments as needed
            }
        } finally {
            pdfDocument.close(); // release resources held by the document
        }
    }
}
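Once the pages on which the rule matches are known, they still have to be turned into page ranges for the output files. The following is a minimal sketch of that step, assuming each match starts a new part and that any pages before the first match form a part of their own. PageRangePlanner and planRanges are hypothetical names, and no Aspose.PDF call is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns the page numbers where the rule matched
// into inclusive page ranges, one range per output part.
final class PageRangePlanner {

    // Each int[] is {firstPage, lastPage}, 1-based and inclusive.
    // splitPages is assumed to be sorted in ascending order.
    static List<int[]> planRanges(List<Integer> splitPages, int pageCount) {
        List<int[]> ranges = new ArrayList<>();
        int start = 1;
        for (int splitPage : splitPages) {
            if (splitPage > start) {
                // Close the part that ends just before this match.
                ranges.add(new int[] {start, splitPage - 1});
            }
            start = splitPage;
        }
        if (pageCount >= start) {
            // The last part runs to the end of the document.
            ranges.add(new int[] {start, pageCount});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Matches on pages 1, 4 and 9 of a 12-page document.
        for (int[] range : planRanges(List.of(1, 4, 9), 12)) {
            System.out.println(range[0] + "-" + range[1]);
        }
        // prints 1-3, 4-8 and 9-12 on separate lines
    }
}
```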
By implementing these strategies, you should see an improvement in the performance of your PDF processing tasks.
Have you tried version 25.3 of the API? If the issue still persists, please share your sample PDF file with us, along with how long the API takes to process it and details of your environment (e.g. CPU, RAM). We will then investigate further.
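When reporting timings, it helps to measure the search call in isolation and to include basic environment details in the same report. Below is a minimal sketch of that, shown in Java to match the snippet above (in .NET, Stopwatch serves the same purpose). TimingReport is a hypothetical helper, and the Runnable stands in for the absorber call being measured.

```java
import java.util.Locale;

// Hypothetical helper for collecting the numbers requested in a support report.
final class TimingReport {

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Note: maxMemory() is the JVM's maximum heap, not the machine's physical RAM.
    static String describeEnvironment() {
        Runtime rt = Runtime.getRuntime();
        return String.format(Locale.ROOT, "java=%s os=%s cores=%d maxHeapMiB=%d",
                System.getProperty("java.version"),
                System.getProperty("os.name"),
                rt.availableProcessors(),
                rt.maxMemory() / (1024 * 1024));
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Placeholder workload; replace with the call being measured.
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
        });
        System.out.println(describeEnvironment() + " elapsedMs=" + elapsed);
    }
}
```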