We are currently using the doc.ExtractPages(i, 1) method to implement custom logic for removing blank pages from documents. However, we are running into a performance problem when processing large documents.
The document contains over 2,000 pages. While the extraction process is relatively fast for the initial 100 to 300 pages, the performance significantly degrades beyond that point, with the overall execution time reaching nearly 12 hours.
We would appreciate any suggestions or alternative approaches—such as using Word-based methods or other optimization techniques—to improve the speed and efficiency of blank page removal.
@Nagasrinu
Cause
The performance issue you’re experiencing with the doc.ExtractPages(i, 1) method when processing large documents is likely due to the overhead of extracting pages one at a time. This inefficiency becomes more pronounced as the document size increases, leading to significant delays, especially when dealing with thousands of pages.
Solution
To improve the speed and efficiency of blank page removal, consider the following approaches:
- Batch Processing: Instead of extracting pages one by one, identify and remove multiple blank pages in a single operation. This reduces the overhead associated with multiple method calls.
- Use DocumentBuilder: Utilize the DocumentBuilder class to traverse through the document and check for blank pages. Maintain a list of pages to keep and then create a new document with only the non-blank pages.
- Optimize Page Check Logic: Ensure that your logic for determining whether a page is blank is efficient. For example, check for specific conditions that define a blank page (like no text, images, or other content) before deciding to remove it.
- Parallel Processing: If your environment allows, consider processing pages in parallel to take advantage of multi-threading, which can significantly speed up the operation.
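As a rough illustration of the batch idea, the sketch below groups a sorted list of page indices to keep into consecutive runs, so that each run could be passed to a single extractPages(start, count) call instead of one call per page. The class and method names (PageRuns, toRuns) are hypothetical and the code is plain Java with no Aspose dependency; whether fewer, larger extraction calls actually help on a given document is an assumption to verify by measurement.

```java
import java.util.ArrayList;
import java.util.List;

final class PageRuns {
    /**
     * Groups a sorted list of zero-based page indices into consecutive runs.
     * Each returned array holds {startIndex, pageCount}.
     */
    static List<int[]> toRuns(List<Integer> sortedPages) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < sortedPages.size()) {
            int start = sortedPages.get(i);
            int count = 1;
            // Extend the run while the next index directly follows the previous one.
            while (i + count < sortedPages.size()
                    && sortedPages.get(i + count) == start + count) {
                count++;
            }
            runs.add(new int[] {start, count});
            i += count;
        }
        return runs;
    }

    public static void main(String[] args) {
        // Pages 0-2 and 5-6 are kept; pages 3, 4 and 7 were judged blank.
        for (int[] run : toRuns(List.of(0, 1, 2, 5, 6))) {
            System.out.println("start=" + run[0] + " count=" + run[1]);
        }
    }
}
```

With the sample input, the two runs are (start 0, count 3) and (start 5, count 2), so five single-page extractions would collapse into two calls.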
Code
Here’s a simplified example using DocumentBuilder to check for blank pages:
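import com.aspose.words.*;
import java.util.ArrayList;
import java.util.List;
Document doc = new Document("input.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
List<Integer> pagesToKeep = new ArrayList<>();
// Read the page count once; computing it requires the document layout.
int pageCount = doc.getPageCount();
for (int i = 0; i < pageCount; i++) {
    // NOTE: moveToPage is kept from the original reply; it does not appear to be a
    // DocumentBuilder method, and isPageBlank is a helper you must write yourself.
    builder.moveToPage(i);
    if (!isPageBlank(builder)) {
        pagesToKeep.add(i);
    }
}
// Create a new document with only the non-blank pages
Document newDoc = new Document();
// A new Document starts with one empty section; clear it to avoid a leading blank page.
newDoc.removeAllChildren();
for (int pageIndex : pagesToKeep) {
    newDoc.appendDocument(doc.extractPages(pageIndex, 1), ImportFormatMode.KEEP_SOURCE_FORMATTING);
}
newDoc.save("output.docx");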
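The isPageBlank helper used above is not defined in the reply. One possible building block, shown here as a sketch only, is a text-level check that treats a page as blank when its text contains nothing but whitespace and control characters. The names BlankPageText and isBlankText are hypothetical, and how the page text is obtained (for instance by calling getText() on a single-page document) is an assumption about your code. A text-only check will also report pages that contain only images or shapes as blank, so it may need to be combined with a check for those.

```java
final class BlankPageText {
    /**
     * Returns true when the text holds nothing but whitespace, space separators
     * (such as non-breaking spaces) and control characters (such as page breaks).
     */
    static boolean isBlankText(String pageText) {
        if (pageText == null) {
            return true;
        }
        for (int i = 0; i < pageText.length(); ) {
            int cp = pageText.codePointAt(i);
            boolean ignorable = Character.isWhitespace(cp)
                    || Character.isSpaceChar(cp)
                    || Character.isISOControl(cp);
            if (!ignorable) {
                return false; // Found a visible character, so the page has text.
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBlankText("\r\f \u00A0"));     // only breaks and spaces
        System.out.println(isBlankText("\fChapter 1\r"));   // contains visible text
    }
}
```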
Next Steps
- Implement the batch processing or DocumentBuilder approach in your code.
- Test the performance with a document of similar size to evaluate improvements.
- If performance issues persist, profile your code to identify bottlenecks.
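To see where the slowdown starts, it can help to log elapsed time per batch of pages rather than only the total. The following is a minimal sketch of such a timing harness in plain Java; BatchTimer and timeBatches are hypothetical names, and the per-page callback is a placeholder for whatever your code does with each page (for example, the extraction call).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

final class BatchTimer {
    /**
     * Runs perPage for every page index in [0, pageCount) and returns the
     * elapsed time in milliseconds for each batch of batchSize pages.
     */
    static List<Long> timeBatches(int pageCount, int batchSize, IntConsumer perPage) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Long> millis = new ArrayList<>();
        for (int start = 0; start < pageCount; start += batchSize) {
            int end = Math.min(start + batchSize, pageCount);
            long t0 = System.nanoTime();
            for (int page = start; page < end; page++) {
                perPage.accept(page);
            }
            millis.add((System.nanoTime() - t0) / 1_000_000);
        }
        return millis;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace the lambda body with the real per-page work.
        List<Long> times = timeBatches(2000, 100, page -> Math.sqrt(page));
        for (int b = 0; b < times.size(); b++) {
            System.out.println("batch " + b + ": " + times.get(b) + " ms");
        }
    }
}
```

If the per-batch times climb steadily as the page index grows, the cost is tied to the position in the document rather than to the blank-page check itself.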
Please let me know if you need further assistance or clarification!
@Nagasrinu As you may know, MS Word documents are flow documents by nature, so they have no concept of a “page”. Consumer applications reflow the document content into pages on the fly. Aspose.Words does the same: to extract pages, the document layout must first be built. This is a complex and resource-intensive task, and the time needed to build the layout grows non-linearly with document size. So unfortunately, there is no way to make building the document layout faster for huge MS Word documents.
Have you tried the built-in blank page removal method, Document.RemoveBlankPages?
Hi,
I am using version 23.1.0 of Aspose.Words, and it does not show the [Document.RemoveBlankPages] option. Please let me know which version added this method.