We use Aspose.Words for document comparison. Even small, text-only files can contain around 10k pages, and sending such files to document comparison hurts both performance and memory usage.
We need a way to limit the page count of DOCX files that does not add much latency to the service.
To determine the page count of a DOCX file and restrict the number of pages for processing, you can utilize the Aspose.Words library. Here’s how you can achieve this:
- Get the Page Count: You can access the PageCount property of the Document class to retrieve the total number of pages in the document.
- Restrict to a Certain Number of Pages: If you want to limit the number of pages processed (for example, to 500 pages), you can extract only the desired range of pages using the ExtractPages method.
Here’s a sample code snippet demonstrating how to implement this:
using System;
using Aspose.Words;

class Program
{
    static void Main()
    {
        // Load the document
        Document doc = new Document("path/to/your/document.docx");

        // Get the total number of pages (this builds the document layout)
        int totalPages = doc.PageCount;
        Console.WriteLine($"Total Pages: {totalPages}");

        // Define the maximum number of pages to process
        int maxPages = 500;
        int pagesToProcess = Math.Min(totalPages, maxPages);

        // Extract the pages: zero-based start index, then the number of pages
        Document extractedPages = doc.ExtractPages(0, pagesToProcess);

        // Save the extracted pages to a new document
        extractedPages.Save("path/to/save/extracted_pages.docx");
    }
}
Explanation:
- The code first loads the DOCX file into a Document object.
- It retrieves the total page count using doc.PageCount.
- It then determines how many pages to process, ensuring it does not exceed the specified limit (500 in this case).
- Finally, it extracts the pages from the document and saves them into a new DOCX file.
This approach helps manage performance and memory usage by limiting the number of pages sent for comparison or processing, thus reducing latency in your service.
If you have further questions or need additional assistance, feel free to ask!
@abhishek.sonkar As you may know, MS Word documents are flow documents by nature and have no concept of a “page”. Consumer applications reflow the document’s content into pages on the fly. When you read the Document.PageCount property, Aspose.Words builds the document layout, which is itself a resource-consuming operation. In your case, you can probably limit the document size by paragraph count instead. This does not require building the document layout.
@abhishek.sonkar You can try using code like the following:
Document doc = new Document("C:\\Temp\\in.docx");
truncateDocument(doc);
doc.save("C:\\Temp\\out.docx");

private static void truncateDocument(Document doc)
{
    int maxParagraphs = 10;
    int count = 0;
    Paragraph lastKnownBodyPara = null;
    for (Paragraph para : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
    {
        // Truncate document only at the paragraph on the body level.
        if (para.getParentNode().getNodeType() == NodeType.BODY)
            lastKnownBodyPara = para;

        count++;
        if (count > maxParagraphs && lastKnownBodyPara != null)
        {
            // Remove section after the paragraph if any.
            Section sect = lastKnownBodyPara.getParentSection();
            while (sect.getNextSibling() != null)
                sect.getNextSibling().remove();

            // Remove nodes after the paragraph.
            while (lastKnownBodyPara.getNextSibling() != null)
                lastKnownBodyPara.getNextSibling().remove();

            break;
        }
    }
}
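As a rough pre-check before the file is loaded into Aspose.Words at all, the paragraph count can also be estimated directly from the DOCX package, since a DOCX file is a ZIP archive of XML parts. The sketch below is not an Aspose.Words API and its class and method names are made up for illustration. It assumes the main document part is stored at word/document.xml (the usual location, although the format allows other names) and counts every w:p element there, so the result is only an approximation of what Aspose.Words would report (headers and footers, for example, live in other parts).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Aspose.Words.
class DocxParagraphPrecheck {

    // Returns true as soon as more than maxParagraphs w:p elements are seen
    // in word/document.xml (assumed location of the main document part).
    static boolean exceedsParagraphLimit(InputStream docx, int maxParagraphs)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Input is untrusted: disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                XMLStreamReader xml = factory.createXMLStreamReader(zip);
                int count = 0;
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT
                            && "p".equals(xml.getLocalName())
                            && xml.getNamespaceURI() != null
                            && xml.getNamespaceURI().contains("wordprocessingml")) {
                        // Stop early: the rest of the part is never parsed.
                        if (++count > maxParagraphs) {
                            return true;
                        }
                    }
                }
                return false;
            }
        }
        throw new IOException("word/document.xml not found in package");
    }
}
```

Because the scan stops at the first paragraph over the limit, an oversized file can be rejected without parsing the whole part; files that pass can then go through truncateDocument or comparison as before.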
Hi @alexy.noskov, I went through the docs and found that calling doc.getPageCount() renders each page to get the actual page count, and that we can register a callback for PageLayoutEvent (Aspose.Words for Java) that fires as each page is rendered. Is that right? So I tried the following: to allow documents of up to 500 pages only, I register this callback during validation and throw an error at the 501st page. This way, even if the document has 10K pages, rendering should stop at the 501st page. Will it really stop rendering when we throw an error in the callback? Could you please check the code below and let me know whether this is the correct approach or whether it might cause any performance issues?
private static class RenderPageLayoutCallback implements IPageLayoutCallback {
    public void notify(PageLayoutCallbackArgs a) throws Exception {
        switch (a.getEvent()) {
            case PageLayoutEvent.PART_REFLOW_FINISHED:
                notifyPartFinished(a);
                break;
            case PageLayoutEvent.CONVERSION_FINISHED:
                break;
        }
    }

    private void notifyPartFinished(PageLayoutCallbackArgs a) throws Exception {
        System.out.println(MessageFormat.format("Part at page {0} reflow.", a.getPageIndex() + 1));
        // Page index is zero-based. Allow up to 500 pages and fail on the 501st.
        if (a.getPageIndex() + 1 > 500) {
            throw new Exception("Page limit reached");
        }
    }
}

public static void main() {
    try {
        Document docFile = new Document("./src/generate/doc_v1.docx");
        docFile.getLayoutOptions().setCallback(new RenderPageLayoutCallback());
        int noOfPages = docFile.getPageCount();
    } catch (Exception e) {
        // catch validation error message here and communicate same to user
    }
}
@Kldv Yes, you can use this approach. However, it will still perform worse than limiting the size of the flow document content.
Hi @alexey.noskov, thanks for your reply.
I am trying different ways to validate that the input document does not contain more than 500 pages; if it does, I need to throw an error. To do this, I combined LoadingProgressCallback and IPageLayoutCallback.
My plan is:
Step 1:
While loading the document, I set a time limit and check progress using LoadingProgressCallback. If loading exceeds the time limit, I throw an error to prevent loading large documents.
Step 2:
Even if the document loads within the time limit, it may still contain more than 500 pages. To check that, I used the IPageLayoutCallback example from How can we know page number of docx files and can we restrict to certain 500 pages - #6 by Kldv.
The problem I am facing is that if the document contains many images, docFile.getPageCount() takes a long time even when the document has only ~20 pages. I thought of skipping image loading while getting the page count using IResourceLoadingCallback, but it is not triggered in this process. Could you please review my code to see if I am doing anything wrong? I know it is a long thread, but I have been stuck on this for the past 2 days, and your perspective would be very useful here.
source code:
import com.aspose.words.*;
import java.text.MessageFormat;
import java.util.Date;

public static void main(String[] args) throws Exception {
    loadAsposeLicense();
    try {
        LoadingProgressCallback progressCallback = new LoadingProgressCallback();
        HandleImageResourceLoading imageLoader = new HandleImageResourceLoading();
        LoadOptions loadOptions = new LoadOptions();
        loadOptions.setResourceLoadingCallback(imageLoader);
        loadOptions.setProgressCallback(progressCallback);

        long startTime = System.nanoTime();
        Document docFile = new Document("<test-doc>.docx", loadOptions);
        long endTime = System.nanoTime();
        long duration = endTime - startTime;
        System.out.println("Document loaded, took ms:" + duration / 1_000_000);

        docFile.getLayoutOptions().setCallback(new RenderPageLayoutCallback());
        System.out.println("Number of page :" + docFile.getPageCount());
    } catch (Exception ex) {
        System.out.println("Error while getting page count :" + ex.getMessage());
    }
}

public static class HandleImageResourceLoading implements IResourceLoadingCallback {
    @Override
    public int resourceLoading(ResourceLoadingArgs arg0) throws Exception {
        if (arg0.getResourceType() == ResourceType.IMAGE) {
            return ResourceLoadingAction.SKIP;
        }
        return ResourceLoadingAction.DEFAULT;
    }
}

public static class LoadingProgressCallback implements IDocumentLoadingCallback {
    public LoadingProgressCallback() {
        mLoadingStartedAt = new Date();
    }

    public void notify(DocumentLoadingArgs args) {
        Date canceledAt = new Date();
        // Compare in milliseconds: converting to whole seconds first truncates,
        // so a 0.3 s limit would only trigger after a full second.
        long elapsedMillis = canceledAt.getTime() - mLoadingStartedAt.getTime();
        if (elapsedMillis > MAX_DURATION_MILLIS)
            throw new IllegalStateException(MessageFormat.format("EstimatedProgress = {0}; CanceledAt = {1}", args.getEstimatedProgress(), canceledAt));
    }

    private final Date mLoadingStartedAt;
    private static final long MAX_DURATION_MILLIS = 300; // 0.3 seconds
}

public static class RenderPageLayoutCallback implements IPageLayoutCallback {
    public void notify(PageLayoutCallbackArgs a) throws Exception {
        switch (a.getEvent()) {
            case PageLayoutEvent.PART_REFLOW_FINISHED:
                notifyPartFinished(a);
                break;
            case PageLayoutEvent.CONVERSION_FINISHED:
                break;
        }
    }

    private void notifyPartFinished(PageLayoutCallbackArgs a) throws Exception {
        // Page index is zero-based. Fail on the 501st page.
        if (a.getPageIndex() + 1 > 500) {
            throw new Exception("Page limit reached...");
        }
    }
}
file:
test1_doc_1000.docx (370.9 KB)
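The time limit from step 1 can also be kept in a small helper that is independent of Aspose.Words, which makes it easy to unit test and to reuse for the layout phase. This is only a sketch: the LoadDeadline class and its method names are hypothetical, and it uses System.nanoTime() rather than Date because nanoTime is intended for measuring elapsed time and is not affected by wall-clock adjustments.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Aspose.Words.
class LoadDeadline {
    private final LongSupplier nanoClock;
    private final long startNanos;
    private final long limitNanos;

    // The clock is injectable so the check can be tested without sleeping.
    LoadDeadline(long limitMillis, LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.startNanos = nanoClock.getAsLong();
        this.limitNanos = limitMillis * 1_000_000L;
    }

    // Starts a deadline measured with the JVM's monotonic timer.
    static LoadDeadline startingNow(long limitMillis) {
        return new LoadDeadline(limitMillis, System::nanoTime);
    }

    long elapsedMillis() {
        return (nanoClock.getAsLong() - startNanos) / 1_000_000L;
    }

    // Throws once more than the configured time has elapsed.
    void check() {
        if (nanoClock.getAsLong() - startNanos > limitNanos) {
            throw new IllegalStateException(
                    "Time limit exceeded after " + elapsedMillis() + " ms");
        }
    }
}
```

A callback such as the LoadingProgressCallback above could create one instance with LoadDeadline.startingNow(300) in its constructor and call check() from notify; the same instance could also be checked from the page layout callback so that loading and layout share one time budget.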
@abhishek.sonkar IResourceLoadingCallback is called only while loading external resources. In your case the image is embedded in the document, so, as expected, the callback is not called.
You are right, building the document layout for this document is slow. We will investigate what causes this.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): WORDSNET-27988
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.