How can we know page number of docx files and can we restrict to certain 500 pages

We are using Aspose.Words for document comparison, and we have seen that even small, text-only files can contain around 10k pages. Sending such files to document comparison hurts both performance and memory usage.
We need a mechanism to restrict the page count of DOCX files without adding much latency to the service.

@abhishek.sonkar

To determine the page count of a DOCX file and restrict the number of pages for processing, you can utilize the Aspose.Words library. Here’s how you can achieve this:

  1. Get the Page Count: You can access the PageCount property of the Document class to retrieve the total number of pages in the document.

  2. Restrict to a Certain Number of Pages: If you want to limit the number of pages processed (for example, to 500 pages), you can extract only the desired range of pages using the ExtractPages method.

Here’s a sample code snippet demonstrating how to implement this:

using System;
using Aspose.Words;

class Program
{
    static void Main()
    {
        // Load the document
        Document doc = new Document("path/to/your/document.docx");

        // Get the total number of pages
        int totalPages = doc.PageCount;
        Console.WriteLine($"Total Pages: {totalPages}");

        // Define the maximum number of pages to process
        int maxPages = 500;
        int pagesToProcess = Math.Min(totalPages, maxPages);

        // Extract the first pagesToProcess pages (zero-based start index, page count)
        Document extractedPages = doc.ExtractPages(0, pagesToProcess);
        
        // Save the extracted pages to a new document
        extractedPages.Save("path/to/save/extracted_pages.docx");
    }
}

Explanation:

  • The code first loads the DOCX file into a Document object.
  • It retrieves the total page count using doc.PageCount.
  • It then determines how many pages to process, ensuring it does not exceed the specified limit (500 in this case).
  • Finally, it extracts the pages from the document and saves them into a new DOCX file.

This approach helps manage performance and memory usage by limiting the number of pages sent for comparison or processing, thus reducing latency in your service.

If you have further questions or need additional assistance, feel free to ask!

@abhishek.sonkar As you may know, MS Word documents are flow documents by nature and have no concept of a “page”. Consumer applications reflow the document’s content into pages on the fly. When you read the Document.PageCount property, Aspose.Words builds the document layout, which is also quite a resource-consuming operation. In your case, you can probably limit the document size by paragraph count instead. This does not require building the document layout.

Could you please help me with how to achieve that, @alexey.noskov?

@abhishek.sonkar You can try using code like the following:

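// Usage: load the document, truncate it in place, and save the result.
Document doc = new Document("C:\\Temp\\in.docx");
truncateDocument(doc);
doc.save("C:\\Temp\\out.docx");

private static void truncateDocument(Document doc)
{
    // Maximum number of paragraphs (counted at any nesting level) to keep.
    int maxParagraphs = 10;
    int count = 0;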
    Paragraph lastKnownBodyPara = null;
        
    for (Paragraph para : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
    {
        // Truncate document only at the paragraph on the body level.
        if (para.getParentNode().getNodeType() == NodeType.BODY)
            lastKnownBodyPara = para;
            
        count++;
        if (count > maxParagraphs && lastKnownBodyPara != null)
        {
            // Remove section after the paragraph if any.
            Section sect = lastKnownBodyPara.getParentSection();
            while (sect.getNextSibling() != null)
                sect.getNextSibling().remove();
                
            // Remove nodes after the paragraph.
            while (lastKnownBodyPara.getNextSibling() != null)
                lastKnownBodyPara.getNextSibling().remove();
                
            break;
        }
    }
}
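
If even loading the file into a Document is too costly, a rough pre-check can be done before Aspose.Words is involved at all. The sketch below is not an Aspose.Words API. It uses only the Java standard library and assumes that the main document part is stored at word/document.xml inside the DOCX package and uses the transitional WordprocessingML namespace (typical for files saved by MS Word, but not guaranteed). The class and method names are hypothetical. It streams the XML and stops as soon as the paragraph limit is exceeded:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Aspose.Words.
final class DocxParagraphPrecheck {
    // Assumption: transitional OOXML namespace (Strict OOXML uses a different one).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Counts w:p elements in word/document.xml, stopping early once the count
     * exceeds the limit. Returns -1 if the part is not found.
     */
    static int countParagraphs(InputStream docx, int limit)
            throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main part lives at the conventional path.
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                XMLInputFactory factory = XMLInputFactory.newInstance();
                // Do not process DTDs or external entities in untrusted input.
                factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
                factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
                XMLStreamReader xml = factory.createXMLStreamReader(zip);
                int count = 0;
                try {
                    while (xml.hasNext()) {
                        if (xml.next() == XMLStreamConstants.START_ELEMENT
                                && "p".equals(xml.getLocalName())
                                && W_NS.equals(xml.getNamespaceURI())) {
                            count++;
                            if (count > limit) {
                                return count; // over the limit, no need to read further
                            }
                        }
                    }
                } finally {
                    xml.close();
                }
                return count;
            }
        }
        return -1;
    }
}
```

A caller could reject the upload when countParagraphs(stream, maxParagraphs) returns a value greater than maxParagraphs. Like the code above, this counts paragraphs at any nesting level in the main document part (including those inside tables); headers, footers, and notes live in other parts and are not counted.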

Hi @alexey.noskov, I went through the docs and found that calling doc.getPageCount() renders each page to get the actual page count, and that we can register a callback that receives a PageLayoutEvent (PageLayoutEvent | Aspose.Words for Java) as each page is rendered, right? So I tried this: to allow documents of up to 500 pages only, I register the callback during validation and throw an error at the 501st page. That way, even if the document has 10K pages, we can stop rendering at the 501st page. Will it really stop rendering when we throw an error in the callback? Could you please check the code below and let me know whether this is the correct approach or whether it might cause any performance issues?

private static class RenderPageLayoutCallback implements IPageLayoutCallback {
	public void notify(PageLayoutCallbackArgs a) throws Exception {
		switch (a.getEvent()) {
			case PageLayoutEvent.PART_REFLOW_FINISHED:
				notifyPartFinished(a);
				break;
			case PageLayoutEvent.CONVERSION_FINISHED:
				break;
		}
	}

	private void notifyPartFinished(PageLayoutCallbackArgs a) throws Exception {
		System.out.println(MessageFormat.format("Part at page {0} reflow.", a.getPageIndex() + 1));
		// The code treats getPageIndex() as zero-based; abort once the 501st page is reached.
		if (a.getPageIndex() + 1 > 500) {
			throw new Exception("Page limit reached");
		}
	}
}

public static void main(String[] args) {
	try {
		Document docFile = new Document("./src/generate/doc_v1.docx");
		docFile.getLayoutOptions().setCallback(new RenderPageLayoutCallback());
		// Building the layout triggers the callback; an exception thrown there aborts it.
		int noOfPages = docFile.getPageCount();
	} catch (Exception e) {
		// Catch the validation error here and communicate it to the user.
	}
}

@Kldv Yes, you can use this approach. However, it will still perform worse than limiting the size of the flow document content.
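
One practical detail with the throw-from-callback approach: a plain Exception with a message is hard to tell apart from a genuine load or layout failure in the catch block. The sketch below is one way to make the limit check explicit. The class names are hypothetical and contain no Aspose.Words calls; the callback's notify method would simply delegate to onPageFinished(a.getPageIndex()). Whether the library rethrows the exception unchanged or wraps it is not confirmed in this thread, so the helper also walks the cause chain.

```java
// Hypothetical types for illustration; not part of Aspose.Words.
final class PageLimitExceededException extends Exception {
    private final int limit;

    PageLimitExceededException(int limit) {
        super("Document exceeds " + limit + " pages");
        this.limit = limit;
    }

    int getLimit() {
        return limit;
    }
}

final class PageLimitGuard {
    private final int maxPages;

    PageLimitGuard(int maxPages) {
        this.maxPages = maxPages;
    }

    /** Call once per finished page; pageIndex is zero-based, as in the thread's code. */
    void onPageFinished(int pageIndex) throws PageLimitExceededException {
        if (pageIndex + 1 > maxPages) {
            throw new PageLimitExceededException(maxPages);
        }
    }

    /** True if the failure, or any of its causes, is the page-limit signal. */
    static boolean isPageLimitFailure(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof PageLimitExceededException) {
                return true;
            }
        }
        return false;
    }
}
```

In the catch block around getPageCount(), isPageLimitFailure(e) then separates "too many pages" from any other error, which should be reported differently to the user.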

Hi @alexey.noskov, thanks for your reply.

I am trying different ways to validate that the input document does not contain more than 500 pages; if it does, I need to throw an error. So I used a combination of LoadingProgressCallback and IPageLayoutCallback.

My plan is:

Step 1:
While loading the document, I set a time limit and check progress using LoadingProgressCallback. If loading exceeds the time limit, I throw an error to prevent loading large documents.

Step 2:
Even if the document loads within the specified time limit, it may still contain more than 500 pages. To check that, I used the IPageLayoutCallback example from How can we know page number of docx files and can we restrict to certain 500 pages - #6 by Kldv

The problem I am facing is that if the document contains many images, docFile.getPageCount() takes a long time even if the document has only ~20 pages. I thought of skipping image loading while getting the page count by using IResourceLoadingCallback, but it is not triggered in this process. Could you please review my code to see if I am doing anything wrong? I know it is a long thread, but I have been stuck on this for the past 2 days, and your perspective would be very useful here.

source code:

public static void main(String[] args) throws Exception {
	loadAsposeLicense();
	try {
		LoadingProgressCallback progressCallback = new LoadingProgressCallback();
		HandleImageResourceLoading imageLoader = new HandleImageResourceLoading();
		LoadOptions loadOptions = new LoadOptions();
		loadOptions.setResourceLoadingCallback(imageLoader);
		loadOptions.setProgressCallback(progressCallback);
		long startTime = System.nanoTime();
		Document docFile = new Document("<test-doc>.docx", loadOptions);
		long endTime = System.nanoTime();
		long duration = endTime - startTime;
		System.out.println("Document loaded, took ms:" + duration / 1_000_000);
		docFile.getLayoutOptions().setCallback(new RenderPageLayoutCallback());
		System.out.println("Number of page :" + docFile.getPageCount());
	} catch (Exception ex) {
		System.out.println("Error while getting page count :" + ex.getMessage());
	}
}

public static class HandleImageResourceLoading implements IResourceLoadingCallback {
	@Override
	public int resourceLoading(ResourceLoadingArgs arg0) throws Exception {
		if (arg0.getResourceType() == ResourceType.IMAGE) {
			return ResourceLoadingAction.SKIP;
		}
		return ResourceLoadingAction.DEFAULT;
	}
}

public static class LoadingProgressCallback implements IDocumentLoadingCallback {
	public LoadingProgressCallback() {
		mLoadingStartedAt = new Date();
	}

	public void notify(DocumentLoadingArgs args) {
		Date canceledAt = new Date();
		long diff = canceledAt.getTime() - mLoadingStartedAt.getTime();
		// Keep fractional seconds: converting to whole seconds truncates,
		// so a 0.3 s limit would only trigger after a full second.
		double elapsedSeconds = diff / 1000.0;
		if (elapsedSeconds > MAX_DURATION)
			throw new IllegalStateException(MessageFormat.format("EstimatedProgress = {0}; CanceledAt = {1}", args.getEstimatedProgress(), canceledAt));
	}

	private final Date mLoadingStartedAt;
	// Maximum allowed loading time, in seconds.
	private static final double MAX_DURATION = 0.3;
}

public static class RenderPageLayoutCallback implements IPageLayoutCallback {
	public void notify(PageLayoutCallbackArgs a) throws Exception {
		switch (a.getEvent()) {
			case PageLayoutEvent.PART_REFLOW_FINISHED:
				notifyPartFinished(a);
				break;
			case PageLayoutEvent.CONVERSION_FINISHED:
				break;
		}
	}

	private void notifyPartFinished(PageLayoutCallbackArgs a) throws Exception {
		if (a.getPageIndex() + 1 > 500) {
			throw new Exception("Page limit reached...");
		}
	}
}
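
The progress callback above is only consulted while the document is loading, so it does not bound the time spent in getPageCount(). One possible complement is a hard time limit placed around the whole validation from the outside. The sketch below uses only java.util.concurrent; the class and method names are hypothetical. It assumes that abandoning the worker is acceptable: cancel(true) merely interrupts the thread, and whether Aspose.Words reacts to interruption is not confirmed in this thread, so the work may keep running in the background for a while.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Aspose.Words.
final class TimeBoxedValidation {
    /**
     * Runs the task on a worker thread and returns its result, or throws
     * TimeoutException if it does not finish within timeoutMillis.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "doc-validation");
            // Daemon thread: a task that ignores interruption will not block JVM exit.
            worker.setDaemon(true);
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: this only interrupts the worker
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the code from this post, the task would wrap both steps, for example a lambda that creates the Document with the load options, registers the layout callback, and returns getPageCount(). Exceptions thrown inside the task arrive wrapped in an ExecutionException, so the page-limit error has to be read from getCause().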

file:

test1_doc_1000.docx (370.9 KB)

@abhishek.sonkar IResourceLoadingCallback is called only when loading external resources. In your case the image is embedded in the document, so, as expected, the callback is not called.
You are right, building the document layout for this document is slow. We will investigate what causes this.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27988
