I have an issue where finding the PageCount of a document can take a very long time for documents with many pages. In testing we found that the delay is caused by the number of pages, not the file size. A docx which is 5MB and 360 pages takes a few seconds to find the PageCount, but a docx which is 1MB and 7,000 pages can take over 1 minute.
Are there plans to improve the performance of this?
If I use OpenXML to find the PageCount on either the 360 page document or the 7,000 page document it takes around 1 second instead of a few seconds or a minute.
We would like to use Aspose in our solution. Are there other calls we can make that would find the number of pages in a document faster?
Thanks for your inquiry. Please note that performance and memory usage depend on the complexity and size of the documents you are processing. When rendering a document to fixed-page formats (e.g. PDF), Aspose.Words needs to build two models in memory: one for the document and the other for the rendered document.
The process of building the layout model is not linear; it may take a minute to render one page and only a few seconds to render 100 pages. Also, Aspose.Words has to create the APS (Aspose Page Specification) model in memory, and this may take additional time for some documents. Rest assured, we are always working on improving performance, but rendering will always be slower than simply saving to flow formats (e.g. doc/docx).
Document.PageCount invokes page layout, which builds the document layout in memory, so with large documents this property can take time. After invoking this property, any rendering operation (e.g. rendering to PDF or image) will be instantaneous.
Could you please share your input document along with your OpenXML code here for our reference? We will investigate the issue on our side and provide you with more information.
You apparently do not allow uploading .cs files, so the example code with timers is pasted below.
September 20: I tried to update the formatting of this to make it a bit clearer. Apparently the only way to get this to look formatted like it would in Notepad++ or Visual Studio is to add a bunch of HTML tags. So, sorry that the formatting cannot be clearly read on the forum.
using System;
using System.Diagnostics;
using System.IO;
using Aspose.Words;
using DocumentFormat.OpenXml.Packaging;

namespace AsposeUpload
{
    class Program
    {
        /// <summary>
        /// Path to input word document
        /// </summary>
        private static string _myFilePath = Path.Combine(CurrentExecutingDirectory, "MyFile.docx"); // - 7011 pages

        /// <summary>
        /// Aspose Words license
        /// </summary>
        private static License asposeLicense;

        static void Main(string[] args)
        {
            // IMPORTANT: A VALID LICENSE IS REQUIRED OTHERWISE ASPOSE.PAGECOUNT RETURNS A NUMBER BASED ON A TRUNCATED DOCUMENT!
            // THE LICENSE HAS BEEN REMOVED SO A NEW LICENSE WILL NEED TO BE ADDED TO GET CORRECT PAGE COUNT!
            // Call asposeLicense.SetLicense(...) with your own license before running.
            asposeLicense = new License();
            Console.WriteLine("Rough timing metrics to compare operations between Aspose and OpenXML\n");

            // - Get page count using Open XML
            int openXMLPageCount = getPageCountOpenXML();
            Console.WriteLine("OpenXML pageCount = {0}", openXMLPageCount);
            Console.WriteLine();

            // - Get page count using Aspose
            int asposePageCount = getPageCountAspose();
            Console.WriteLine("Aspose pageCount = {0}", asposePageCount);
            Console.ReadKey();
        }

        /// <summary>
        /// Get page count using aspose
        /// </summary>
        private static int getPageCountAspose()
        {
            int asposePageNumber = 0;
            // Open read-only; FileMode.OpenOrCreate would silently create an empty file if the path is wrong.
            using (FileStream fileStream = new FileStream(_myFilePath, FileMode.Open, FileAccess.Read))
            {
                Stopwatch asposeDocTimer = Stopwatch.StartNew();
                Document asposeDoc = new Document(fileStream); //takes a few seconds
                asposeDocTimer.Stop();
                Console.WriteLine("Loading stream into Aspose document took " + asposeDocTimer.ElapsedMilliseconds + "ms");

                // - Time how long it takes Aspose to calculate the page count
                Stopwatch asposePageCountTimer = Stopwatch.StartNew();
                asposePageNumber = asposeDoc.PageCount; //when a valid license is used this takes a long time.
                asposePageCountTimer.Stop();
                Console.WriteLine("Getting page count via Aspose took " + asposePageCountTimer.ElapsedMilliseconds + "ms");
            }
            return asposePageNumber;
        }

        /// <summary>
        /// Get page count using Open XML
        /// </summary>
        private static int getPageCountOpenXML()
        {
            int openXMLPageNumber = 0;
            using (FileStream fileStream = new FileStream(_myFilePath, FileMode.Open, FileAccess.Read))
            {
                Stopwatch openXMLDocTimer = Stopwatch.StartNew();
                using (WordprocessingDocument document = WordprocessingDocument.Open(fileStream, false))
                {
                    openXMLDocTimer.Stop();
                    Console.WriteLine("Loading stream into OpenXML document took " + openXMLDocTimer.ElapsedMilliseconds + "ms");

                    // - Time how long it takes OpenXML to read the page count.
                    // This reads the value stored in the extended file properties; it throws if that value is absent.
                    Stopwatch openXMLtimer = Stopwatch.StartNew();
                    openXMLPageNumber = int.Parse(document.ExtendedFilePropertiesPart.Properties.Pages.Text);
                    openXMLtimer.Stop();
                    Console.WriteLine("Getting page count via OpenXML took " + openXMLtimer.ElapsedMilliseconds + "ms");
                }
            }
            return openXMLPageNumber;
        }

        private static string CurrentExecutingDirectory
        {
            get
            {
                if (!string.IsNullOrEmpty(AppDomain.CurrentDomain.RelativeSearchPath))
                {
                    // When running as an ASP.NET process this will have a value
                    // pointing to the application's "private" sub-folder (the "bin" directory)
                    return AppDomain.CurrentDomain.RelativeSearchPath;
                }
                // In case this isn't running from a web application
                return AppDomain.CurrentDomain.BaseDirectory;
            }
        }
    }
}
Thanks for sharing the details. With OpenXml you are reading the page count value from the document's properties. You can get the same value using Aspose.Words through the Document.BuiltInDocumentProperties.Pages property. Aspose.Words updates this property when you call Document.UpdatePageLayout.
Document.PageCount invokes page layout, which builds the document layout in memory, so with large documents this property can take time. After invoking this property, any rendering operation (e.g. rendering to PDF or image) will be instantaneous.
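To make the difference between the two properties concrete, here is a minimal sketch (not an official sample) that prints the stored Pages property before and after the layout is built, next to Document.PageCount. The file name and the class name are placeholders, and a valid license is assumed to be applied elsewhere; without one the counts are based on a truncated document.

```csharp
using System;
using System.Diagnostics;
using Aspose.Words;

class PageCountComparison
{
    static void Main(string[] args)
    {
        // Placeholder input path (assumption); pass your own document as the first argument.
        string path = args.Length > 0 ? args[0] : "MyFile.docx";

        Document doc = new Document(path);

        // Value stored in the file's built-in properties; no layout is built to read it.
        int storedPages = doc.BuiltInDocumentProperties.Pages;
        Console.WriteLine("Stored Pages property = {0}", storedPages);

        // Building the page layout is the step that is slow for documents with many pages.
        Stopwatch timer = Stopwatch.StartNew();
        doc.UpdatePageLayout();
        timer.Stop();
        Console.WriteLine("UpdatePageLayout took {0}ms", timer.ElapsedMilliseconds);

        // After the layout is built, the stored property is refreshed.
        Console.WriteLine("Pages property after layout = {0}", doc.BuiltInDocumentProperties.Pages);
        Console.WriteLine("Document.PageCount = {0}", doc.PageCount);
    }
}
```

If the first and last numbers differ, the stored property was stale in the file, which is the risk of relying on it without building the layout.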
The comment on Aspose's Document.BuiltInDocumentProperties.Pages states
“Represents an estimate of the number of pages in the document.”
This indicates the number may be wrong.
Additionally, there are posts online about OpenXML returning an incorrect page count, so if both get the data the same way, they would both be wrong.
Does this mean I need to choose between a very long delay (Aspose's Document.PageCount takes over 20 seconds on the document I provided) and potentially incorrect data (Aspose's Document.BuiltInDocumentProperties.Pages is an estimate)?
Thanks for your inquiry. This is the expected behavior of Aspose.Words. Because you are reading the page count from the document's properties with OpenXml, the value might be wrong. For example, reading the page count of the attached document with OpenXml returns “1”, which is incorrect.
In your case, we suggest using Document.PageCount to get the page count of the document.
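Since the OpenXml value is only whatever was stored in the file, it may also be missing entirely. A minimal sketch of a null-safe read, assuming the Open XML SDK used earlier in this thread (the class name, method name and default file name are illustrative only):

```csharp
using System;
using DocumentFormat.OpenXml.Packaging;

class StoredPageCount
{
    // Returns the stored page count, or null when the extended properties part,
    // the Pages element, or a parsable number is missing.
    static int? ReadStoredPages(string path)
    {
        using (WordprocessingDocument document = WordprocessingDocument.Open(path, false))
        {
            string text = document.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            int pages;
            return int.TryParse(text, out pages) ? pages : (int?)null;
        }
    }

    static void Main(string[] args)
    {
        // Placeholder input path (assumption); pass your own document as the first argument.
        string path = args.Length > 0 ? args[0] : "MyFile.docx";

        int? stored = ReadStoredPages(path);
        Console.WriteLine(stored.HasValue
            ? "Stored page count = " + stored.Value
            : "No stored page count in this document");
    }
}
```

This only avoids the exception on documents without a stored value. It does not make the stored number any more accurate, so Document.PageCount remains the option when the count has to be correct.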