Docx Document taking too long for conversion

Hi,

I am testing Aspose.Words for DOC and DOCX document conversion.
But my testing got stuck on one document. It is taking too long to convert.

It would be helpful if you could provide a solution for it.
The time is spent when we call the getPageCount() method.

@RChilli_Nidhi Could you please attach your document here for testing? We will check the issue and provide you with more information.
Calling the getPageCount() method forces Aspose.Words to build the document layout, which is quite a resource- and time-consuming operation, especially if the document is large. The bigger the document you are processing, the more time is required.

Thanks for the reply!

Attaching a document for the same.
Please delete the document once tested.

Resume.docx (36.4 KB)

@RChilli_Nidhi What is your target format? I have tested conversion to PDF using the latest 22.7 version of Aspose.Words for Java and it took less than one second on my side. The following simple code was used for testing:

Document doc = new Document("C:\\Temp\\in.docx");
doc.save("C:\\Temp\\out.pdf");

I am using an older version, Aspose.Words 14.

Is it working fine with that too?

Below is my code:

License license = new License();
InputStream streamLicense = licenceStream();
license.setLicense(streamLicense);
LoadOptions opts = new LoadOptions();
opts.setResourceLoadingCallback(new HandleResourceLoadingCallback());
doc = new Document(filedat, opts);
streamLicense.close();
filedat.close();
//getpages
pageCount = doc.getPageCount();

@RChilli_Nidhi Version 14 of Aspose.Words was released about 8 years ago. A lot of fixes and improvements have been made since then, so I would suggest using the latest version.
Also, please note that we do not provide hotfixes for old versions of Aspose.Words.

Thanks. Is there a way to solve this using Aspose.Words 14?
All other documents are working fine; we are only facing issues with this document.

Let me know if you have any solution

@RChilli_Nidhi I am afraid we cannot help you resolve this problem with an old version of Aspose.Words. Could you please check when your subscription expires? Just open the license file in Notepad (but take care not to modify and save the license file, or it will no longer work) and look at the SubscriptionExpiry field:

<SubscriptionExpiry>20220218</SubscriptionExpiry>

This means that you can upgrade for free to any version of Aspose.Words published before 02/18/2022.
So you can update to a more recent version, and the issue may not appear with it.
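
If you would rather not open the license file by hand, the field can also be read in code. The following is only a minimal sketch using the JDK: the class and method names are made up for illustration, and it assumes the license text has already been read into a string (read-only, so the file itself is never modified) and that the date uses the yyyyMMdd form shown above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
class SubscriptionExpiryReader {
    // Matches the SubscriptionExpiry element and captures its eight-digit date.
    private static final Pattern EXPIRY =
            Pattern.compile("<SubscriptionExpiry>\\s*(\\d{8})\\s*</SubscriptionExpiry>");

    /** Returns the expiry date found in the license text, or empty if the field is absent. */
    static Optional<LocalDate> parseExpiry(String licenseXml) {
        Matcher m = EXPIRY.matcher(licenseXml);
        if (!m.find()) {
            return Optional.empty();
        }
        // BASIC_ISO_DATE parses dates written as yyyyMMdd, e.g. 20220218.
        return Optional.of(LocalDate.parse(m.group(1), DateTimeFormatter.BASIC_ISO_DATE));
    }
}
```

For example, SubscriptionExpiryReader.parseExpiry(licenseText) could be compared with the release date of the version you plan to upgrade to.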

Our subscription expiry date is 16 May 2014:
20140516

@RChilli_Nidhi With your license you can update to version 14.4 of Aspose.Words for Java, which was released on May 4, 2014.
I would suggest that you renew your subscription and update to the latest version of Aspose.Words for Java.

Hi,

I am now testing with Aspose.Words 22.8. The Resume.docx file is parsed fine with the latest version.

But two other resumes are still taking too long to convert. Kindly check them.

1648786390257-7614b692-f582-4613-8a4f-a4d0c1bb6e57.docx (1.21 MB)

1657520490930-Randhir Pawar_IT PM-CIMB.docx (297 KB)

@RChilli_Nidhi Thank you for the additional information.
It takes about 4 seconds to convert 1648786390257-7614b692-f582-4613-8a4f-a4d0c1bb6e57.docx to PDF and about 2 seconds to render 1657520490930-Randhir Pawar_IT PM-CIMB.docx.
I have checked conversion to PDF in MS Word and it takes approximately the same time. I investigated the documents, since it looks suspicious that documents with purely textual content are so large. I found that the 1657520490930-Randhir Pawar_IT PM-CIMB.docx document contains strange shapes that represent horizontal lines on the first page. With these shapes, the document.xml file inside the DOCX is about 4MB; after removing them it is about 98KB, and conversion to PDF takes less than 1 second:

Document doc = new Document("C:\\Temp\\1657520490930-Randhir Pawar_IT PM-CIMB.docx");
doc.getChildNodes(NodeType.GROUP_SHAPE, true).clear();
doc.save("C:\\Temp\\out.pdf");

As you can see, this is far too much markup for a simple horizontal line, so it looks like interpreting these shapes is what takes the time.

The same problem is in the 1648786390257-7614b692-f582-4613-8a4f-a4d0c1bb6e57.docx document.
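
To spot such documents before processing them, the size of document.xml can be checked without any Aspose.Words call, because a DOCX file is a ZIP archive. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and any size threshold you compare the result against is your own assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Aspose.Words.
class DocxPartSizes {
    /**
     * Counts the uncompressed bytes of one part of a DOCX package,
     * for example "word/document.xml". Returns -1 if the part is absent.
     */
    static long partSize(InputStream docx, String partName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals(partName)) {
                    continue;
                }
                // Read the entry fully; the size stored in the header may be unknown when streaming.
                long total = 0;
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    total += read;
                }
                return total;
            }
        }
        return -1;
    }
}
```

A call such as DocxPartSizes.partSize(Files.newInputStream(Path.of("in.docx")), "word/document.xml") returns the byte count, which can be logged next to the processing time to see whether the slow documents are the ones with an unusually large body part.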

Thanks for the reply.

The shapes are required for us.
Additionally, sorry for the confusion: we are not converting DOC/DOCX to PDF. We are extracting the text from the document.

Below is my code:

License license = new License();
InputStream streamLicense = licenceStream();
license.setLicense(streamLicense);
LoadOptions opts = new LoadOptions();
opts.setResourceLoadingCallback(new HandleResourceLoadingCallback());
doc = new Document(filedat, opts);
streamLicense.close();
filedat.close();
//getpages
pageCount = doc.getPageCount();

For the shared documents, the time is spent at the last line, i.e. doc.getPageCount().
Could you please check it?

@RChilli_Nidhi doc.getPageCount() rebuilds the document layout, the same operation that is performed when saving to PDF. I understand that the lines are required, but if they were inserted properly as one simple shape, it would not take so much time to process them. About 1MB of XML is far too much for a simple horizontal line, when it can be represented by a simple line shape that takes only a few lines of XML:

<w:drawing>
	<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251659264" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1" wp14:anchorId="10707A79" wp14:editId="7FA4C897">
		<wp:simplePos x="0" y="0"/>
		<wp:positionH relativeFrom="column">
			<wp:posOffset>38100</wp:posOffset>
		</wp:positionH>
		<wp:positionV relativeFrom="paragraph">
			<wp:posOffset>336550</wp:posOffset>
		</wp:positionV>
		<wp:extent cx="5905500" cy="38100"/>
		<wp:effectExtent l="0" t="0" r="19050" b="19050"/>
		<wp:wrapNone/>
		<wp:docPr id="1" name="Straight Connector 1"/>
		<wp:cNvGraphicFramePr/>
		<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
			<a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
				<wps:wsp>
					<wps:cNvCnPr/>
					<wps:spPr>
						<a:xfrm flipV="1">
							<a:off x="0" y="0"/>
							<a:ext cx="5905500" cy="38100"/>
						</a:xfrm>
						<a:prstGeom prst="line">
							<a:avLst/>
						</a:prstGeom>
					</wps:spPr>
					<wps:style>
						<a:lnRef idx="1">
							<a:schemeClr val="dk1"/>
						</a:lnRef>
						<a:fillRef idx="0">
							<a:schemeClr val="dk1"/>
						</a:fillRef>
						<a:effectRef idx="0">
							<a:schemeClr val="dk1"/>
						</a:effectRef>
						<a:fontRef idx="minor">
							<a:schemeClr val="tx1"/>
						</a:fontRef>
					</wps:style>
					<wps:bodyPr/>
				</wps:wsp>
			</a:graphicData>
		</a:graphic>
	</wp:anchor>
</w:drawing>

Or even less if you use a paragraph border:

<w:p w14:paraId="0B9C4D2A" w14:textId="62DA62A8" w:rsidR="002B2CA3" w:rsidRDefault="002B2CA3" w:rsidP="00685195">
	<w:pPr>
		<w:pBdr>
			<w:bottom w:val="single" w:sz="12" w:space="1" w:color="auto"/>
		</w:pBdr>
	</w:pPr>
</w:p>

So, if you have control over document creation, avoid using complicated shapes to draw simple horizontal lines in your documents. This will improve performance in both MS Word and Aspose.Words.
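
If you do not control how the documents are created, one option is to bound the time your pipeline waits for the layout step. The following is only a sketch under stated assumptions: it uses plain JDK concurrency, the class and method names are hypothetical, and cancelling the task merely interrupts the worker thread, so a CPU-bound layout may keep running in the background until it finishes.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Aspose.Words.
class LayoutTimeout {
    /** Runs the task on a worker thread; returns empty if it does not finish within timeoutMillis. */
    static <T> Optional<T> callWithTimeout(Callable<T> task, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "layout-worker");
            worker.setDaemon(true); // do not keep the JVM alive for a stuck task
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // only requests interruption of the worker
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the code from this thread, the call might look like LayoutTimeout.callWithTimeout(() -> doc.getPageCount(), 5000), treating an empty result as "page count unavailable" while still extracting the text.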

Hi,

I am testing a few more documents using Aspose.Words 22.x and getting an invalid page count.
Attaching the documents for your reference.

Can you please check it and let me know the issue?

(Attachment c36aa3f1c592ceedc34612efca6e965e.doc is missing)

(Attachment C.V.BLeinerDec2013c.doc is missing)

(Attachment c6g8iqek6zuier69.doc is missing)

CA_Resume_Analytics.docx (29.6 KB)

File Name                               Aspose Page Count   Actual Page Count
C.V.BLeinerDec2013c.doc                 3                   2
c36aa3f1c592ceedc34612efca6e965e.doc    3                   2
c6g8iqek6zuier69.doc                    4                   3
CA_Resume_Analytics.docx                3                   2

PageCount.zip (68.5 KB)

@RChilli_Nidhi I have checked CA_Resume_Analytics.docx and the returned page count is correct: 2 pages. The other documents were not attached. Please zip them and attach the archive.
Also, please make sure you are using Aspose.Words in licensed mode. In evaluation mode, Aspose.Words injects an evaluation message at the beginning of the document, and this might lead to an incorrect page count.

Sharing the other resumes. Please share your insights!

PageCount.zip (68.5 KB)

@RChilli_Nidhi

  • C.V.BLeinerDec2013c.doc returns 2 pages
  • c6g8iqek6zuier69.doc returns 3 pages
  • c36aa3f1c592ceedc34612efca6e965e.doc returns 2 pages
  • CA_Resume_Analytics.docx returns 2 pages

The number of pages returned by Aspose.Words is correct and matches the number of pages in MS Word. Here is the code I used for testing:

Document doc = new Document(@"C:\Temp\in.docx");
Console.WriteLine(doc.PageCount);

Please note that Aspose.Words needs to build the document layout to calculate the number of pages in the document, and the fonts used in the document are required to do this. If Aspose.Words cannot find them, it substitutes the missing fonts. This might lead to layout differences and, as a result, an incorrect page count. You can implement IWarningCallback to get a notification when font substitution is performed.
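
As a complementary check that does not depend on Aspose.Words, the fonts a DOCX declares can be listed from its word/fontTable.xml part and compared with the fonts installed on the machine. The following is a minimal sketch: the class and method names are hypothetical, it applies to DOCX only (not to the binary DOC files above), and it assumes the text of fontTable.xml has already been extracted from the package.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
class DocxFontNames {
    // Matches each <w:font ... w:name="..."> element; the root <w:fonts> element is not matched.
    private static final Pattern FONT =
            Pattern.compile("<w:font\\b[^>]*\\bw:name=\"([^\"]*)\"");

    /** Returns the distinct font names declared in the text of word/fontTable.xml, in document order. */
    static Set<String> declaredFonts(String fontTableXml) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = FONT.matcher(fontTableXml);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

Names missing from the installed font list are candidates for substitution, although which fonts Aspose.Words actually finds depends on its own font settings, so the warning callback remains the authoritative signal.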

Yes, got your point. Is there a way to test the Aspose document conversion
to verify whether it resolves our concerns or not?
Without a satisfactory result, we cannot move forward.