Corrupt DOCX file drives CPU usage to 100% when trying to Open

BancIntranets · November 6, 2018, 8:42pm

After uploading a .docx document, I attempted to convert it to PDF with Aspose.Words for .NET 18.11.0 using the code included below. However, the attached .docx file brings down the server: CPU usage very quickly reaches 100% and the call doesn’t return (we killed it after 30 minutes).

Aspose.Words.License license = new Aspose.Words.License();
license.SetLicense(licensePath);

//Enters this call and never returns!!!!!!!!!!!!!!!!!!!!!!
Aspose.Words.Document doc = new Aspose.Words.Document(objMemoryStream);

Aspose.Words.Saving.PdfSaveOptions saveOption = new Aspose.Words.Saving.PdfSaveOptions
{
    Compliance = Aspose.Words.Saving.PdfCompliance.PdfA1b,
    SaveFormat = Aspose.Words.SaveFormat.Pdf,
    MemoryOptimization = true
};

doc.Save(tempFilePath, saveOption);
objMemoryStream.Dispose();

awais.hafeez · November 7, 2018, 2:05am

@BancIntranets,

I am afraid we do not see any attachments in this thread. If the file is large, you may upload a ZIP of it to Dropbox or any other file hosting service and share the download link here for testing.

BancIntranets · November 7, 2018, 1:11pm

https://drive.google.com/file/d/142slahowgnv8cdlhk0tuyzf2gb8vrald/view?usp=sharing

awais.hafeez · November 8, 2018, 1:18am

@BancIntranets,

We tested the scenario and managed to reproduce the System.ArgumentException during saving to PDF on our end. We have logged this problem in our issue tracking system as WORDSNET-17730. We will look further into the details of this problem and keep you updated on the status of the fix. We apologize for the inconvenience.

awais.hafeez · November 30, 2018, 4:56am

@BancIntranets,

Regarding WORDSNET-17730, we have completed the analysis of this issue and the root cause has been identified. Please see the following analysis details:

It is a binary file that cannot be opened even in MS Word. It cannot be read or recovered.

As for why it is detected as a text file: Aspose.Words tries to open it as Unicode-encoded text, and MS Word opens it the same way once its extension is renamed to .txt. Opening it takes a long time in MS Word too.

It is hard to define criteria that reliably identify valid Unicode-encoded text. This file starts as ‘栾멭Ų藎溯첻輸✧’ and continues as an endless sequence of symbols, so it appears as a mix of characters from every possible language. Such a mix of Unicode characters still satisfies the criterion of containing few special characters, so Aspose.Words recognizes it as text.

Choosing new criteria for Unicode-encoded files is possible, but it would be better to check whether the file is a valid .docx and, if it is not, prevent it from being loaded.

Put simply, this file is not a DOCX document; it is just a binary chunk, and loading it makes no sense. Could you please share why you are trying to load this file? If you are simply processing documents received from another party, it may be better to detect the file format first and filter out invalid files.
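
As an illustration of the "detect first, then filter" idea, here is a minimal sketch that does not depend on Aspose.Words at all. It assumes a valid DOCX is a ZIP package containing `[Content_Types].xml` and the conventional main part `word/document.xml`; packages that name the main part differently, and password-protected documents (which are not stored as plain ZIP), would be rejected by this check. The class and method names (`DocxSniffer`, `LooksLikeDocx`) are hypothetical and not part of any library.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class DocxSniffer
{
    // Returns true only if the stream is a readable ZIP archive that contains
    // the two parts a typical .docx package is expected to have.
    // Assumption: the main document part uses the conventional name
    // "word/document.xml".
    public static bool LooksLikeDocx(Stream stream)
    {
        if (stream == null || !stream.CanRead || !stream.CanSeek)
            return false;

        long start = stream.Position;
        try
        {
            // leaveOpen: true so the caller can still pass the stream to the loader.
            using (var zip = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true))
            {
                return zip.GetEntry("[Content_Types].xml") != null
                    && zip.GetEntry("word/document.xml") != null;
            }
        }
        catch (InvalidDataException)
        {
            // Not a ZIP archive at all, e.g. an arbitrary binary chunk.
            return false;
        }
        finally
        {
            // Restore the position so a later load starts where the caller expects.
            stream.Position = start;
        }
    }
}
```

With a guard like this, the code from the first post could call `DocxSniffer.LooksLikeDocx(objMemoryStream)` and reject the upload before constructing `Aspose.Words.Document`. Aspose.Words also provides `FileFormatUtil.DetectFileFormat`, but given the analysis above this particular file may be reported as text, so the result would need to be compared against the expected DOCX format rather than merely checked for "unknown".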

awais.hafeez · January 17, 2019, 6:25am

@BancIntranets,

Regarding WORDSNET-17730, we have completed the work on your issue and concluded that we will not be able to implement a fix for it. The issue (WORDSNET-17730) has now been closed with a ‘Won’t Fix’ resolution. Please see my previous post for details.