Aspose.Words.FileCorruptedException : The document appears to be corrupted and cannot be loaded

Hi,

I’m seeing the following when attempting to convert from .docx to PDF using C#:

Aspose.Words.FileCorruptedException: The document appears to be corrupted and cannot be loaded.

The same logic works fine with an older Word format (.doc) file.

I’m using Aspose.PDF 23.3.1 and Aspose.Words 23.4.0 NuGet packages.

The docx file itself contains just a couple of words of text.

Inner exception as follows:

    Aspose.Words.FileCorruptedException
      HResult=0x80131500
      Message=The document appears to be corrupted and cannot be loaded.
      Source=Aspose.Words
      StackTrace:
       at Aspose.Words.Document.(Stream , LoadOptions )
       at Aspose.Words.Document.(Stream , LoadOptions )
       at Aspose.Words.Document..ctor(Stream stream, LoadOptions loadOptions)
    [solution-specific stack trace clipped]

      This exception was originally thrown at this call stack:
        [External Code]

    Inner Exception 1:
       :   ZipEntry::ReadDirEntry(): Bad signature (0x00000000) at position 0x291BFFFC

I have tried multiple .docx files to rule out the file itself being corrupted.

Using Office 2016.

The code is throwing an exception as follows:

var wordDocFromMemoryStream = new Aspose.Words.Document(new MemoryStream(wordDoc),
    new LoadOptions()
    {
        Encoding = Encoding.UTF8,
        LoadFormat = LoadFormat.Docx,
        MswVersion = MsWordVersion.Word2016

    });

I get the same error with no LoadOptions passed in.

Any help greatly appreciated!

Thanks,

Will

@WillJames Could you please attach your source document (.docx or .doc) and, since you are using two APIs, a full code example?

Hi,

Thanks for taking the time to reply. I have updated the post with a code sample. It’s the Aspose.Words.Document instantiation that’s failing.

test2.docx (11.1 KB)

@WillJames Sorry, I cannot reproduce your issue. This is the code I used (a simple conversion to PDF):

FileStream f = File.OpenRead("C:\\Temp\\input.docx");

Document doc = new Document(f, new LoadOptions
{
    LoadFormat = LoadFormat.Docx,
    WarningCallback = new ConversionIssueCallBack(),
    MswVersion = MsWordVersion.Word2016,
    Encoding = Encoding.UTF8,
});

var opt = new PdfSaveOptions()
{
    SaveFormat = SaveFormat.Pdf,
};

doc.Save("C:\\Temp\\output.pdf", opt);

Thank you for trying. I am trying to process uploads via a website rather than a file system operation.

The NUnit integration test I use to mimic this behavior is shown below. It requires a “test2.docx” Word document in the same folder as the test, with the file's properties set to Embedded Resource / Copy Always (in Visual Studio: Solution Explorer > right-click the file > Properties):

[Test]
public void Can_convert_word_doc()
{
    byte[] wordDoc;
    var assembly = Assembly.GetExecutingAssembly();
    var resourceNames = assembly.GetManifestResourceNames();
    string resourceName = resourceNames.Single(str => str.EndsWith("test2.docx"));

    using (Stream stream = assembly.GetManifestResourceStream(resourceName))
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            var result = reader.ReadToEnd();
            wordDoc = Encoding.UTF8.GetBytes(result);
        }
    }

    var pdf = new PDF();
    pdf.Name = "Foo";
    pdf.AddWord(wordDoc);

    var output = pdf.FinalFile();
    Assert.That(output, Is.Not.Null);
}

And the AddWord(parameter) method is as follows:

public void AddWord(byte[] wordDoc)
{
    var wordDocFromMemoryStream = new Aspose.Words.Document(new MemoryStream(wordDoc),
        new LoadOptions()
        {
            Encoding = Encoding.UTF8,
            LoadFormat = LoadFormat.Docx,
            MswVersion = MsWordVersion.Word2016

        });

    var pdfMemoryStream = new MemoryStream();

    wordDocFromMemoryStream.Save(pdfMemoryStream, SaveFormat.Pdf);

    //ref https://docs.aspose.com/pdf/net/change-page-size/
    var pdfDocument = new Aspose.Pdf.Document(pdfMemoryStream);
    var pageCollection = pdfDocument.Pages;
    var pdfPage = pageCollection[1];

    //in points; 72 points = 1 inch. std letter is 11x8.5"
    const int standardLetterHeightInPoints = 792;
    const int standardLetterWidthInPoints = 612;

    pdfPage.SetPageSize(standardLetterHeightInPoints, standardLetterWidthInPoints);
    pdfDocument.Save(pdfMemoryStream);

    Add(pdfMemoryStream.ToArray());
}

The test currently fails at the point of calling the AddWord method as previously described.

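@WillJames The way you convert the Stream into a byte array (byte[]) is not correct. StreamReader decodes the content as text, and re-encoding that text with Encoding.UTF8.GetBytes does not reproduce the original bytes of a binary file such as a .docx (which is a ZIP archive). Please try this code instead: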

byte[] wordDoc;
using (Stream f = File.OpenRead("C:\\Temp\\input.docx"))
{
    wordDoc = ReadFully(f);
}

MemoryStream docStream = new MemoryStream(wordDoc);

Document doc = new Document(docStream, new LoadOptions
{
    LoadFormat = LoadFormat.Docx,
    WarningCallback = new ConversionIssueCallBack(),
    MswVersion = MsWordVersion.Word2016,
    Encoding = Encoding.UTF8,
});

var opt = new PdfSaveOptions()
{
    SaveFormat = SaveFormat.Pdf,
};

doc.Save("C:\\Temp\\output.pdf", opt);

// Copies the stream into a byte array without any text decoding.
public static byte[] ReadFully(Stream input)
{
    // Fixed-size buffer; input.Length is not available on non-seekable streams.
    byte[] buffer = new byte[81920];
    using (MemoryStream ms = new MemoryStream())
    {
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            ms.Write(buffer, 0, read);
        }
        return ms.ToArray();
    }
}
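
To illustrate why the text round trip damages the file, here is a minimal sketch that does not use Aspose at all. The six sample bytes are made up for the demonstration (the first four match the start of a ZIP local file header, the last two are not valid UTF-8), and the class name is hypothetical. It performs roughly the same decode and re-encode that StreamReader.ReadToEnd() followed by Encoding.UTF8.GetBytes() performs:

```csharp
using System;
using System.Linq;
using System.Text;

class Utf8RoundTripDemo
{
    static void Main()
    {
        // "PK\x03\x04" followed by two bytes that are never valid in UTF-8.
        byte[] original = { 0x50, 0x4B, 0x03, 0x04, 0xFF, 0xFE };

        // Decode the bytes as UTF-8 text, then encode the text back to bytes.
        string asText = Encoding.UTF8.GetString(original);
        byte[] roundTripped = Encoding.UTF8.GetBytes(asText);

        // Invalid sequences were replaced during decoding, so the arrays differ.
        Console.WriteLine(original.SequenceEqual(roundTripped)); // False
        Console.WriteLine($"{original.Length} bytes in, {roundTripped.Length} bytes out");
    }
}
```

A real .docx contains many such byte sequences, so the array handed to the Document constructor is no longer a valid ZIP archive, which would be consistent with the "Bad signature" inner exception.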

Additionally, you can pass the Stream directly to the Document constructor, with no need to convert it into a byte array:

using (Stream g = File.OpenRead("C:\\Temp\\input.docx"))
{
    Document doc = new Document(g, new LoadOptions
    {
        LoadFormat = LoadFormat.Docx,
        WarningCallback = new ConversionIssueCallBack(),
        MswVersion = MsWordVersion.Word2016,
        Encoding = Encoding.UTF8,
    });

    var opt = new PdfSaveOptions()
    {
        SaveFormat = SaveFormat.Pdf,
    };

    doc.Save("C:\\Temp\\output.pdf", opt);
}
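
For the embedded-resource test in particular, the same idea might look like the sketch below. EmbeddedResource and ReadAllBytes are hypothetical names introduced here for illustration, not part of Aspose or the original test. It assumes exactly one embedded resource name ends with the given suffix and copies the resource stream byte for byte with Stream.CopyTo:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Reflection;

static class EmbeddedResource
{
    // Hypothetical helper: returns the raw bytes of the single embedded
    // resource whose name ends with nameSuffix.
    public static byte[] ReadAllBytes(Assembly assembly, string nameSuffix)
    {
        string resourceName = assembly.GetManifestResourceNames()
            .Single(name => name.EndsWith(nameSuffix, StringComparison.OrdinalIgnoreCase));

        using (Stream stream = assembly.GetManifestResourceStream(resourceName))
        using (var buffer = new MemoryStream())
        {
            // Byte-for-byte copy; no StreamReader and no text encoding involved.
            stream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }
}
```

In the test, the StreamReader block could then be replaced with something like byte[] wordDoc = EmbeddedResource.ReadAllBytes(Assembly.GetExecutingAssembly(), "test2.docx"); before calling pdf.AddWord(wordDoc).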