Split larger document using bytes c#

pkumar1826 · January 3, 2025, 4:57pm

Hi,

We are currently using the trial version of “Aspose.Total for .NET” and trying to split a document from memory stream contents of about 80 MB. Aspose takes a long time and sometimes gets stuck in processing.
Could you please help us do this efficiently? Our client wants us to split documents even when they are larger than 300 MB.

Thanks

Professionalize.Discourse · January 3, 2025, 4:58pm

@pkumar1826

To effectively split a large document using Aspose.Words for .NET, especially when dealing with large file sizes (like 300MB or more), you can utilize the ExtractPages method or split by sections or headings. Here’s a general approach you can follow:

Splitting a Document by Pages

If you want to split the document page by page, you can use the ExtractPages method. Here’s a code example demonstrating how to do this:

using Aspose.Words;

Document doc = new Document("path/to/your/document.docx");
int pageCount = doc.PageCount;

for (int i = 0; i < pageCount; i++)
{
    Document pageDoc = doc.ExtractPages(i, 1);
    pageDoc.Save($"output/page_{i + 1}.docx");
}

Splitting a Document by Sections

If your document is structured with sections, you can split it based on those sections. Here’s how you can do that:

using Aspose.Words;

Document doc = new Document("path/to/your/document.docx");
for (int i = 0; i < doc.Sections.Count; i++)
{
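    Document sectionDoc = new Document();
    // A new Document starts with one empty section; remove it so only the imported section remains.
    sectionDoc.RemoveAllChildren();
    sectionDoc.AppendChild(sectionDoc.ImportNode(doc.Sections[i], true));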
    sectionDoc.Save($"output/section_{i + 1}.docx");
}

Tips for Handling Large Documents

Memory Management: Ensure that your application has sufficient memory allocated, as processing large documents can be memory-intensive.
Performance Optimization: Consider processing the document in smaller chunks or using asynchronous processing if applicable.
Testing: Test with smaller documents first to ensure your logic works before scaling up to larger files.
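
As a rough illustration of the "smaller chunks" tip, the sketch below computes zero-based (index, count) pairs that could be passed to ExtractPages(index, count) so that several pages are extracted per call instead of one. The helper name PageChunks and the chunk size are hypothetical, and the code uses only the .NET base library (written as C# 9+ top-level statements); it is a sketch under those assumptions, not an official Aspose sample.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: yields zero-based (index, count) chunks that together
// cover pageCount pages, with at most chunkSize pages per chunk.
// Being an iterator, it validates its arguments when enumeration starts.
static IEnumerable<(int Index, int Count)> PageChunks(int pageCount, int chunkSize)
{
    if (pageCount < 0) throw new ArgumentOutOfRangeException(nameof(pageCount));
    if (chunkSize < 1) throw new ArgumentOutOfRangeException(nameof(chunkSize));

    for (int index = 0; index < pageCount; index += chunkSize)
    {
        // The last chunk may be shorter than chunkSize.
        yield return (index, Math.Min(chunkSize, pageCount - index));
    }
}

// Usage: with 23 pages and chunks of 10, three calls would be needed.
foreach (var (index, count) in PageChunks(23, 10))
{
    // In real code this is where doc.ExtractPages(index, count) would be called.
    Console.WriteLine($"ExtractPages({index}, {count})");
}
// prints:
// ExtractPages(0, 10)
// ExtractPages(10, 10)
// ExtractPages(20, 3)
```

Whether larger chunks actually help depends on the document, so the chunk size is worth measuring rather than assuming.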

By following these methods, you should be able to split large documents effectively without getting stuck during processing. If you encounter performance issues, consider optimizing your code or reviewing the memory usage of your application.

For more detailed information, refer to the official documentation on splitting documents using Aspose.Words [1].

Sources:
[1]: Split a Document in C#|Aspose.Words for .NET - Aspose Documentation

amjad.sahi · January 3, 2025, 5:50pm

@pkumar1826,

If you continue to experience any performance problems, please share sample documents and code snippets. We will evaluate and look into your issue soon.

pkumar1826 · January 3, 2025, 6:06pm

Thanks for the quick reply.

We are trying to split only PDF documents for now.

Here is the code we are using.

    public async Task<MemoryStream> SplitDocumentByPageRangeAsync(Stream inputStream, string fileExt, string pageRange)
    {
        try
        {

            fileExt = FormatFileExtension(fileExt);

            inputStream = await Convert2PdfAsync(inputStream, fileExt);

            var processedDocument = new MemoryStream();

            await Task.Run(() =>
            {
                using (var memoryStream = new MemoryStream())
                {
                    inputStream.CopyTo(memoryStream);
                    memoryStream.Position = 0;
                    using (var pdfDocument = new Aspose.Pdf.Document(memoryStream))
                    {
                        var pageRangeArray = pageRange.Split(',');
                        var extractedPdf = new Aspose.Pdf.Document();

                        foreach (var range in pageRangeArray)
                        {
                            var pageNumbers = range.Split('-');
                            int startPage = int.Parse(pageNumbers[0]);
                            int endPage = pageNumbers.Length > 1 ? int.Parse(pageNumbers[1]) : startPage;

                            if (startPage < 1 || endPage > pdfDocument.Pages.Count)
                            {
                                throw new ArgumentOutOfRangeException(nameof(pageRange), $"Page range {range} is out of bounds.");
                            }

                            for (int i = startPage; i <= endPage; i++)
                            {
                                extractedPdf.Pages.Add(pdfDocument.Pages[i]);
                            }
                        }

                        extractedPdf.Save(processedDocument);
                    }
                }
            });

            processedDocument.Position = 0;
            return processedDocument;
        }
        catch (Exception ex)
        {
            throw new InvalidOperationException("Failed to split the document by page range asynchronously.", ex);
        }
    }

amjad.sahi · January 3, 2025, 6:13pm

@pkumar1826,

Thank you for sharing the code snippet.

I am moving your thread to the appropriate forum, where a member of our Aspose.PDF team will review your issue and provide assistance as needed soon.

asad.ali · January 3, 2025, 10:29pm

@pkumar1826

If possible, as requested earlier, could you please also share a sample document for our reference? We will test the scenario in our environment and address it accordingly.

pkumar1826 · January 6, 2025, 10:40am

Sorry, the document contains confidential information, so I cannot share it.

But could you please look at the code and correct us if needed?

asad.ali · January 6, 2025, 2:25pm

@pkumar1826

The code snippet looks fine. However, please try using FileStream instead of MemoryStream, because MemoryStream has size limitations in .NET (it keeps the whole content in a single in-memory buffer). You can also try saving the files to local disk temporarily and removing them once the whole process is done.
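
As a rough illustration of the FileStream suggestion, the sketch below copies an incoming stream to a temporary file and hands back a FileStream that deletes the file when it is disposed. The helper name StageToTempFile is hypothetical, and the code uses only System.IO (written as C# 9+ top-level statements); it is a minimal sketch under those assumptions, not an official Aspose sample.

```csharp
using System;
using System.IO;

// Hypothetical helper: stages the input in a temp file instead of a MemoryStream.
// FileOptions.DeleteOnClose removes the file when the returned stream is disposed.
static FileStream StageToTempFile(Stream input)
{
    string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
    var staged = new FileStream(
        tempPath,
        FileMode.CreateNew,
        FileAccess.ReadWrite,
        FileShare.None,
        81920, // copy buffer size in bytes
        FileOptions.DeleteOnClose);
    try
    {
        input.CopyTo(staged); // copies in buffered chunks, not one large array
        staged.Position = 0;  // rewind so the consumer reads from the start
        return staged;
    }
    catch
    {
        staged.Dispose();     // also deletes the partially written temp file
        throw;
    }
}

// Usage with a tiny in-memory input standing in for the uploaded document.
using (FileStream staged = StageToTempFile(new MemoryStream(new byte[] { 1, 2, 3 })))
{
    Console.WriteLine(staged.Length); // prints 3
    // In the code from this thread, this stream would replace the MemoryStream
    // passed to new Aspose.Pdf.Document(...).
}
```

The same approach could be applied to the output side by saving the extracted document to a temporary FileStream rather than a MemoryStream.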