PDF to HTML on Linux “out of memory” exception for large PDF

Hello, I am using Aspose.Pdf v23.8 to convert PDF document pages to HTML. It works well for small PDFs, but on Linux the conversion fails for PDFs larger than about 2 MB. On Windows it works for files up to 10 MB. The program runs on two operating systems, Windows Server and Linux:
On Windows it works, but it is slow for large files (possibly the conversion time).
On Linux, no conversion is performed and I’m receiving this error message:
#=zn4CQaS8b34$LPJ21k6xhFz0= : #=zPifGal1Qtm2zYhxug64QbiU= : #=zPifGal1Qtm2zYhxug64QbiU= : #=zPifGal1Qtm2zYhxug64QbiU= : #=zV8CdtxWzJM3JXcfNGXTcCOQ= : Out of memory.[ at System.Drawing.Region.Intersect(Region region)\n at #=z0dax3_$wJq$ay9D2uO08ciS5ZJ5j3_mbgQ==.#=zhYFs4CmG0_rk$PeqDcjOQg4=.#=zggRXows=(#=zV8CdtxWzJM3JXcfNGXTcCOQ= #=z41Bi5l0=)\n at #=z0dax3_$wJq$ay9D2uO08ciS5ZJ5j3_mbgQ==.#=zhYFs4CmG0_rk$PeqDcjOQg4=.#=zZojGFcI=(#=zV8CdtxWzJM3JXcfNGXTcCOQ= #=z41Bi5l0=)\n at #=zV8CdtxWzJM3JXcfNGXTcCOQ=.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)][ at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=zPifGal1Qtm2zYhxug64QbiU=.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)][ at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=zPifGal1Qtm2zYhxug64QbiU=.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)][ at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=zPifGal1Qtm2zYhxug64QbiU=.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)][ at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=zn4CQaS8b34$LPJ21k6xhFz0=.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)\n at #=z1eqRsEJ1x4JRYUayM2_sieWpLa1I.#=z$nuGS2E=(#=zk$YpfyUoovnag5g7Z7OdNAw57IvC #=z1GrA8Rg=)]
Inner Ex:
StackTrace: at #=z9rO4fxU0VRZIaPnXAMPxKJ8=.#=zh_2V_iM=()\n at #=zXkvUUfH2nLhfz0YFoz8H5YQ=.#=z0vw6Q1fGarfYoxDSHQ==(Document #=zf7Xyum8=, #=zYb23bGC_zi7C #=z6fSixRs5hs7G, String #=zXW3y5VjWu0cbKgqvRA==, Stream #=zoaqXAMkBnjw$, HtmlSaveOptions #=zGTQStMo=)\n at #=zXkvUUfH2nLhfz0YFoz8H5YQ=.#=zu9cODDg=(Document #=zf7Xyum8=, String #=zXW3y5VjWu0cbKgqvRA==, Stream #=zoaqXAMkBnjw$, HtmlSaveOptions #=zGTQStMo=)\n at Aspose.Pdf.Document.#=zEf57q6V1seTN(Stream #=zu5ITKWRjt_FT, SaveOptions #=zGTQStMo=)\n at Aspose.Pdf.Document.Save(Stream outputStream, SaveOptions options)\n

How much RAM is required to convert a 10 MB file on Linux? Is there an alternative way to fix the issue?

What does the encoded error message mean?

Below is the code snippet used to convert the PDF to single-page HTML.
var pdf = new Document(inputFile);
var conversionOptions = new HtmlSaveOptions(HtmlDocumentType.Html5, true)
{
    PartsEmbeddingMode = PartsEmbeddingModes.EmbedAllIntoHtml,
    LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss,
    RasterImagesSavingMode = RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground
};
// To save as stream
using (var outStream = System.IO.File.Create(outputFilePath))
{
    pdf.Save(outStream, conversionOptions);
    outStream.Position = 0;
    var reader = new StreamReader(outStream);
    var convertedHtml = reader.ReadToEnd();
    inputFile.Close();
    outStream.Close();
    response = convertedHtml;
}

I also tried another option, using a memory stream. Below is the code snippet.
MemoryStream outStream = new();
var pdf = new Document(new MemoryStream(convertRequest.FileData));
var conversionOptions = new HtmlSaveOptions(HtmlDocumentType.Html5, true)
{
    PartsEmbeddingMode = PartsEmbeddingModes.EmbedAllIntoHtml,
    LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss,
    RasterImagesSavingMode = RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground
};
// To save as stream
pdf.Save(outStream, conversionOptions);
outStream.Position = 0;
var reader = new StreamReader(outStream);
var convertedHtml = reader.ReadToEnd();
response = convertedHtml;

But I am facing the same out-of-memory issue. Please help me resolve it.

@Vijayalakshmisridharan

Instead of Aspose.PDF for .NET, would you please try Aspose.Pdf.Drawing? It is built to work in both Windows and Linux-like environments without a dependency on System.Drawing. You can uninstall Aspose.PDF from the project and install Aspose.Pdf.Drawing instead. If the issue persists, please make sure that all Windows fonts are also installed on the system, e.g. via the msttcorefonts package.

Please share your sample document with us in case none of the above suggestions work. We will proceed accordingly.

1. Yes, using Aspose.Pdf.Drawing resolved the issue, but the conversion to HTML takes very long. Is there any way to optimize it?
2. What is the difference between Aspose.PDF and Aspose.Pdf.Drawing?
3. Will the total license also cover the Aspose.Pdf.Drawing package?

@Vijayalakshmisridharan

Can you please share your sample PDF with us, along with the time taken by the API? We will log an investigation ticket and share the ID with you. Please also share the complete environment details.
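
For the timing figure, a minimal sketch of one way to measure it is given here. The `ConversionTiming` class and its `Measure` method are hypothetical names used only for illustration and are not part of Aspose.PDF; the sketch relies only on `System.Diagnostics.Stopwatch`.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not an Aspose API): times a single run of any action.
public static class ConversionTiming
{
    // Runs the action once and returns the wall-clock time it took.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Assuming the `pdf`, `outStream` and `conversionOptions` variables from the snippets in this thread, the call might look like `var elapsed = ConversionTiming.Measure(() => pdf.Save(outStream, conversionOptions));`, after which `elapsed.TotalSeconds` can be reported.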

Aspose.Pdf.Drawing is similar to Aspose.PDF, except that it has no dependency on System.Drawing; it uses Aspose.Drawing instead. It was launched as a beta version specifically for testing in non-Windows environments like Linux. We will make it a permanent part of Aspose.PDF once our testing is complete.

Meanwhile, you will be able to keep using it with your existing license as well as your existing code, because it functions similarly to Aspose.PDF for .NET.

Hi,

When is Aspose.Pdf.Drawing expected to be in production? Is there an expected date for completing the testing phase?

The given sample file of 9.6 MB was converted to 18.9 MB of HTML in 6.4 minutes. Please let me know if you have any options to optimize the conversion time. I have updated the logic we currently use to convert PDF to HTML in C#.

Environment details:

Linux OS

containerCpu: 4GB,

containerMemory: 16GB

relx-2022-annual-report.zip (6.06 MB)

@Vijayalakshmisridharan

Sadly, we do not have any timeline defined for this goal. The work is in progress and we will announce it once it is merged with Aspose.PDF for .NET. Meanwhile, as shared earlier, you can keep using it for as long as you need.

Would you kindly share the updated code snippet for our reference as well? We will log an investigation task with updated information and share the ID with you.

Please find below the logic used to convert PDF to HTML:

// Get the file as IFormFile and convert to byte[]
byte[] data;
using (var ms = new MemoryStream())
{
    fileUploadRequest.File?.CopyTo(ms);
    data = ms.ToArray();
}
convertRequest.FileData = data;

MemoryStream outStream = new();
var pdf = new Document(new MemoryStream(convertRequest.FileData));
var conversionOptions = new HtmlSaveOptions(HtmlDocumentType.Html5, true)
{
    PartsEmbeddingMode = PartsEmbeddingModes.EmbedAllIntoHtml,
    LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss,
    RasterImagesSavingMode = RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground
};
pdf.Save(outStream, conversionOptions);
outStream.Position = 0;
var reader = new StreamReader(outStream);
var convertedHtml = reader.ReadToEnd();
outStream.Close();
response = convertedHtml;
// Saving it as a string for rendering
fileStatus.ConvertedHTML = convertedHtml;

@Vijayalakshmisridharan

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55907

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.