JpegDevice.Process Sometimes Garbles Text

We have begun seeing cases in which some of the text on a PDF page comes out garbled when we use JpegDevice.Process to convert the page to an image. I have attached an example document in which most of the text on page 7 is affected. The method called for each page of the document is provided below. In case it matters, we always pass in a “scalePercent” of 70, and we call this method from within a Parallel.ForEach(…).


If you can reproduce the issue, could you suggest a change to my code so that all the text on every page image renders correctly?

Thank you very much.

(here’s my code)

public byte[] ProcessJPG(Aspose.Pdf.Page page, int scalePercent)
{
    var imageConverter = new JpegDevice(new Resolution(200), 60);
    using (var ms = new MemoryStream())
    {
        imageConverter.Process(page, ms);
        var resized = new Bitmap(ms).ScaleByPercent(scalePercent);
        return ImageToByteArray(resized, System.Drawing.Imaging.ImageFormat.Jpeg);
    }
}

Hi,


Thank you for contacting support. There is no ScaleByPercent method in the Bitmap class. Kindly share the complete code for your use case, including the assembly references, so that we can replicate the problem in our environment. We will investigate and share our findings with you. Your response is awaited.

We have converted source PDF pages to JPEG images with the latest version 17.5 of Aspose.Pdf for .NET API (without the call of ScaleByPercent method) and did not find any garbled text problem on page seven. We have attached an image of page seven to this reply.

I’ve found that ScaleByPercent, an extension method created a couple of years ago, is no longer necessary. As such, I’ve simply commented out the call and repeated my test:


public byte[] ProcessJPG(Aspose.Pdf.Page page, int scalePercent)
{
    var imageConverter = new JpegDevice(new Resolution(200), 60);
    using (var ms = new MemoryStream())
    {
        imageConverter.Process(page, ms);
        var resized = new Bitmap(ms); // .ScaleByPercent(scalePercent);
        return ImageToByteArray(resized, System.Drawing.Imaging.ImageFormat.Jpeg);
    }
}

Even with this method removed, I still get the occasional garbled text. I’ve also attached a second PDF whose first two pages render completely garbled (will attach after posting this quick reply).

I thought it might also help to show you the calling methods, just in case there’s something in there that might be causing issues. I’ve listed the methods below in reverse order, starting with the method that calls the code above:

public byte[] ProcessJPG(Page page, int scalePercent)
{
    return _pdfImages.ProcessJPG(page, scalePercent);
}

private void ProcessConvertedPage(DocumentMetaData documentMetaData, Page page, int pageNumber)
{
    var firstImage = _documentConverter.ProcessJPG(page, 70);
    StoreImageFile(documentMetaData, firstImage, pageNumber);
}

// relevant portion of a larger method
using (var document = new Aspose.Pdf.Document(new MemoryStream(documentMetaData.Original.Bytes)))
{
    Parallel.ForEach(GetPdfPages(document, startPage, endPage), (pageTuple) =>
    {
        ProcessConvertedPage(documentMetaData, pageTuple.Item1, pageTuple.Item2);
    });
}

// SUPPORTING method called from PORTION of larger method above
private IEnumerable<Tuple<Page, int>> GetPdfPages(Aspose.Pdf.Document document, int startPage, int endPage)
{
    var maxPage = Math.Min(document.Pages.Count, endPage);
    for (var i = startPage; i <= maxPage; i++)
    {
        yield return new Tuple<Page, int>(document.Pages[i], i);
    }
}

Hopefully this additional call stack code will help in determining why I’m still getting garbled text on some pages.
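
Since the pages are rendered from within a Parallel.ForEach over a single shared Document, one way to rule out concurrency as a factor would be to render the same pages one at a time on a single thread and compare the output. The following is a minimal sketch under that assumption, not a confirmed fix. The class and method names (SequentialPageRenderer, RenderPages) are hypothetical; the Aspose.Pdf calls are only the ones already shown in this thread, with the same resolution (200) and quality (60).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

// Hypothetical helper for a diagnostic run; not part of the original application.
public static class SequentialPageRenderer
{
    // Renders pages startPage..endPage (1-based, inclusive) one at a time
    // on the calling thread and returns the JPEG bytes for each page.
    public static List<byte[]> RenderPages(byte[] pdfBytes, int startPage, int endPage)
    {
        var images = new List<byte[]>();

        using (var input = new MemoryStream(pdfBytes))
        using (var document = new Document(input))
        {
            var maxPage = Math.Min(document.Pages.Count, endPage);
            for (var i = startPage; i <= maxPage; i++)
            {
                // Same settings as ProcessJPG: 200 DPI, JPEG quality 60.
                var device = new JpegDevice(new Resolution(200), 60);
                using (var output = new MemoryStream())
                {
                    device.Process(document.Pages[i], output);
                    images.Add(output.ToArray());
                }
            }
        }

        return images;
    }
}
```

If the sequential output shows the same overlapping words, the parallel loop is probably not the cause and can be left as it is.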

Here is the additional file I referred to in my reply above.

Perhaps if it continues to work on your side, you could send me some sample code you’re using to make it work? I could compare that code with my entire call stack, and try to pinpoint any place I might be doing something different?


Thanks

I missed sending you the ImageToByteArray method called in the return statement of ProcessJPG:


public static byte[] ImageToByteArray(Image image, System.Drawing.Imaging.ImageFormat format)
{
    using (var stream = new MemoryStream())
    {
        image.Save(stream, format);
        return stream.ToArray();
    }
}

I’ve also noticed that the text isn’t simply garbled junk. It actually looks as though words from the PDF page are shifted sideways a bit, so that they lie on top of one another.

Thanks

Hi,


Thank you for the details. We are working on your query and will get back to you soon.

Hi,


Thank you for being patient. You have shared the source PDFs; please also share a snapshot that highlights the garbled text. We have converted your source PDF pages with the following code snippet:

[.NET, C#]
Document pdfDocument = new Document(@"C:\Pdf\test88\DotLoop+Example_UpdatedResidential+Resale+Real+Estate+Purchase+Contract+TRID+(AAR)+(1).pdf");
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    using (FileStream imageStream = new FileStream(@"C:\Pdf\test88\image" + pageCount + "_out" + ".jpg", FileMode.Create))
    {
        Resolution resolution = new Resolution(200);
        JpegDevice jpegDevice = new JpegDevice(resolution, 60);
        // Convert a particular page and save the image to stream
        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
        // Close stream
        imageStream.Close();
    }
}

We have attached a Zip of output images to this reply. If this does not help, please create a small project that reproduces the problem in your environment and send us a Zip of it. We will investigate and share our findings with you. Note that the DocumentMetaData class and the StoreImageFile method are not defined in the code you provided.

Here’s a small snapshot of a portion of the first page of the document I attached yesterday:


Image 2017-06-07 at 9.20.09 AM

You can see all the letters, but it appears words are sitting on top of one another in the page image.

I’ll go through your code now and make sure mine matches as closely as possible…hopefully exactly.

Thank you

Also, I know it likely doesn’t help much, but I used your code to replace the code in my method, plus I added code to write out both the image of the page, as well as the page (both attached). Here’s my code with your code in it:


public byte[] ProcessJPG(Aspose.Pdf.Page page, int scalePercent)
{
    Resolution resolution = new Resolution(200);
    JpegDevice jpegDevice = new JpegDevice(resolution, 60);
    using (var imageStream = new MemoryStream())
    {
        // temp
        Document doc = new Document();
        doc.Pages.Add(page);
        doc.Save(@"C:\temp\pages\page_" + page.Number + ".pdf");
        // temp

        jpegDevice.Process(page, imageStream);

        // temp
        File.WriteAllBytes(@"C:\temp\pages\page_" + page.Number + ".jpeg", imageStream.ToArray());
        // temp

        return imageStream.ToArray();
    }
}

When you look at the files created from the code (page_1.jpg and page_1.pdf), they show the problem I’m experiencing.

Seems like it has to be an environment difference. I’m running this code locally and stepping through it, so if there’s anything you can think of I should look at, let me know.
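
One environment difference that might be worth comparing is the set of fonts installed on each machine. This is an assumption, not something confirmed in this thread: if a PDF relies on a font that is not embedded and is missing on one machine, a renderer generally has to substitute another font, and different glyph widths could plausibly produce overlapping words like the ones described. A minimal sketch for listing the installed fonts, using only System.Drawing (the class name InstalledFontLister is hypothetical):

```csharp
using System;
using System.Drawing.Text;
using System.Linq;

// Hypothetical diagnostic tool: prints the font families installed on this machine.
public static class InstalledFontLister
{
    public static void Main()
    {
        using (var fonts = new InstalledFontCollection())
        {
            // Sort with an ordinal comparer so lists from two machines diff cleanly.
            var names = fonts.Families
                .Select(family => family.Name)
                .OrderBy(name => name, StringComparer.Ordinal);

            foreach (var name in names)
            {
                Console.WriteLine(name);
            }
        }
    }
}
```

Running this on both machines and diffing the two lists would show whether any fonts are present on one and missing on the other.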

Thanks

Hi,


Thank you for sharing a snapshot. We can find garbled text in the output images that we shared in our earlier post. We have logged this issue under the ticket ID PDFNET-42879 in our bug tracking system. We have linked your post to this ticket and will keep you informed of any updates. We are sorry for the inconvenience caused.

Would you have any estimate as to when this issue (PDFNET-42879) might be addressed?


Might there be a workaround I could try?

Thank you.

Hi,


Thank you for being patient. We have re-evaluated your ticket ID PDFNET-42879 with the latest version 17.6 of Aspose.Pdf for .NET API and could not find the garbled text in the output images. We have attached a Zip of output images to this reply. Kindly check and let us know if you can see garbled text in the output images. Your response is awaited.

I don’t see any attachments?

Hi,


We are sorry for missing an attachment. Please refer to this download link: OutputImages17.6Zip

I rebuilt my application with v17.6, but still get the same two pages of offset text. We’ll likely just switch to a different imaging tool, as there is obviously something different between our run environment (AWS EC2 / Beanstalk) and yours.


I appreciate all the assistance with this issue.

Hi,


Thank you for the confirmation. We have added your environment details to the ticket ID PDFNET-42879 in our bug tracking system and will notify you once it is fixed. We will be happy to assist you further in the future and thanks for checking out our API.