Using OCR on an image generated from a PDF returns a nonsensical result

mrawesome · May 11, 2015, 3:42pm

Not sure where to post this or whether I should make two posts. We are currently trying to convert PDFs to CSV. Some PDFs convert without any problems; however, some just return an empty string (Problem 1). As a workaround, we have attempted to convert the PDF into a JPEG and then run OCR on it. The OCR returns a random string (Problem 2). Any help would be appreciated. Thanks in advance.

Code:

public HttpResponseMessage GetText(string path)
{
    var fileInfo = new FileInfo(path);
    if (!fileInfo.Exists)
    {
        Log.Warning(string.Format("Method: GetImage(string). File not found : {0}", path));
        return null;//this.NotFound();
    }

    //open document
    var pdfDocument = new Document(path);
    var excelsave = new ExcelSaveOptions { MinimizeTheNumberOfWorksheets = true };
    var exceldoc = new Aspose.Cells.Workbook();
    using (var stream = new MemoryStream())
    {
        pdfDocument.Flatten();
        pdfDocument.Save(stream, excelsave);
        exceldoc = new Aspose.Cells.Workbook(stream);
    }

    if (!IsPdf(exceldoc))
        exceldoc = OcrPdf(pdfDocument);// throw new Exception("FIle cannot be parsed.");

    using (var returnStream = new MemoryStream())
    {
        exceldoc.Save(returnStream, Aspose.Cells.SaveFormat.CSV);
        returnStream.Seek(0, 0);
        var result = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new ByteArrayContent(returnStream.GetBuffer())
        };
        result.Content.Headers.ContentDisposition = new System.Net.Http.Headers.ContentDispositionHeaderValue("attachment")
        {
            FileName = "my.csv"
        };
        result.Content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
        return result;
    }
}

private Workbook OcrPdf(Document pdfDocument)
{
    var ocr = new OcrEngine();
    var sb = new StringBuilder();
    for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
    {
        using (var imageStream = new MemoryStream())
        {
            var resolution = new Resolution(300);
            var jpegDevice = new JpegDevice(Convert.ToInt32(pdfDocument.Pages[pageCount].PageInfo.Width), Convert.ToInt32(pdfDocument.Pages[pageCount].PageInfo.Height), resolution, 100);
            jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
            ocr.Image = ImageStream.FromMemoryStream(imageStream, ImageStreamFormat.Jpg);
            if (ocr.Process())
            {
                sb.Append(ocr.Text);
                sb.Append(Environment.NewLine);
            }
            imageStream.Close();
        }
    }
    return null;
}

private bool IsPdf(Workbook exceldoc)
{
    return exceldoc.Worksheets[0].Cells[0, 0].Value != null;
}

OCR result of first page:

",,Ny,NMANNNNT!%3%:TTEW,phipp:VNNWNNNNT}:n-npW#%#?%N\r\n"

babar.raza · May 12, 2015, 1:00am

Hi Robert,

Thank you for contacting Aspose support.

It would be appropriate to create a new thread in the Aspose.Total support forum for Problem 1, as it involves two Aspose APIs: Aspose.Pdf and Aspose.Cells. We will look into Problem 2 here, as this thread is in the Aspose.OCR support forum. However, we need the source PDF file(s) to analyze the scenario thoroughly and pinpoint the cause, so please share a few problematic samples here for our testing.

mrawesome · May 12, 2015, 7:24am

Thanks for the quick response! I will send a document directly to you.

babar.raza · May 12, 2015, 11:01am

Hi Robert,

Thank you for sharing the sample PDF. I have reviewed the sample, and I think the images embedded in the PDF file are of very poor quality. For instance, the character boundaries are wavy for most of the content. In any case, we will perform a few tests with these samples and share our results here for your reference. In the meantime, could you please specify whether you wish to extract all the textual content or a specific part of it?

mrawesome · May 12, 2015, 11:08am

Thanks for your help on this. We would need all of the text content. When we generate an image from the PDF, the content looks blurrier, even with the resolution set to 300.
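
One possible explanation for the blur, stated as an assumption rather than a confirmed diagnosis: the JpegDevice call in the original code passes PageInfo.Width and PageInfo.Height, which are page dimensions in points (1/72 inch), as the image width and height. If the device treats those values as pixel dimensions, the rendered page has only as many pixels as a 72 DPI rendering, whatever the Resolution value says. The arithmetic can be sketched as follows; PageRaster and PointsToPixels are hypothetical names used for illustration, not part of any Aspose API.

```csharp
using System;

static class PageRaster
{
    // 1 PDF point is 1/72 inch, so pixels = points / 72 * dpi.
    public static int PointsToPixels(double points, int dpi)
    {
        return (int)Math.Round(points / 72.0 * dpi);
    }

    static void Main()
    {
        // A US Letter page measures 612 x 792 points (8.5 x 11 inches).
        Console.WriteLine(PointsToPixels(612, 300)); // 2550
        Console.WriteLine(PointsToPixels(792, 300)); // 3300

        // Using the point values directly as pixels corresponds to 72 DPI.
        Console.WriteLine(PointsToPixels(612, 72));  // 612
    }
}
```

For a US Letter page this gives 2550 x 3300 pixels at 300 DPI, compared with 612 x 792 when the point values are used unchanged.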

ikram.haq · May 13, 2015, 3:01am

Hi Robert,

We have carried out the investigation of the issue raised by you in the following manner:

Exercise #1:

1. Save one page of the PDF manually using Adobe.

2. Perform OCR

Exercise #2:

1. Read PDF document page by page.

2. Convert each page into an image, without any special setting.

3. Perform OCR on each image.

Exercise #3:

1. Read PDF document page by page.

2. Convert each page into an image, with special setting like:

    var resolution = new Aspose.Pdf.Devices.Resolution(300);
    var jpegDevice = new JpegDevice(
        Convert.ToInt32(pdfDocument.Pages[pageCount].PageInfo.Width),
        Convert.ToInt32(pdfDocument.Pages[pageCount].PageInfo.Height),
        resolution, 100);

3. Perform OCR on each image.

We observed that Exercise #2 produced the most accurate results of the three, though they are still not up to the mark. Please note that the issue has been logged in our issue tracking system with ID OCR-34045.

We will update you accordingly. We truly appreciate your support and understanding.

mrawesome · May 13, 2015, 7:13am

Can you elaborate on Exercise #2 a bit? What do you mean “without any special setting”?

Do you mean the following?

var resolution = new Aspose.Pdf.Devices.Resolution(300);
var jpegDevice = new JpegDevice(resolution, 100);

ikram.haq · May 13, 2015, 12:15pm

Hi Robert,

"Without any special setting" means without setting any horizontal or vertical resolution (DPI) value and without any quality value.

Following is the code sample for your consideration:

var sb = new StringBuilder();
Document pdfDocument = new Document(@"C:\Files\test1.pdf");

// Save the first embedded image on page 1 as-is, with no resolution or quality settings.
XImage xImage = pdfDocument.Pages[1].Resources.Images[1];
using (FileStream outputImage = new FileStream(@"C:\Files\output.jpg", FileMode.Create))
{
    xImage.Save(outputImage, ImageFormat.Jpeg);
}

// Run OCR on the saved image.
OcrEngine ocrEngine = new OcrEngine();
ocrEngine.Image = ImageStream.FromFile(@"C:\Files\output.jpg");
if (ocrEngine.Process())
{
    sb.Append(ocrEngine.Text);
    sb.Append(Environment.NewLine);
}
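
Since all of the text content is needed, the single-image sample above could be extended to every page and every embedded image. The following is only a sketch under assumptions: it reuses the same Aspose.Pdf and legacy Aspose.OCR calls that appear in this thread (not re-verified against a current release), keeps each image in memory instead of writing it to disk, and uses the hypothetical names EmbeddedImageOcr and ExtractAllText.

```csharp
using System;
using System.IO;
using System.Text;
using Aspose.OCR;

static class EmbeddedImageOcr
{
    // Hypothetical helper: runs OCR on every image embedded in every page of the PDF.
    public static string ExtractAllText(string pdfPath)
    {
        var sb = new StringBuilder();
        var pdfDocument = new Aspose.Pdf.Document(pdfPath);
        var ocrEngine = new OcrEngine();

        foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
        {
            foreach (Aspose.Pdf.XImage xImage in page.Resources.Images)
            {
                using (var imageStream = new MemoryStream())
                {
                    // Save the embedded image as-is, with no resolution or quality settings.
                    xImage.Save(imageStream, System.Drawing.Imaging.ImageFormat.Jpeg);

                    // Rewind so the OCR engine reads the image from the start of the stream.
                    imageStream.Position = 0;

                    ocrEngine.Image = ImageStream.FromMemoryStream(imageStream, ImageStreamFormat.Jpg);
                    if (ocrEngine.Process())
                    {
                        sb.Append(ocrEngine.Text);
                        sb.Append(Environment.NewLine);
                    }
                }
            }
        }

        return sb.ToString();
    }
}
```

A call such as EmbeddedImageOcr.ExtractAllText(@"C:\Files\test1.pdf") would then return the recognized text of all embedded images in page order. This only helps for scanned PDFs whose pages consist of embedded images; pages with no embedded images contribute nothing.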

awais.hafeez · March 29, 2018, 5:23am

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.