I am evaluating Aspose.OCR. I need to take a flattened PDF file, process a page that has grids on it, and turn that image of a grid into actual grids. I need to do this across many files and many structures.
The code I have hangs at the ocr.Recognize line. It runs and runs for over 10 minutes with no error code and no timeout. I have no idea what is wrong. I have coded this a few different ways, using both a MemoryStream and the PDF file directly. Please help so I can evaluate this product.
If there is a better way to get tables from a flattened PDF file, I am open to any ideas you have for accuracy and processing speed.
I can already process non-flattened PDF files, so a way to turn a flattened file into a data PDF file would work as well.
Dim results As List(Of Aspose.OCR.RecognitionResult) = ocr.Recognize(input, settings)
This line hangs no matter how I try to call it.
Here is the code I am running; it hangs on that line.
The attached PDF file is a one-page file with some grids on it. The file is flattened, and I need to read grids from flattened files. I can't evaluate your OCR API if I can't use it.
It keeps hanging on the results line. It just runs and runs: no errors, no timeout, no reason. Any help would be appreciated.
Public Function ConvertFlattenedPdfPageToCsvString(pdfPath As String, pageNumber As Integer, Optional dpi As Integer = 300) As String
    ' 1) Render the page to PNG bytes
    Dim pngBytes As Byte() = RenderPageToPng(pdfPath, pageNumber, dpi)

    ' 2) Build OCR input from the PNG stream
    Dim ocr As New Aspose.OCR.AsposeOcr()
    Dim settings As New Aspose.OCR.RecognitionSettings() With {
        .DetectAreasMode = Aspose.OCR.DetectAreasMode.TABLE
    }

    Using ms As New MemoryStream(pngBytes)
        Dim input As New Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage)
        ms.Position = 0
        input.Add(ms)

        ' 3) Run OCR (returns List(Of RecognitionResult)); take the first result
        Dim results As List(Of Aspose.OCR.RecognitionResult) = ocr.Recognize(input, settings)
        If results Is Nothing OrElse results.Count = 0 Then
            Throw New InvalidOperationException("OCR returned no results.")
        End If
        Dim result As Aspose.OCR.RecognitionResult = results(0)

        ' 4) Save OCR result to XLSX in-memory
        Using xlsxStream As New MemoryStream()
            result.Save(xlsxStream, Aspose.OCR.SaveFormat.Xlsx)
            xlsxStream.Position = 0

            ' 5) Use Aspose.Cells to convert XLSX -> CSV (UTF-8, quoted)
            Dim wb As New Workbook(xlsxStream)
            Dim csvOpts As New TxtSaveOptions(Aspose.Cells.SaveFormat.Csv) With {
                .Separator = ","c,
                .Encoding = System.Text.Encoding.UTF8,
                .AlwaysQuoted = True
            }
            Using csvStream As New MemoryStream()
                wb.Save(csvStream, csvOpts)
                Return System.Text.Encoding.UTF8.GetString(csvStream.ToArray())
            End Using
        End Using
    End Using
End Function

'--- Private: Render a single PDF page to PNG bytes at the specified DPI
Private Function RenderPageToPng(pdfPath As String,
                                 pageNumber As Integer,
                                 dpi As Integer) As Byte()
    Using doc As New Document(pdfPath)
        If pageNumber < 1 OrElse pageNumber > doc.Pages.Count Then
            Throw New ArgumentOutOfRangeException(NameOf(pageNumber),
                $"Page {pageNumber} is out of range. Document has {doc.Pages.Count} pages.")
        End If
        Dim res As New Resolution(dpi)
        Dim device As New PngDevice(res)
        Using outMs As New MemoryStream()
            device.Process(doc.Pages(pageNumber), outMs)
            Return outMs.ToArray()
        End Using
    End Using
End Function
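
Since the call never errors or times out on its own, one stopgap would be to wrap it in a timeout so the hang at least surfaces as an exception. This is only a minimal sketch: RunWithTimeout is a hypothetical helper name, not part of Aspose.OCR, and it relies only on the standard .NET Task API. It stops waiting after the timeout but does not cancel the underlying call, which may keep running in the background.

```vbnet
Imports System
Imports System.Threading.Tasks

Module OcrTimeoutSketch
    ' Hypothetical helper (not an Aspose API): runs the work on a thread-pool
    ' thread and stops waiting once the timeout elapses.
    ' Note: the work itself is NOT cancelled; it may continue in the background.
    ' If the work throws, Wait rethrows it wrapped in an AggregateException.
    Public Function RunWithTimeout(Of T)(work As Func(Of T), timeout As TimeSpan) As T
        Dim job As Task(Of T) = Task.Run(Of T)(work)
        If Not job.Wait(timeout) Then
            Throw New TimeoutException(
                $"Operation did not finish within {timeout.TotalSeconds} seconds.")
        End If
        Return job.Result
    End Function
End Module
```

It could be used in place of the direct call, for example: Dim results = RunWithTimeout(Function() ocr.Recognize(input, settings), TimeSpan.FromMinutes(2)). The two-minute limit is an arbitrary value chosen for illustration.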
Best Regards,
Mase Woods
MaseW@bidtracer.com
480.734.5077