- I am trying to extract a table from a single-page scanned PDF, which contains a table and some paragraphs of text, using Aspose.OCR and Aspose.PDF.
- Below is a code sample:
    [HttpPost("read-pdf-aspose")]
    public async Task<StandardResponse<string>> ReadPdfAspose([FromBody] PdfFilePathRequest request)
    {
        try
        {
            if (string.IsNullOrWhiteSpace(request.PdfFilePath))
            {
                return new StandardResponse<string>
                {
                    Status = false,
                    Message = "PDF file path cannot be empty."
                };
            }

            // Create temporary directory for images
            string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
            Directory.CreateDirectory(tempPath);
            try
            {
                // Load PDF document
                using (Document pdfDocument = new Document(request.PdfFilePath))
                {
                    // Initialize OCR engine
                    Aspose.OCR.AsposeOcr recognitionEngine = new Aspose.OCR.AsposeOcr();
                    StringBuilder extractedText = new StringBuilder();

                    // Process each page (Aspose.PDF page and image collections are 1-based)
                    for (int pageIndex = 1; pageIndex <= pdfDocument.Pages.Count; pageIndex++)
                    {
                        var page = pdfDocument.Pages[pageIndex];

                        // Save images from the page
                        for (int imgIndex = 1; imgIndex <= page.Resources.Images.Count; imgIndex++)
                        {
                            string imagePath = Path.Combine(tempPath, $"page_{pageIndex}_img_{imgIndex}.png");

                            // Extract and save the image
                            using (FileStream imageStream = new(imagePath, FileMode.Create))
                            {
                                page.Resources.Images[imgIndex].Save(imageStream);
                            }

                            // Set up OCR for table detection
                            Aspose.OCR.OcrInput input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage);
                            input.Add(imagePath);

                            // Configure to detect tables
                            Aspose.OCR.RecognitionSettings recognitionSettings = new Aspose.OCR.RecognitionSettings();
                            recognitionSettings.DetectAreasMode = Aspose.OCR.DetectAreasMode.TABLE;
                            recognitionSettings.LinesFiltration = true;

                            // Perform recognition
                            Aspose.OCR.OcrOutput results = recognitionEngine.Recognize(input, recognitionSettings);

                            // Collect recognized text
                            extractedText.AppendLine($"--- Table content from Page {pageIndex}, Image {imgIndex} ---");
                            foreach (Aspose.OCR.RecognitionResult result in results)
                            {
                                extractedText.AppendLine(result.RecognitionText);
                            }
                            extractedText.AppendLine();
                        }
                    }

                    return new StandardResponse<string>
                    {
                        Status = true,
                        Message = "Tables extracted successfully from PDF using Aspose",
                        Data = extractedText.ToString()
                    };
                }
            }
            finally
            {
                // Cleanup: Delete temporary directory and files
                if (Directory.Exists(tempPath))
                {
                    Directory.Delete(tempPath, true);
                }
            }
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error processing PDF with Aspose");
            return new StandardResponse<string>
            {
                Status = false,
                Message = ex.Message
            };
        }
    }
- Here I am sending the file path of the attachment below:
table and text content pdf.pdf (138.2 KB)
- The response I receive is:
{
"status": true,
"message": "Tables extracted successfully from PDF using Aspose",
"data": "--- Table content from Page 1, Image 1 ---\r\nThe Main Table required for evaluation header\nFirst Header Second Header Third Header Fourth Header\nFirst row First First row Second First row Third First row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data data data\nSecond row First Second row Second row Third Second row Fourth\ncolumn Sample Second colum column Sample column Sample\ndata Sample data data data\nThird row First Third row Second Third row Third Third row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data\nSample text for differentiation between Table data and outer data\nIpsum has been the industry's standard dummy text ever since the 1500s,when an\nunknown printer took a galley of type and scrambled it to make a type specimen book. It\nhas survived not only five centuries, but also the leap into electronic typesetting, remaining\nessentially\r\n\r\n"
}
- If I comment out the following line in the code:
recognitionSettings.DetectAreasMode = Aspose.OCR.DetectAreasMode.TABLE;
- the response I get is:
{
"status": true,
"message": "Tables extracted successfully from PDF using Aspose",
"data": "--- Table content from Page 1, Image 1 ---\r\nThe Main Table required for evaluation header\nFirst Header Second Header Third Header Fourth Header\nFirst row First First row Second First row Third First row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data data data\nSecond row First Second row Second row Third Second row Fourth\ncolumn Sample Second column column Sample column Sample\ndata Sample data data data\nThird row First Third row Second Third row Third Third row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data\nSample text for differentiation between Table data and outer data\nIpsum has been the industry's standard dummy text ever since the 1500s,when an\nunknown printer took a galley of type and scrambled it to make a type specimen book.It\nhas survived not only five centuries, but also the leap into electronic typesetting, remaining\nessentially\r\n\r\n"
}
- As we can see, that setting makes no real difference. How can I extract just the table from my scanned PDF with Aspose.OCR?
- Is there any method that recognizes tables specifically and returns just the tables and their contents?
- My requirement is that it should extract all the text and the tables while preserving their positions.
- For example, suppose a page has four lines of text, then one table with content, and then another two lines of text.
- The code should then extract all the data in order: the four lines at the beginning, then the table content marked with some functionality or tags, such as
content
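
To make the requirement above concrete, here is a minimal sketch of the kind of output I have in mind. It assumes the recognized blocks are already available with a vertical position and a flag saying whether each one is a table. `PageBlock`, `LayoutFormatter`, and the `<table>` tag are hypothetical names used only for illustration; none of them is part of Aspose.OCR or Aspose.PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical block type, NOT an Aspose class. Top is the block's
// vertical offset from the top of the page (smaller means higher up).
public record PageBlock(double Top, bool IsTable, string Text);

public static class LayoutFormatter
{
    // Orders blocks top-to-bottom and wraps table blocks in <table> tags,
    // so plain text and tables keep their relative positions.
    public static string Format(IEnumerable<PageBlock> blocks)
    {
        var output = new StringBuilder();
        foreach (var block in blocks.OrderBy(b => b.Top))
        {
            output.AppendLine(block.IsTable
                ? $"<table>{block.Text}</table>"
                : block.Text);
        }
        return output.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample data in arbitrary order; the values are made up.
        var blocks = new List<PageBlock>
        {
            new(300, false, "Text line after the table"),
            new(100, true, "First Header | Second Header"),
            new(10, false, "Text line before the table"),
        };
        Console.Write(LayoutFormatter.Format(blocks));
    }
}
```

The open question is how to get the equivalent of `IsTable` and the block positions out of Aspose.OCR in the first place.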