How can I extract the Table with special tags along with the rest of the

izzmekaif · March 5, 2025, 5:54pm

I am trying to extract a table from a single-page OCR PDF that contains a table and some other paragraphs, using Aspose.OCR and Aspose.PDF.
Below is my code sample:

        [HttpPost("read-pdf-aspose")]
        public async Task<StandardResponse<string>> ReadPdfAspose([FromBody] PdfFilePathRequest request)
        {
            try
            {
                if (string.IsNullOrWhiteSpace(request.PdfFilePath))
                {
                    return new StandardResponse<string>
                    {
                        Status = false,
                        Message = "PDF file path cannot be empty."
                    };
                }

                // Create temporary directory for images
                string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
                Directory.CreateDirectory(tempPath);

                try
                {
                    // Load PDF document
                    using (Document pdfDocument = new Document(request.PdfFilePath))
                    {
                        // Initialize OCR engine
                        Aspose.OCR.AsposeOcr recognitionEngine = new Aspose.OCR.AsposeOcr();
                        StringBuilder extractedText = new StringBuilder();

                        // Process each page
                        for (int pageIndex = 1; pageIndex <= pdfDocument.Pages.Count; pageIndex++)
                        {
                            var page = pdfDocument.Pages[pageIndex];

                            // Save images from the page
                            for (int imgIndex = 1; imgIndex <= page.Resources.Images.Count; imgIndex++)
                            {
                                string imagePath = Path.Combine(tempPath, $"page_{pageIndex}_img_{imgIndex}.png");

                                // Extract and save the image
                                using (FileStream imageStream = new(imagePath, FileMode.Create))
                                {
                                    page.Resources.Images[imgIndex].Save(imageStream);
                                }

                                // Set up OCR for table detection
                                Aspose.OCR.OcrInput input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage);
                                input.Add(imagePath);

                                // Configure to detect tables
                                Aspose.OCR.RecognitionSettings recognitionSettings = new Aspose.OCR.RecognitionSettings();
                                recognitionSettings.DetectAreasMode = Aspose.OCR.DetectAreasMode.TABLE;
                                recognitionSettings.LinesFiltration = true;

                                // Perform recognition
                                Aspose.OCR.OcrOutput results = recognitionEngine.Recognize(input, recognitionSettings);

                                // Collect recognized text
                                extractedText.AppendLine($"--- Table content from Page {pageIndex}, Image {imgIndex} ---");
                                foreach (Aspose.OCR.RecognitionResult result in results)
                                {
                                    extractedText.AppendLine(result.RecognitionText);
                                }
                                extractedText.AppendLine();
                            }
                        }

                        return new StandardResponse<string>
                        {
                            Status = true,
                            Message = "Tables extracted successfully from PDF using Aspose",
                            Data = extractedText.ToString()
                        };
                    }
                }
                finally
                {
                    // Cleanup: Delete temporary directory and files
                    if (Directory.Exists(tempPath))
                    {
                        Directory.Delete(tempPath, true);
                    }
                }
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Error processing PDF with Aspose");
                return new StandardResponse<string>
                {
                    Status = false,
                    Message = ex.Message
                };
            }
        }

Here I send the file path of the attachment below:
table and text content pdf.pdf (138.2 KB)
The response I receive is:

{
    "status": true,
    "message": "Tables extracted successfully from PDF using Aspose",
    "data": "--- Table content from Page 1, Image 1 ---\r\nThe Main Table required for evaluation header\nFirst Header Second Header Third Header Fourth Header\nFirst row First First row Second First row Third First row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data data data\nSecond row First Second row Second row Third Second row Fourth\ncolumn Sample Second colum column Sample column Sample\ndata Sample data data data\nThird row First Third row Second Third row Third Third row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data\nSample text for differentiation between Table data and outer data\nIpsum has been the industry's standard dummy text ever since the 1500s,when an\nunknown printer took a galley of type and scrambled it to make a type specimen book. It\nhas survived not only five centuries, but also the leap into electronic typesetting, remaining\nessentially\r\n\r\n"
}

If I comment out the following line in the code:

recognitionSettings.DetectAreasMode = Aspose.OCR.DetectAreasMode.TABLE;

the response I get is:

{
    "status": true,
    "message": "Tables extracted successfully from PDF using Aspose",
    "data": "--- Table content from Page 1, Image 1 ---\r\nThe Main Table required for evaluation header\nFirst Header Second Header Third Header Fourth Header\nFirst row First First row Second First row Third First row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data data data\nSecond row First Second row Second row Third Second row Fourth\ncolumn Sample Second column column Sample column Sample\ndata Sample data data data\nThird row First Third row Second Third row Third Third row Fourth\ncolumn Sample column Sample column Sample column Sample\ndata data\nSample text for differentiation between Table data and outer data\nIpsum has been the industry's standard dummy text ever since the 1500s,when an\nunknown printer took a galley of type and scrambled it to make a type specimen book.It\nhas survived not only five centuries, but also the leap into electronic typesetting, remaining\nessentially\r\n\r\n"
}

As we can see, the setting makes practically no difference to the output. How can I extract just the table from my OCR document with Aspose.OCR?
Is there a method that recognizes tables specifically and returns just the tables and their contents?
My requirement is to extract all the text and the tables while preserving their positions. For example:

- Suppose one page has four lines of text, then one table with content, then another two lines of text.

- The code should extract all of the data: the four lines at the beginning, then the table content wrapped in some marker or tags, such as

content

and then the other two lines.

asad.ali · March 5, 2025, 10:37pm

@izzmekaif

As per our understanding, you want the API to detect only the table in a given image and extract its data via OCR. Or, if the image has both a table and other text, you want the API to extract the text with some marker that differentiates table data from other text. Please confirm whether we have understood correctly, as we need to investigate against these requirements. We will log an investigation ticket and share its ID with you.

izzmekaif · March 6, 2025, 4:41am

Yes, you got that right, Asad. Let me be a little more specific.

My requirement is that I need the text extracted for the full page. The page may contain paragraphs and tables. The paragraph text can be extracted easily, but for a table I want some unique marker that identifies “this is the table data”; it can be any tags. The extracted content should also keep the same order as in the scanned page: if the page contains a heading, a table, and a paragraph, the extracted text should be the heading, the table contents, and the paragraph, from top to bottom.
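
To make the requested output concrete, here is a minimal sketch of the ordering and tagging step only. It assumes the OCR step can somehow supply each region's text, its top-left position, and a flag saying whether the region is a table; whether Aspose.OCR can provide that flag is exactly what is being asked here. `Region`, `PageLayout`, and the `[TABLE]` markers are hypothetical names chosen for illustration and are not part of any Aspose API.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical shape for one recognized region; NOT an Aspose.OCR type.
// Top/Left are the pixel coordinates of the region's upper-left corner.
public record Region(int Top, int Left, string Text, bool IsTable);

public static class PageLayout
{
    // Emits regions top to bottom (ties broken left to right) and wraps
    // every table region in marker lines. The marker strings are arbitrary.
    public static string Render(
        IEnumerable<Region> regions,
        string open = "[TABLE]",
        string close = "[/TABLE]")
    {
        var sb = new StringBuilder();
        foreach (var region in regions.OrderBy(r => r.Top).ThenBy(r => r.Left))
        {
            if (region.IsTable)
            {
                sb.Append(open).Append('\n');
                sb.Append(region.Text).Append('\n');
                sb.Append(close).Append('\n');
            }
            else
            {
                sb.Append(region.Text).Append('\n');
            }
        }
        return sb.ToString();
    }
}
```

Given a heading at the top of the page, a table below it, and a paragraph at the bottom, `PageLayout.Render` returns the heading line, then the table text between the two marker lines, then the paragraph, regardless of the order in which the regions were supplied.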

asad.ali · March 6, 2025, 4:34pm

@izzmekaif

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-995

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

izzmekaif · March 7, 2025, 4:29am

Can I expect an ETA for this issue?

asad.ali · March 7, 2025, 5:38pm

@izzmekaif

The issue is logged under the free support forum, where it will be prioritized on a first-come, first-served basis. We need to check whether this functionality is feasible. As soon as we make some progress towards resolving the ticket, we will let you know. Please be patient and spare us some time.

We are sorry for the inconvenience.