Need to extract text from the image along with the text present in the PDF

I am currently using TextAbsorber to extract the text in a PDF, but some of my PDFs also contain images, and I need to extract the text from those images as well. Below is the C# code I am using to extract the text.

private static string ExtractTextFromPDF([FromForm] FileUploadRequest convertRequest)
{
    try
    {
        if (convertRequest.FileData != null && convertRequest.FileData.Length > 0)
        {
            var pdfDocument = new Document(new MemoryStream(convertRequest.FileData));

            // Create TextAbsorber object to extract text
            var textAbsorber = new TextAbsorber();
            // Accept the absorber for all the pages
            pdfDocument.Pages.Accept(textAbsorber);
            // Get the extracted text
            var extractedText = textAbsorber.Text;
            return extractedText;
        }
        else
        {
            return string.Empty;
        }
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace
        throw;
    }
}

Here the PDF is passed in the convertRequest.FileData parameter.
I am attaching the PDF along with an image for reference. Please guide me on how to extract the text from the images. (The images can be SVG.)
Kroll 2022 Annual Report.zip (8.8 MB)

@Vijayalakshmisridharan

Do you mean images like the one below, from which you want to extract text?

image.png (128.2 KB)

Hi,
I have a few edge cases where I cannot extract the information from the PDF.

  1. One such example is attached, along with the text extracted from the PDF.
    Ex: 10-trends-to-watch-heading-into-2023 on page 36. I don't see any of the text from the image.
    Kroll_MissingInfo.png (148.3 KB)

  2. When the PDF has multiple columns, TextAbsorber extracts the text incorrectly: it reads across the columns instead of down each column.
    Ex: Nova Ukraine, Sandy Hook Promise, Americares Hurricane Ian Fund on page 11.

But when I convert the PDF to HTML and extract the text, I don't see the multi-column issue. Is there an option to avoid this in TextAbsorber as well?

Below is the code I use to extract the text from the HTML:

using (var document = new HTMLDocument(convertedHTML, ""))
{
    Aspose.Html.Dom.Traversal.INodeIterator iterator = document.CreateNodeIterator(document, Aspose.Html.Dom.Traversal.Filters.NodeFilter.SHOW_TEXT, new StyleFilter());
    var sb = new StringBuilder();
    Aspose.Html.Dom.Node node;
    while ((node = iterator.NextNode()) != null)
    {
        sb.Append(node.NodeValue);
    }

    convertedText = sb.ToString();
}

krol_extractedTextFromhtml.zip (18.5 KB)

krol_extractedTextFrompdf.zip (20.0 KB)

@Vijayalakshmisridharan

We are checking it and will get back to you shortly.

@Vijayalakshmisridharan

Thanks for your patience, and we are sorry for the delayed response. We have completed an initial investigation of the case, testing with both the Aspose.PDF and Aspose.OCR APIs.

Regarding issue #1 (no text extracted from the image on page 36):

This is not a bug but the expected behavior of the API. Aspose.PDF does not process images to extract text or perform OCR on them. The TextAbsorber class detects text in the PDF and skips image objects. To extract text from the images, you will need to run OCR on them using Aspose.OCR.
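As a rough sketch of that approach, the snippet below saves each raster image embedded in a page to a stream and passes it to Aspose.OCR. It assumes the images are stored as XImage resources on the page. Vector drawings (for example, SVG-like graphics) and text converted to outlines do not appear in Resources.Images, so for those the whole page would have to be rendered to an image instead. The class and method names are illustrative only, and the result quality has not been verified against the attached PDF.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf;

public static class ImageOcrSketch
{
    // Hypothetical helper: runs OCR on every raster image embedded in one page.
    public static string OcrEmbeddedImages(Document pdfDocument, int pageNumber)
    {
        var api = new Aspose.OCR.AsposeOcr();
        var sb = new StringBuilder();

        foreach (XImage image in pdfDocument.Pages[pageNumber].Resources.Images)
        {
            using (var ms = new MemoryStream())
            {
                // Write the embedded image to the stream
                image.Save(ms);
                // Rewind so the image is read from the start
                ms.Position = 0;

                var input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage);
                input.Add(ms);
                var results = api.Recognize(input, new Aspose.OCR.RecognitionSettings());

                foreach (Aspose.OCR.RecognitionResult result in results)
                {
                    sb.AppendLine(result.RecognitionText);
                }
            }
        }

        return sb.ToString();
    }
}
```

The returned string could be appended to the TextAbsorber output for the same page, so that regular text and image text end up in one result.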

Regarding issue #2 (multi-column text read across columns):

We were able to replicate this issue in our environment using version 24.4 of the API. It has been logged as PDFNET-57054 in our issue tracking system so that it can be fixed. We will inform you as soon as it is resolved.
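Until that ticket is resolved, one possible workaround is to extract each column separately by restricting TextAbsorber to a rectangle. The sketch below is not an official fix: it assumes the page is split into equal-width columns, which is unlikely to match a real layout exactly, so the column boundaries would need to be adjusted per document. The helper name and the columnCount parameter are hypothetical.

```csharp
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ColumnTextSketch
{
    // Hypothetical helper: reads a page column by column, left to right.
    public static string ExtractByColumns(Document pdfDocument, int pageNumber, int columnCount)
    {
        Page page = pdfDocument.Pages[pageNumber];
        Aspose.Pdf.Rectangle pageRect = page.Rect;

        // Assumption: all columns have the same width
        double columnWidth = pageRect.Width / columnCount;
        var sb = new StringBuilder();

        for (int i = 0; i < columnCount; i++)
        {
            double left = pageRect.LLX + i * columnWidth;
            var region = new Aspose.Pdf.Rectangle(left, pageRect.LLY, left + columnWidth, pageRect.URY);

            var absorber = new TextAbsorber();
            // Only pick up text that lies inside the column rectangle
            absorber.TextSearchOptions.LimitToPageBounds = true;
            absorber.TextSearchOptions.Rectangle = region;

            page.Accept(absorber);
            sb.AppendLine(absorber.Text);
        }

        return sb.ToString();
    }
}
```

For the page 11 example, a call such as ExtractByColumns(pdfDocument, 11, 3) would be a starting point, assuming that page really has three columns of similar width.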

Furthermore, please note that Aspose.OCR can extract text from scanned PDF documents only. That means you cannot run OCR directly on PDFs that have mixed content (text + images). You can, however, convert all pages or specific pages of the document into images and perform OCR on them using the code snippet below:

string pdfPath = $"{dataDir}Kroll 2022 Annual Report.pdf";

List<Aspose.OCR.RecognitionResult> ocrResults = new List<Aspose.OCR.RecognitionResult>();
Aspose.OCR.AsposeOcr api = new Aspose.OCR.AsposeOcr();

// Resolution resolution = new Resolution(300);
// PngDevice imageDevice = new PngDevice(resolution);
PngDevice imageDevice = new PngDevice();
Document pdfDocument = new Document(pdfPath);

for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    using (MemoryStream ms = new MemoryStream())
    {
        // Convert a particular page and save the image to stream
        imageDevice.Process(pdfDocument.Pages[pageCount], ms);

        Aspose.OCR.OcrInput input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage);
        input.Add(ms);
        var recognResult = api.Recognize(input, new Aspose.OCR.RecognitionSettings { DetectAreasMode = Aspose.OCR.DetectAreasMode.TABLE });
        ocrResults.Add(recognResult[0]);
        ms.Close();
    }
}

Aspose.OCR.AsposeOcr.SaveMultipageDocument(dataDir + "/res.txt", Aspose.OCR.SaveFormat.Text, ocrResults);

Please note that the above code snippet converts all pages of the PDF into images and performs OCR on them to generate the results. In your case, you can run it only on the pages that contain images, e.g. page 36. Please feel free to let us know if you need any further information or face any difficulty while using the API features.
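A minimal sketch of that selective approach follows. It renders only the pages that report at least one embedded image and uses the 300 dpi resolution that is commented out in the snippet above. Two assumptions are worth flagging: pages whose graphics are purely vector drawings report no images and would be skipped, and the higher resolution is only a guess at improving recognition quality, not a confirmed remedy. The class and method names are illustrative.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

public static class SelectivePageOcrSketch
{
    // Hypothetical helper: OCRs only the pages that contain embedded images.
    public static List<Aspose.OCR.RecognitionResult> OcrPagesWithImages(Document pdfDocument)
    {
        var api = new Aspose.OCR.AsposeOcr();
        var ocrResults = new List<Aspose.OCR.RecognitionResult>();
        var imageDevice = new PngDevice(new Resolution(300));

        for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
        {
            Page page = pdfDocument.Pages[pageNumber];

            // Skip pages without embedded raster images
            if (page.Resources.Images.Count == 0)
            {
                continue;
            }

            using (var ms = new MemoryStream())
            {
                // Render the page to PNG at 300 dpi
                imageDevice.Process(page, ms);
                ms.Position = 0;

                var input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage);
                input.Add(ms);
                var recognResult = api.Recognize(input, new Aspose.OCR.RecognitionSettings());
                ocrResults.Add(recognResult[0]);
            }
        }

        return ocrResults;
    }
}
```

The returned list can be passed to SaveMultipageDocument in the same way as in the snippet above.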

Hi,
I tried the above code with the same PDF, loaded from a memory stream, for page 36 only, but I am not getting the expected result: it does not extract the text. I am attaching the generated text file. Please let me know whether you see the same behavior.

SAMPLE CODE:
List<Aspose.OCR.RecognitionResult> ocrResults = new List<Aspose.OCR.RecognitionResult>();
Aspose.OCR.AsposeOcr api = new Aspose.OCR.AsposeOcr();

PngDevice imageDevice = new PngDevice();
Document pdfDocument = new Document(new MemoryStream(fileUploadRequest.FileData));

int pageCount = 36;
using (MemoryStream m = new MemoryStream())
{
    // Convert a particular page and save the image to stream
    imageDevice.Process(pdfDocument.Pages[pageCount], m);

    Aspose.OCR.OcrInput input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage);
    input.Add(m);
    var recognResult = api.Recognize(input, new Aspose.OCR.RecognitionSettings { DetectAreasMode = Aspose.OCR.DetectAreasMode.TABLE });
    ocrResults.Add(recognResult[0]);
    m.Close();
}

var dataDir = @"C:\Users\Document";
Aspose.OCR.AsposeOcr.SaveMultipageDocument(dataDir + "/res.txt", Aspose.OCR.SaveFormat.Text, ocrResults);

OCRtext.zip (299 Bytes)

@Vijayalakshmisridharan

Have you used Aspose.OCR with a valid license? Can you please make sure that you are setting the license for it as well before calling its methods?

Hi,
I have an Aspose.Total license. I have set the license now, but the generated text still doesn't look as good as expected; it contains a lot of junk. Is there any way to get the images alone and extract text from them, or do you have an alternative suggestion? I am attaching the generated text for reference. If possible, could you send the text that you extracted from the above PDF file? I just need to compare the two.

OCRText_1.zip (1.3 KB)

@Vijayalakshmisridharan

Attached is the result that we obtained in our environment.
res.zip (23.2 KB)

We also noticed some garbage characters in it, and an issue has been logged as OCRNET-834 in our issue tracking system. We will work on fixing it and let you know as soon as it is resolved. Please bear with us while we do so. We are sorry for the inconvenience.