Unable to extract the text from PDF file

Hi Team,

We are trying to extract the text content from a PDF file, but Aspose fails to extract the full content of the PDF.

Here is the sample code we are using:
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();

The PDF file is attached, along with a screenshot in which we have highlighted the text that Aspose did not extract.
We appreciate your help.

Thanks
Issue PDF extraction.zip (2.0 MB)

@forasposeissues
I’ll investigate the issue and get back to you as soon as possible.

@forasposeissues
It seems that the content at the beginning of the document is not text but an image.
When the file is opened in Adobe Acrobat, Acrobat tries to convert it to text, but when it is opened in a browser you cannot even copy any content from the parts you mentioned.
Nevertheless, I’ll add a task for the development team to investigate whether there is any way to extract text in such cases.

@forasposeissues
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-57487

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Hi Team,

Any update on this issue?

@forasposeissues
I checked the issue, and it seems there have not been any updates on it.

Hi Team,
Do we have any update on the issue?

@forasposeissues
Currently the task has no updates.
I’ll try to raise its priority; hopefully that will help to resolve it faster.

Team, any update on the issue?

@forasposeissues
The issue is still open. Tomorrow I’ll try to ask the development team directly whether it can be checked soon.

@forasposeissues
In short, there is currently no integrated OCR in Aspose.PDF, so such images are not treated as text, consistent with the PDF documentation.
To resolve your issue, you will need to process the document with a product that supports OCR in order to convert the images to text.

Could you please share a sample code snippet showing how OCR can be used to extract text from an image?

@forasposeissues
It depends on which tools you are going to use.
As an example, here is the solution suggested for the IronOcr .NET library:

```
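using IronOcr;

// Create the OCR engine (IronOcr's Tesseract wrapper).
var ocr = new IronTesseract();

// Queue the inputs to recognize: a standalone image and a PDF document.
using var input = new OcrInput();
input.LoadImage("attachment.png");
input.LoadPdf("report.pdf");

// Run OCR over everything loaded above and read the recognized text.
OcrResult result = ocr.Read(input);
string text = result.Text;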
```

Is there a solution provided by Aspose? If so, please share the details for Java.

Thank you

@forasposeissues
I don’t think there is one at the moment.
When I asked whether there was any work in this direction, I was told that there isn’t,
so I’m afraid you will have to find some other means of converting the image to text.
I’ll ask again in case something has changed recently, but I don’t have high hopes.

UPD
I was given the following solution.

Basically, it is an instruction for working with Tesseract, but at least it does not require any external libraries.

```
// Callback that runs the Tesseract command-line tool on an image and returns its hOCR output.
static string CallBackGetHocr(System.Drawing.Image img)
{
    string tmpFile = System.IO.Path.GetTempFileName();
    try
    {
        // Save the image to a temporary bitmap file that Tesseract can read.
        using (System.Drawing.Bitmap bmp = new System.Drawing.Bitmap(img))
        {
            bmp.Save(tmpFile, System.Drawing.Imaging.ImageFormat.Bmp);
        }

        // Tesseract appends ".hocr" to the output base name, so the same path
        // serves as both the input file and the output base.
        string inputFile = string.Concat('"', tmpFile, '"');
        string outputFile = string.Concat('"', tmpFile, '"');
        string arguments = string.Concat(inputFile, " ", outputFile, " -l eng hocr");

        // Adjust this path to match the local Tesseract installation.
        string tesseractProcessName = @"C:\Program Files\Tesseract-OCR\Tesseract.exe";

        System.Diagnostics.ProcessStartInfo psi =
            new System.Diagnostics.ProcessStartInfo(tesseractProcessName, arguments)
            {
                UseShellExecute = true,
                CreateNoWindow = true,
                WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden,
                WorkingDirectory = System.IO.Path.GetDirectoryName(tesseractProcessName)
            };

        // Run Tesseract and wait until it has finished writing the result.
        using (System.Diagnostics.Process p = new System.Diagnostics.Process { StartInfo = psi })
        {
            p.Start();
            p.WaitForExit();
        }

        // Read the hOCR markup produced by Tesseract.
        return System.IO.File.ReadAllText(tmpFile + ".hocr");
    }
    finally
    {
        // Remove the temporary image and the generated hOCR file.
        if (System.IO.File.Exists(tmpFile))
            System.IO.File.Delete(tmpFile);
        if (System.IO.File.Exists(tmpFile + ".hocr"))
            System.IO.File.Delete(tmpFile + ".hocr");
    }
}
```
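
For Java, a rough equivalent of the callback above could look like the sketch below. It is not an Aspose API; it only launches the Tesseract command-line tool through `ProcessBuilder`, using the same arguments as the C# sample. It assumes Java 11 or later and a local Tesseract installation. The class name `TesseractCli`, the method names `buildCommand` and `runHocr`, and the executable path you pass in are placeholders chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Aspose: wraps the Tesseract command-line tool.
final class TesseractCli {

    // Builds the command line: <tesseract> <image> <outputBase> -l <lang> hocr
    static List<String> buildCommand(String tesseractExe, Path image, Path outputBase, String lang) {
        return List.of(tesseractExe, image.toString(), outputBase.toString(), "-l", lang, "hocr");
    }

    // Runs Tesseract on an image file and returns the hOCR markup it writes.
    static String runHocr(String tesseractExe, Path image) throws IOException, InterruptedException {
        // Tesseract appends ".hocr" to the output base name.
        Path outputBase = Files.createTempFile("ocr-", "");
        Path hocrFile = Path.of(outputBase.toString() + ".hocr");
        try {
            Process process = new ProcessBuilder(buildCommand(tesseractExe, image, outputBase, "eng"))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore console output
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return Files.readString(hocrFile, StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary output base and the generated hOCR file.
            Files.deleteIfExists(outputBase);
            Files.deleteIfExists(hocrFile);
        }
    }
}
```

A call such as `TesseractCli.runHocr("C:\\Program Files\\Tesseract-OCR\\tesseract.exe", Path.of("page.png"))` would return hOCR markup for a hypothetical `page.png`. If plain text is enough, dropping the trailing `hocr` argument makes Tesseract write `<outputBase>.txt` instead, and the file name read back would need to change accordingly.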

Hi team,

Could you please ask your team whether we can use Aspose to convert an image to text? We use the Aspose library for most of our file operations.

We appreciate your help.

@forasposeissues
Based on the discussions, it does not seem that this will be implemented soon, and if I add a feature task it may take a very long time to be implemented. I would therefore recommend using the solution I suggested.

We cannot change the existing implementation. However, we can tweak the sample code below to handle both image and text content if Aspose provides a solution.
This issue has low priority for us, so we can wait for the solution, as we do not want to use an external library.

TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();

Please open a ticket for this.


@forasposeissues
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58807

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.