Unable to extract the text from PDF file

forasposeissues · June 20, 2024, 7:52am

Hi Team,

We are trying to extract the text content from a PDF file, but Aspose fails to extract the full content of the PDF.

Here is the sample code we are using:
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();

The PDF file is attached, along with a screenshot in which we have highlighted the text that Aspose did not extract.
We appreciate your help.

Thanks
Issue PDF extraction.zip (2.0 MB)

ilyazhuykov · June 20, 2024, 5:37pm

@forasposeissues
I’ll investigate the issue and write to you as soon as possible.

ilyazhuykov · June 21, 2024, 7:18am

@forasposeissues
It seems that the text at the beginning isn’t text content but rather an image.
When the file is opened, Adobe Acrobat tries to convert it to text, but when it is opened in a browser you can’t even copy any content from the mentioned parts.
Nevertheless, I’ll add a task for the development team to investigate whether there are any ways to extract text in such cases.

ilyazhuykov · June 21, 2024, 7:25am

@forasposeissues
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-57487

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

forasposeissues · August 13, 2024, 6:28am

Hi Team,

Any update on this issue?

ilyazhuykov · August 13, 2024, 7:26am

@forasposeissues
I checked the issue; it seems that there haven’t been any updates on it.

forasposeissues · September 11, 2024, 11:12am

Hi Team,
Do we have any update on the issue?

ilyazhuykov · September 11, 2024, 11:48am

@forasposeissues
Currently the task has no updates.
I’ll try to raise its priority; hopefully that will help to resolve it faster.

forasposeissues · October 28, 2024, 2:38pm

Team, any update on the issue?

ilyazhuykov · October 28, 2024, 3:44pm

@forasposeissues
The issue is still open. Let me try to ask the development team directly tomorrow whether it’s possible to check it shortly.

ilyazhuykov · October 29, 2024, 6:35pm

@forasposeissues
In short, there is currently no integrated OCR in Aspose.PDF, so such images aren’t treated as text, in line with the PDF documentation.
So in order to resolve your issue, you’ll need to process the mentioned document with a product that supports OCR to convert the images to text.

forasposeissues · November 20, 2024, 4:52pm

Could you please share a code snippet (sample) showing how OCR can be used to extract text from an image?

ilyazhuykov · November 20, 2024, 5:40pm

@forasposeissues
It depends on what tools you’re going to use.
As an example, here’s the solution suggested for the IronOcr .NET library:

```
using IronOcr;

var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadImage("attachment.png");
input.LoadPdf("report.pdf");

OcrResult result = ocr.Read(input);
string text = result.Text;
```

forasposeissues · November 20, 2024, 5:42pm

Is there a solution provided by Aspose? If yes, please share the details for Java.

Thank you

ilyazhuykov · November 20, 2024, 5:49pm

@forasposeissues
Currently I don’t think there is one.
When I asked whether there is any work in this direction, I was told that there isn’t,
so I’m afraid you’ll have to find some other means to convert the image to text.
I’ll ask again, maybe something has changed recently, but I don’t have high hopes for that.

UPD
The following solution was suggested to me.

Basically it’s an instruction for working with Tesseract, but at least it doesn’t require external libraries.

```
// Callback that runs the Tesseract CLI on an image and returns the hOCR markup.
static string CallBackGetHocr(System.Drawing.Image img)
{
    string tmpFile = System.IO.Path.GetTempFileName();
    try
    {
        // Save the image to a temporary bitmap file that Tesseract can read.
        using (System.Drawing.Bitmap bmp = new System.Drawing.Bitmap(img))
        {
            bmp.Save(tmpFile, System.Drawing.Imaging.ImageFormat.Bmp);
        }

        // Tesseract appends ".hocr" to the output base name,
        // so the same path serves as both input file and output base.
        string inputFile = string.Concat('"', tmpFile, '"');
        string outputFile = string.Concat('"', tmpFile, '"');
        string arguments = string.Concat(inputFile, " ", outputFile, " -l eng hocr");
        string tesseractProcessName = @"C:\Program Files\Tesseract-OCR\Tesseract.exe";

        System.Diagnostics.ProcessStartInfo psi =
            new System.Diagnostics.ProcessStartInfo(tesseractProcessName, arguments)
            {
                UseShellExecute = true,
                CreateNoWindow = true,
                WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden,
                WorkingDirectory = System.IO.Path.GetDirectoryName(tesseractProcessName)
            };

        // Run Tesseract and wait until it has written the .hocr file.
        using (System.Diagnostics.Process p = new System.Diagnostics.Process { StartInfo = psi })
        {
            p.Start();
            p.WaitForExit();
        }

        return System.IO.File.ReadAllText(tmpFile + ".hocr");
    }
    finally
    {
        // Clean up both the temporary image and the generated hOCR file.
        if (System.IO.File.Exists(tmpFile))
            System.IO.File.Delete(tmpFile);
        if (System.IO.File.Exists(tmpFile + ".hocr"))
            System.IO.File.Delete(tmpFile + ".hocr");
    }
}
```
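
Since the question was about Java, the same idea (shelling out to the Tesseract command-line tool) could look roughly like the sketch below. It is only an illustration, not an Aspose API: the class name `TesseractCli` and its method names are made up, it assumes Tesseract is installed separately and that the page or image has already been saved to an image file, and it asks for plain text on standard output (`stdout` as the output base) instead of hOCR.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that shells out to the Tesseract command-line tool. */
final class TesseractCli {

    /** Builds the command line: tesseract <image> stdout -l <language> */
    static List<String> buildCommand(String tesseractExe, Path image, String language) {
        return List.of(tesseractExe, image.toString(), "stdout", "-l", language);
    }

    /** Runs Tesseract on one image file and returns the recognized plain text. */
    static String recognize(String tesseractExe, Path image, String language)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(tesseractExe, image, language));
        // Discard diagnostics so a full stderr pipe cannot block the child process.
        builder.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();
        // Read all of stdout before waiting; waiting first could deadlock on large output.
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

A call such as `TesseractCli.recognize("tesseract", Path.of("page1.png"), "eng")` (the file name is a placeholder) would return the recognized text, which could then be appended to whatever `TextAbsorber.getText()` returned for the same document.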

forasposeissues · December 6, 2024, 10:15am

Hi team,

Can you please ask your team whether we can use Aspose to convert an image to text? We are using the Aspose library for most of our file operations.

We appreciate your help.

ilyazhuykov · December 6, 2024, 10:40am

@forasposeissues
Based on the discussions, it doesn’t seem that it will be implemented soon, and if I add a feature task it may take a very long time to be implemented, so I would recommend using the solution I suggested.

forasposeissues · December 6, 2024, 11:07am

We cannot change the existing implementation. However, we can tweak the sample code to accept both image and text if Aspose provides a solution.
This issue has low priority for us. We can wait for the solution, as we don’t want to use an external library.

TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();

Please open a ticket for this.

ilyazhuykov · December 6, 2024, 11:16am

@forasposeissues
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58807

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.