We are trying to extract the text content of a PDF file, but Aspose fails to extract the full content.
Please find below the sample code we are using.
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.visit(pdfDocument);
String pdfText = textAbsorber.getText();
Please also find the PDF file and a screenshot in which we have highlighted the text that was not extracted by the Aspose utility.
We appreciate your help.
@forasposeissues
It seems that the text at the beginning isn’t text content but rather an image.
When the file is opened in Adobe Acrobat, Acrobat tries to convert it to text, but when it is opened in a browser you can’t even copy any content from the mentioned parts.
Nevertheless, I’ll add a task for the development team to investigate whether there are any ways to extract text in such cases.
@forasposeissues
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-57487
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
@forasposeissues
In short, there is currently no integrated OCR in Aspose.PDF, so such images aren’t treated as text, as described in the PDF documentation.
To resolve your issue, you’ll need to process the document with a product that supports OCR in order to convert the images to text.
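As a rough illustration of how an OCR step could be combined with the existing extraction, here is a minimal Java sketch. It assumes the text of each page has already been extracted (for example with `TextAbsorber`) and that some OCR engine is available behind a function. The names `OcrFallback`, `needsOcr`, `mergeWithOcr`, the `ocrEngine` parameter and the 20-character threshold are all hypothetical and not part of any Aspose API. Note that this only catches pages with almost no extractable text; a page that mixes real text with text stored as an image, like the one in this thread, would still need the image parts processed separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

class OcrFallback {
    // Assumed threshold: pages with fewer visible characters are treated as image-only.
    static final int MIN_CHARS = 20;

    // Returns true when the extracted text is missing or nearly empty.
    static boolean needsOcr(String extracted) {
        if (extracted == null) {
            return true;
        }
        long visible = extracted.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return visible < MIN_CHARS;
    }

    // Keeps the extracted text where it looks usable and asks the
    // (hypothetical) OCR engine for the page otherwise. Page index is 0-based.
    static List<String> mergeWithOcr(List<String> pageTexts, IntFunction<String> ocrEngine) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            result.add(needsOcr(text) ? ocrEngine.apply(i) : text);
        }
        return result;
    }
}
```

The OCR engine is passed in as a function so that the existing extraction code stays unchanged and any OCR product can be plugged in later.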
@forasposeissues
It depends on which tools you’re going to use.
As an example, here’s the solution suggested for the IronOcr .NET library:
```
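using IronOcr;

// Create the Tesseract-based OCR engine
var ocr = new IronTesseract();
using var input = new OcrInput();
// Add an image and/or a PDF to the OCR input
input.LoadImage("attachment.png");
input.LoadPdf("report.pdf");
// Run OCR and read the recognized text
OcrResult result = ocr.Read(input);
string text = result.Text;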
```
@forasposeissues
Currently I don’t think there is one.
When I asked whether there is any work in this direction, I was told that there isn’t,
so I’m afraid you’ll have to find some other means to convert the image to text.
I’ll ask again, maybe something has changed recently, but I don’t have high hopes for that.
UPD
The following solution was suggested to me.
Basically, it’s an instruction on how to work with Tesseract, but at least it doesn’t require external libraries.
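The suggested instruction itself is not reproduced here, so the following is only an illustrative sketch and not that solution: one way to call the Tesseract command-line tool from Java without adding a library dependency. It assumes the `tesseract` executable is installed and on the `PATH`, and that the page has already been rendered to an image file. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

class TesseractCli {
    // Builds the command line. Passing "stdout" as the output base makes
    // tesseract print the recognized text to standard output.
    static List<String> buildCommand(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }

    // Runs tesseract on the image and returns the recognized text.
    // Assumes the tesseract executable is available on the PATH.
    static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

The recognized text could then be combined with the output of `textAbsorber.getText()` for the parts of the document that are stored as images.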
@forasposeissues
Based on the discussions, it doesn’t seem that this will be implemented soon, and if I add a feature task it may take a very long time to be implemented, so I would recommend using the solution I suggested.
We cannot change the existing implementation. However, we can tweak the sample code to accept both image and text if Aspose provides a solution.
This issue has low priority for us. We can wait for the solution, as we don’t want to use an external library.
@forasposeissues
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-58807
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.