A pdf with scanned and searchable pages

ChrisWongASL · August 19, 2022, 8:36am

We are using the RecognizePdf method to recognize text, followed by the AsposeOcr.SaveMultipageDocument method to save the result. It works if the PDF only has scanned pages. However, if the PDF has both scanned and searchable pages, the searchable pages are not recognized and not saved.
A test pdf:
Testing OCR.pdf (61.5 KB)
A result pdf:
converted-Testing OCR.pdf (17.6 KB)

Sample code:
AsposeOcr asposeOcr = new AsposeOcr();

var documentRecognitionSettings = new DocumentRecognitionSettings
{
    StartPage = 0,
    PagesNumber = 2,
    DetectAreas = true,
    AutoDenoising = true,
    DetectAreasMode = DetectAreasMode.COMBINE
};

// Recognize images from PDF
List<RecognitionResult> res = asposeOcr.RecognizePdf("Testing OCR.pdf", documentRecognitionSettings);

// Save the recognition results as a PDF
AsposeOcr.SaveMultipageDocument("converted-Testing OCR.pdf", Aspose.OCR.SaveFormat.Pdf, res);

asad.ali · August 19, 2022, 6:33pm

@ChrisWongASL

At the moment, Aspose.OCR can only extract text from images (scanned PDF pages). To get the searchable text from a PDF, you must extract it using Aspose.PDF or another library. We are afraid that this is not supported in the OCR API.

rogerg · August 22, 2022, 10:45pm

Hi, @ChrisWongASL

Could you please explain what exactly you are trying to accomplish? For instance, do you want to parse PDF files with hybrid content (some pages with PDF text, some with scanned images, some with text layers as in a searchable PDF, or a mix of these) and get all the available text from these pages in order?

Depending on the scenario, I think it would be necessary to use both Aspose.OCR and Aspose.PDF, as @asad.ali mentioned.

Thanks

emoore2000 · October 20, 2022, 2:25am

Hello, I’m new to PDFs, so I may use the wrong terms. We receive PDFs that are sent via email by multi-function scanners. Sometimes the source material is printed text, other times it may be handwritten. We currently use Acrobat and AutoBatch to try to OCR these PDF documents. We just want them to be searchable once they are in our system; we are not extracting data. These documents are typically 20 to 100 pages.
I started with your sample ‘RecognizeAndSaveSearchablePdf’. I then learned I had to adjust StartPage and PagesNumber and work in batches of ten pages, so I added some looping logic. I am able to process a 24-page PDF, and I get the text output after saving the file. However, the PDF is now horribly corrupted. I’m not sure, but I suspect the cause is the PDF and its mixture of content.
I’m also thinking I need a way to tell whether the PDF is all scanned pages, and maybe use the process image method instead?

Thank you for your help

asad.ali · October 20, 2022, 3:06pm

@emoore2000

In this case, you need to process the PDF first using the Aspose.PDF API to check whether its pages contain images or text. Text in images can be extracted using Aspose.OCR, and text stored directly in the PDF can be extracted using Aspose.PDF. For example:

Check presence text and images
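
The linked example is not reproduced in this thread. As a rough sketch of the idea (not the code from the link), the check could look something like the following, assuming the Aspose.PDF for .NET package is referenced. The class name and the "input.pdf" default path are hypothetical placeholders. Note that a scanned page with a hidden text layer would report both text and images, so this only separates the clear-cut cases.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class PageContentCheck
{
    static void Main(string[] args)
    {
        // Hypothetical input path; pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        using (Document pdf = new Document(path))
        {
            foreach (Page page in pdf.Pages)
            {
                // Collect any text stored in the page content.
                TextAbsorber textAbsorber = new TextAbsorber();
                page.Accept(textAbsorber);
                bool hasText = !string.IsNullOrWhiteSpace(textAbsorber.Text);

                // Check whether the page resources reference any images.
                bool hasImages = page.Resources.Images.Count > 0;

                Console.WriteLine("Page {0}: text={1}, images={2}",
                    page.Number, hasText, hasImages);
            }
        }
    }
}
```

Pages that report images but no text are the likely candidates for Aspose.OCR; pages that report text only probably do not need OCR at all.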

emoore2000 · October 25, 2022, 10:01pm

I think I have my code correct, but I always end up with a corrupted PDF. All of my PDFs contain both text and images. I just want the final PDF to be searchable; I do not need to look at the contents. I am not sure my code is writing out the file correctly. I create a recognition result list and add image scans to it inside a ‘HasNextImage()’ loop. Then I count the pages and add more text scan results before calling ‘SaveMultipageDocument’. Is that not correct? Here is some pseudocode; my sample is over three hundred lines of code.
Thank you for your response.

// Namespaces used: System, System.Collections.Generic, System.IO,
// Aspose.OCR, Aspose.Pdf, Aspose.Pdf.Facades

// Save all of our text recognition results in a list
List<Aspose.OCR.RecognitionResult> ocrResults = new List<Aspose.OCR.RecognitionResult>();
Aspose.OCR.AsposeOcr api = new Aspose.OCR.AsposeOcr();
Aspose.OCR.RecognitionSettings imageSettings = new Aspose.OCR.RecognitionSettings();

// Open input PDF and extract the embedded images
PdfExtractor extractor = new PdfExtractor();
extractor.ExtractImageMode = ExtractImageMode.DefinedInResources;
extractor.BindPdf(filepath);
extractor.ExtractImage();

// OCR every extracted image
while (extractor.HasNextImage())
{
	Program.imgCount++;
	using (MemoryStream imageStream = new MemoryStream())
	{
		extractor.GetNextImage(imageStream);
		imageStream.Position = 0; // rewind before the stream is read
		var tempres = api.RecognizeImage(imageStream, imageSettings);
		if (tempres != null)
		{
			ocrResults.Add(tempres);
		}
	}
}

// Start of text scan: get the page count
int pagecount;
using (Document document = new Document(filepath))
{
	pagecount = document.Pages.Count;
}

// We have to grind through in blocks of ten pages.
// Note: these results go into the same list as the image results above.
Aspose.OCR.DocumentRecognitionSettings settings = new Aspose.OCR.DocumentRecognitionSettings();
for (int pstart = 0; pstart < pagecount; pstart += 10)
{
	settings.StartPage = pstart;
	settings.PagesNumber = Math.Min(10, pagecount - pstart);
	Console.WriteLine("OCR[4]... start:{0} count:{1} at {2}", settings.StartPage, settings.PagesNumber, DateTime.Now.ToString("h:mm:ss"));
	ocrResults.AddRange(api.RecognizePdf(filepath, settings));
}

Aspose.OCR.AsposeOcr.SaveMultipageDocument(filepath + "_OCRd.pdf", Aspose.OCR.SaveFormat.Pdf, ocrResults);
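
The ten-page batching in the snippet above is easy to get wrong at the edges (first block, last partial block). A minimal sketch of the same arithmetic, separated from any Aspose calls so it can be checked on its own, might look like this. `PageBatches` and `Split` are hypothetical names introduced here for illustration, not part of any Aspose library.

```csharp
using System;
using System.Collections.Generic;

static class PageBatches
{
    // Yields (start, count) pairs that cover pages [0, pageCount)
    // in blocks of at most batchSize pages.
    public static IEnumerable<(int Start, int Count)> Split(int pageCount, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int start = 0; start < pageCount; start += batchSize)
        {
            // The last block may be shorter than batchSize.
            yield return (start, Math.Min(batchSize, pageCount - start));
        }
    }
}

class BatchDemo
{
    static void Main()
    {
        // A 24-page document in blocks of ten.
        foreach (var (start, count) in PageBatches.Split(24, 10))
        {
            Console.WriteLine($"start={start} count={count}");
        }
        // Prints:
        // start=0 count=10
        // start=10 count=10
        // start=20 count=4
    }
}
```

Each pair could then be assigned to `settings.StartPage` and `settings.PagesNumber` before a `RecognizePdf` call, as in the loop above.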

asad.ali · October 26, 2022, 5:46am

@emoore2000

Can you please share the sample PDF for our reference so that we can test the scenario in our environment and address it accordingly?

emoore2000 · October 26, 2022, 3:16pm

Is there a way I can send you the pdf privately?
Thank you for your help.

emoore2000 · October 26, 2022, 4:30pm

Here is a three-page PDF; the first two pages are image scans, but the third is a text page.
Thank you.
376150_Designation.pdf (80.8 KB)

asad.ali · October 26, 2022, 8:22pm

@emoore2000

We have created an investigation task, OCRNET-599, in our issue tracking system to analyze your case. We will look into the details of the logged ticket and let you know as soon as it is resolved. Please be patient and give us some time.

We are sorry for the inconvenience.

emoore2000 · November 3, 2022, 2:11pm

Good morning, is there any movement on this? We’ve found another pdf library that made that same pdf document (and most others) searchable. I will need to make a decision on which library to use.
Thank you

asad.ali · November 4, 2022, 6:28am

@emoore2000

The best way to get a correct PDF is to convert the PDF pages into images and recognize them, because Aspose.OCR can’t extract or draw text on PDF files with combined content. Our library works only with scanned PDFs that consist of images. Using Aspose.PDF, however, you can convert the PDF into images, recognize the images, and save the results into a searchable PDF:

// Namespaces used: System.Collections.Generic, System.IO, Aspose.OCR,
// Aspose.Pdf, Aspose.Pdf.Devices

string pdfPath = testDataDir + "/Issues/376150_Designation.pdf";

List<Aspose.OCR.RecognitionResult> ocrResults = new List<Aspose.OCR.RecognitionResult>();
AsposeOcr api = new AsposeOcr();

// Render each page as a PNG at 300 DPI
Resolution resolution = new Resolution(300);
PngDevice imageDevice = new PngDevice(resolution);
Document pdfDocument = new Document(pdfPath);

// Aspose.PDF page indexes start at 1
for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
{
    using (MemoryStream ms = new MemoryStream())
    {
        // Convert a particular page and save the image to the stream
        imageDevice.Process(pdfDocument.Pages[pageNumber], ms);

        // Alternatively, use the default settings if they work better for you:
        // var recognResult = api.RecognizeImage(ms, new RecognitionSettings());
        var recognResult = api.RecognizeImage(ms, new RecognitionSettings { DetectAreasMode = DetectAreasMode.COMBINE });
        ocrResults.Add(recognResult);
    }
}

// Save all recognized pages as one PDF
Aspose.OCR.AsposeOcr.SaveMultipageDocument("_OCRd.pdf", Aspose.OCR.SaveFormat.Pdf, ocrResults);

_OCRd.pdf (1.6 MB)

emoore2000 · November 9, 2022, 8:16am

Good morning,
I am not the topic owner and cannot download the file you attached. Can you make it available to me? Thank you.

emoore2000 · November 9, 2022, 4:41pm

I also wanted to mention that when I OCR that pdf the size goes from 89KB to 1474KB. Your example seems to have gone from 80.8KB to 1.6MB. Are there some settings that are missing to prevent this?
Thank you

asad.ali · November 9, 2022, 6:39pm

@emoore2000

You can download the file from this link. The PDF document now contains the page images as well as a hidden layer of text. The PDF preserves both types of resources, which increases its size. You can reduce the size using Aspose.PDF.
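
The reply does not say which Aspose.PDF feature to use. One possible approach, sketched here under the assumption that the Aspose.PDF for .NET package is referenced, is resource optimization with image compression. The output file name and the image quality value of 60 are arbitrary choices for illustration and would need tuning against your own documents.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Optimization;

class ShrinkSearchablePdf
{
    static void Main()
    {
        // "_OCRd.pdf" is the file written by SaveMultipageDocument above;
        // "_OCRd_small.pdf" is a hypothetical output name.
        using (Document pdf = new Document("_OCRd.pdf"))
        {
            OptimizationOptions options = new OptimizationOptions();
            options.RemoveUnusedObjects = true;
            options.RemoveUnusedStreams = true;

            // Recompress the embedded page images.
            // 60 is an assumed quality value, not a recommendation.
            options.ImageCompressionOptions.CompressImages = true;
            options.ImageCompressionOptions.ImageQuality = 60;

            pdf.OptimizeResources(options);
            pdf.Save("_OCRd_small.pdf");
        }
    }
}
```

Rendering the pages at a lower resolution than 300 DPI before recognition would probably also produce a smaller file, though likely at some cost to recognition accuracy.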