Determine if a pdf is text-searchable

dchillman · November 14, 2008, 3:51pm

Is there something in the pdf.kit which can be used to determine if a pdf file is "text-searchable"? I've glanced through the documentation and nothing jumped out, but it never hurts to ask the developer if I'm missing something. FYI, this is of interest to me because I have a process to OCR pdf files. However, if the pdf is already text-searchable, I don't need to process it.

thanks

Dan

codewarior · November 14, 2008, 7:43pm

Hello Dan,

I am sorry to inform you that the feature to determine if a PDF file contains text is not yet supported.

As a workaround, you can extract the text from the PDF file and check whether the result contains any text. To make the extraction faster, you can set StartPage and EndPage so that only the first page of the PDF file is processed.

For information on how to extract text, please visit http://www.aspose.com/documentation/file-format-components/aspose.pdf.kit-for-.net-and-java/extract-text-from-pdf-document.html.

Also, please visit PdfExtractor Members.

dchillman · November 15, 2008, 9:08am

That is an interesting work-around. After extracting the text, I was going to use the GetWordCount to determine if any text was found. In the API documentation, it implies that GetWordCount is obsolete. Can you refer me to the current method I should use? Also, right now I am using the evaluation version of the pdf kit, and it adds the evaluation text to the pdf before it counts the words, so even if the pdf has only an image, the word count will be > 0. I assume once I have a license that the word count would be 0, but I'd like to know for sure. The attached pdf is an image and contains no searchable text. Can you do an extracttext on it and verify that the word count is 0?

thanks

Dan

forever · November 15, 2008, 8:42pm

Dear Dan,

It is quite difficult to count the words in a string for all languages, and it is outside the scope of our product, so we have marked the GetWordCount method as obsolete. You will have to process the extracted string yourself to count the words in it. If you purchase the product, the evaluation text will disappear.
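
For reference, a minimal sketch of counting words yourself, assuming the extracted text is already available as a string. TextCheck and CountWords are hypothetical names for illustration and are not part of Aspose.Pdf.Kit. Splitting on white space is only a rough measure and will not work for languages that do not separate words with spaces.

```csharp
using System;

public static class TextCheck
{
    // Hypothetical helper (not part of Aspose.Pdf.Kit):
    // counts white-space-separated tokens in the given text.
    public static int CountWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return 0;

        // A null separator array makes Split treat any white-space
        // character as a delimiter; empty entries are discarded.
        string[] tokens = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length;
    }
}
```

For example, TextCheck.CountWords("  Hello   PDF world ") returns 3, and a string that is empty or contains only white space returns 0.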

codewarior · November 16, 2008, 10:16am

Hello Dan,

You can try the following code snippet to check whether the PDF file contains text or is an image-only PDF.

[C#]

// Requires: using System.IO; using System.Windows.Forms; using Aspose.Pdf.Kit;
// Instantiate a MemoryStream object to hold the text extracted from the document
MemoryStream ms = new MemoryStream();
// Instantiate the PdfExtractor object
PdfExtractor extractor = new PdfExtractor();
// Bind the input PDF document to the extractor
extractor.BindPdf(@"C:\pdftest\new_test_form_updated.pdf");
// Specify the start page of the PDF document
extractor.StartPage = 1;
// Specify the end page. Limiting extraction to a single page is faster than searching the whole document
extractor.EndPage = 1;
// Extract text from the input PDF document
extractor.ExtractText();
// Write the extracted text into the MemoryStream
extractor.GetText(ms);
// If the stream holds at most one byte, no text was extracted
if (ms.Length <= 1)
    MessageBox.Show("Pdf is an Image Pdf");
else
    MessageBox.Show("Pdf contains text");
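
The length check above treats any stream longer than one byte as text. If the extractor ever writes only a byte order mark or white space, a stricter check may be useful. The following is a sketch under assumptions: ExtractedTextCheck and ContainsVisibleText are hypothetical names, not part of Aspose.Pdf.Kit, and the encoding that GetText writes is not confirmed in this thread, so the caller passes a fallback encoding.

```csharp
using System.IO;
using System.Text;

public static class ExtractedTextCheck
{
    // Hypothetical helper (not part of Aspose.Pdf.Kit): returns true if the
    // stream contains at least one character that is not white space.
    public static bool ContainsVisibleText(Stream stream, Encoding fallbackEncoding)
    {
        // Rewind first: after writing, the stream position is at the end.
        stream.Position = 0;

        // 'true' lets StreamReader honor a byte order mark if one is present;
        // otherwise the fallback encoding (an assumption) is used.
        // The reader is deliberately not disposed so the caller's stream stays open.
        StreamReader reader = new StreamReader(stream, fallbackEncoding, true);

        int c;
        while ((c = reader.Read()) != -1)
        {
            if (!char.IsWhiteSpace((char)c))
                return true;
        }
        return false;
    }
}
```

After extractor.GetText(ms), a call such as ExtractedTextCheck.ContainsVisibleText(ms, Encoding.UTF8) could replace the ms.Length comparison. Encoding.UTF8 here is a guess and should be adjusted to whatever the extractor actually writes.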

Regarding the watermark issue, for the time being you can request a temporary license and use it to test our product. Please visit the following link for information on how to get a temporary license.

If you have any further questions, feel free to ask.

dchillman · November 18, 2008, 11:30am

thanks for the tip regarding the temp license, which I hadn't realized was available. Your snippet worked like a charm also.

Dan