Searchable PDF

adoelen · July 11, 2011, 4:07pm

I'm not sure if it's possible with the current offering, but a killer feature would be the ability to OCR images and create PDF files from those images containing both the image and “selectable” text. The OCR'ed text is not visible, but you can search for it in the PDF or copy and paste it when you open the file, for example in Adobe Reader. Other products offer this, and we currently have to resort to them for it. Will this be possible?

muhammad.ijaz · July 13, 2011, 8:12am

Hi Andre,

This feature is not available at the moment, but a request in this regard has been logged in our issue tracking system as OCR-29119. We will keep you updated on this issue in this thread.

In the meantime, you can combine Aspose.OCR for .NET and Aspose.PDF for .NET for this purpose. Aspose.OCR for .NET can be used to extract the text, and Aspose.PDF for .NET can be used to insert the image and the extracted text (as selectable text) into the output PDF.

Please feel free to contact us in case you have further comments or questions.

Best Regards,

balazsdkw · July 22, 2011, 1:40pm

Could you please let me know how to accomplish extracting text using OCR and then using Aspose.PDF to create the PDF that includes the extracted text?

muhammad.ijaz · July 22, 2011, 4:03pm

Hi Balazs,

Please check this blog post https://blog.aspose.com/2011/07/20/extract-text-from-pdf-including-images-combine-aspose.pdf-and-aspose.ocr for more information on how to extract text and images from a PDF and extract text from images.

Check this topic http://www.aspose.com/documentation/.net-components/aspose.pdf-for-.net/create-a-hello-world-pdf-document-through-api.html for more information on how to generate a PDF and http://www.aspose.com/documentation/.net-components/aspose.pdf-for-.net/add-text-in-an-existing-pdf-file.html for more information on how to add text in a PDF.

Please feel free to contact us in case you have further comments or questions.

Best Regards,

Xtreme1 · April 24, 2012, 11:45am

Can you please provide examples of how to convert TIFF images to searchable PDFs?

I am looking to purchase the product but want to confirm this can be accomplished and test the performance.

Thank you,

Rob

muhammad.ijaz · April 25, 2012, 5:25am

Hi Rob,

You can use Aspose.OCR for .NET to extract text from BMP or TIFF images. Please check the “Recognition” and “Recognition settings” documentation topics for more details.

After extracting the text, you can use Aspose.PDF for .NET to create a new PDF file and add the extracted text to it. Please check the “Example of Hello World using C# language” and “Add Text to PDF using C#” topics in the Aspose.PDF for .NET documentation for more details, and feel free to contact us in case you have further comments or questions.

Best Regards,

goossen.de.bruin.sol · August 29, 2013, 8:15am

Hi Muhammad,

I am also looking for a way to convert an image (gif/bmp/tiff) to a searchable PDF.

I saw your example. If I understand it right there are two steps:

1. Extract the text from the image using Aspose.OCR

2. Add the image and the text to a PDF using Aspose.PDF

The problem, however, is that the text returned by the OCR step comes without position information, so I do not know where on the page to place it in the PDF.

Am I correct or am I missing something?

tilal.ahmad · August 30, 2013, 6:36am

Hi Goossen,

Thanks for your inquiry. I'm afraid Aspose.OCR is not quite mature yet. We are facing some issues with text recognition accuracy and with the coordinates of the recognized text. Our development team is working hard to fix these issues and is investigating some new algorithms for the purpose.

As a workaround, you can create a searchable PDF document from an image using Aspose.Pdf together with another OCR application that supports the hOCR standard. You can use the free Google Tesseract OCR. First, convert your image to PDF by following this documentation link; then convert it to a searchable PDF document as described below.

Please install Tesseract OCR on your computer from the tesseract-ocr project on GitHub. After that, you will have the tesseract.exe console application.

Below is a usage example:

[C#]

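using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

public class SearchablePdfExample
{
    // Working directory for the temporary image and the hOCR output.
    private const string WorkDir = @"c:\PdfTest";

    // Callback passed to Document.Convert; returns the hOCR markup for the image.
    private static string CallBackGetHocr(System.Drawing.Image img)
    {
        string imagePath = Path.Combine(WorkDir, "test.jpg");
        string outputBase = Path.Combine(WorkDir, "out");
        img.Save(imagePath);

        // Run "tesseract <image> <output base> hocr"; tesseract must be on the PATH.
        ProcessStartInfo info = new ProcessStartInfo("tesseract");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
        }

        // Older tesseract builds write out.html; newer ones write out.hocr.
        string hocrPath = File.Exists(outputBase + ".hocr") ? outputBase + ".hocr" : outputBase + ".html";
        return File.ReadAllText(hocrPath);
    }

    public static void Main()
    {
        Document doc = new Document("Input.pdf");
        doc.Convert(CallBackGetHocr);
        doc.Save("output.pdf");
    }
}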
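
Regarding the position question raised earlier: the hOCR markup returned by the callback is where the coordinates live. Each recognized word is wrapped in a span whose title attribute carries a "bbox x0 y0 x1 y1" box in image pixels. The following is only a rough sketch of how those boxes could be read and mapped to PDF points. The names HocrWord, HocrParser, ParseWords and ToPdfPoints are hypothetical helpers made up for illustration and are not part of Aspose.Pdf or Tesseract. The sketch also assumes that the class attribute precedes the title attribute in each word span and that the PDF page has exactly the size of the image at the given DPI.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper type for illustration; not part of Aspose.Pdf or Tesseract.
public class HocrWord
{
    public string Text;
    // Bounding box in image pixels, origin at the top-left corner.
    public int Left, Top, Right, Bottom;
}

public static class HocrParser
{
    // Assumes each word looks like:
    //   <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
    // with the class attribute appearing before the title attribute.
    private static readonly Regex WordPattern = new Regex(
        @"<span[^>]*class=['""]ocrx_word['""][^>]*title=['""]bbox (\d+) (\d+) (\d+) (\d+)[^'""]*['""][^>]*>(.*?)</span>",
        RegexOptions.Singleline);

    public static List<HocrWord> ParseWords(string hocr)
    {
        var words = new List<HocrWord>();
        foreach (Match m in WordPattern.Matches(hocr))
        {
            // Drop inline tags such as <strong> or <em>, then decode HTML entities.
            string text = WebUtility.HtmlDecode(
                Regex.Replace(m.Groups[5].Value, "<[^>]+>", "")).Trim();
            if (text.Length == 0) continue;

            words.Add(new HocrWord
            {
                Text = text,
                Left = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture),
                Top = int.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture),
                Right = int.Parse(m.Groups[3].Value, CultureInfo.InvariantCulture),
                Bottom = int.Parse(m.Groups[4].Value, CultureInfo.InvariantCulture)
            });
        }
        return words;
    }

    // Converts a pixel box to PDF points (1/72 inch, origin at the bottom-left),
    // assuming the page has exactly the size of the image at the given DPI.
    // Returns { lowerLeftX, lowerLeftY, upperRightX, upperRightY }.
    public static double[] ToPdfPoints(HocrWord w, int imageHeightPx, double dpi)
    {
        double scale = 72.0 / dpi;
        return new[]
        {
            w.Left * scale,
            (imageHeightPx - w.Bottom) * scale,
            w.Right * scale,
            (imageHeightPx - w.Top) * scale
        };
    }
}
```

The vertical flip in ToPdfPoints is needed because hOCR measures from the top of the image while PDF coordinates start at the bottom of the page. A regular expression is enough for a quick experiment; for real input a proper HTML parser would be more robust.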

Please feel free to contact us for any further assistance.

Best Regards,

pradeepdone · February 2, 2018, 6:31am

Hi,

Just checking: has the searchable PDF functionality been added in the latest version of Aspose?

ikram.haq · February 2, 2018, 7:56am

@pradeepdone,

Thank you for writing to us. The feature you are looking for is not available yet. We are sorry for the inconvenience.

awais.hafeez · March 29, 2018, 5:23am

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.