Searchable PDF conversion using ASPOSE

venkataravi · July 1, 2014, 1:50am

Hi Team,

We purchased a licence for the ASPOSE Java API.

Using the ASPOSE API, we are unable to convert a non-searchable PDF into a searchable PDF.

After googling, we learned that ASPOSE does not support converting a non-searchable PDF into a searchable PDF.

Please refer to the link.

The above link mentions that we can convert the non-searchable PDF into an image and then the image into a searchable PDF. We are able to convert the non-searchable PDF into an image, but while converting that image into a searchable PDF we get the following error:

"data.xml file is not available in resources".

venkataravi · July 1, 2014, 9:07pm

Hi, we purchased a licence for the ASPOSE Java API. We are facing issues while converting a non-searchable PDF into a searchable PDF.

tilal.ahmad · July 2, 2014, 12:43am

Hi venkataravi,

Thanks for your inquiry. We would appreciate it if you could share your source code and input/output files so that we can look into the issue and provide more information accordingly.

We are sorry for the inconvenience caused.

Best Regards,

venkataravikumar · July 2, 2014, 8:57am

Hi Tilal Ahmad,

Thank you for responding.

We did not find any API to convert a non-searchable PDF into a searchable PDF.

We chose an alternative: converting the non-searchable PDF to an image and then the image to a searchable PDF.

In this process we are able to convert the non-searchable PDF into an image, but we face an issue while converting the image into a searchable PDF.

Please find my source code and input files.

Please suggest a way to complete this requirement.

codewarior · July 3, 2014, 4:59am

Hi Venkataravi,

Thanks for contacting support.

As per my understanding, you are able to convert the non-searchable PDF file to image format using Aspose.Pdf for Java but are facing an issue while creating a searchable PDF from the image file. Can you please share whether the problem occurs while performing OCR on the extracted image or while converting the image files to PDF format (which I think won't help, because converting images back to PDF will produce a non-searchable PDF file)?

Besides this, when using Aspose.Pdf for .NET, you can use Aspose.Pdf together with another OCR application that supports the hOCR standard. The free Google Tesseract OCR can be used for this, so a non-searchable PDF can be converted to a searchable PDF document as described below. You can install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list, after which you will have the tesseract.exe console application.

Below is a usage example:

[C#]

using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

public void Main()
{
    Document doc = new Document("Input.pdf");
    // Each page image is passed to the callback, which returns its hOCR markup.
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}

private string CallBackGetHocr(System.Drawing.Image img)
{
    // Working folder; the trailing backslash is needed because file names are appended directly.
    string dir = @"c:\PdfTest\";
    // Save the page image so that Tesseract can read it from disk.
    img.Save(dir + "test.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Ask Tesseract for hOCR output: tesseract <image> <outputBase> hocr
    info.Arguments = dir + "test.jpg " + dir + "out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Tesseract 3.02 writes out.html; later versions may write out.hocr instead.
    StreamReader streamReader = new StreamReader(dir + "out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

However, to use this solution in Java, you need to try jtesseract. For more information, you may consider looking at the following links.
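For illustration, the three steps that the C# callback performs (save the page image, run `tesseract <image> <outputBase> hocr`, read back the generated hOCR file) can also be sketched in plain Java with only the JDK. This is a minimal sketch under assumptions, not an Aspose or jtesseract API: it assumes the `tesseract` console application is on the PATH, the class and method names (`TesseractHocr`, `buildCommand`, `recognize`) are hypothetical, and it does not cover how the returned hOCR string is handed to Aspose.Pdf for Java.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: runs the tesseract CLI on one image and returns the hOCR markup. */
final class TesseractHocr {

    /** Builds the command line: tesseract <image> <outputBase> hocr */
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "hocr");
    }

    /** Runs tesseract (assumed to be on the PATH) and reads the generated hOCR file. */
    static String recognize(Path image, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO() // pass tesseract's console output through so it cannot block on a full pipe
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        // Assumption: Tesseract 3.02 names the file <outputBase>.html, later releases <outputBase>.hocr.
        Path hocrFile = Paths.get(outputBase.toString() + ".hocr");
        if (!Files.exists(hocrFile)) {
            hocrFile = Paths.get(outputBase.toString() + ".html");
        }
        return new String(Files.readAllBytes(hocrFile), StandardCharsets.UTF_8);
    }
}
```

For example, `TesseractHocr.recognize(Paths.get("c:/pdftest/test.jpg"), Paths.get("c:/pdftest/out"))` would return the hOCR markup for test.jpg, provided Tesseract is installed. Calling the command-line tool keeps the sketch free of extra libraries; a wrapper such as jtesseract would replace the `ProcessBuilder` part.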

Besides this, we have an API named Aspose.OCR for Java, which performs OCR on images on the Java platform. As per your requirement, you can first convert the pages of the PDF file to image format using the instructions in Convert PDF pages to JPEG Image. Once the image files are generated, you can perform OCR using Aspose.OCR for Java. For further details, please visit Performing OCR on an Image. However, with this approach the output is saved as a plain text file.

I am afraid the current release of Aspose.OCR for Java does not support performing OCR on an image and saving the output in HTML format, but I have asked a colleague to look further into this requirement. You will be updated soon with the required information. We are sorry for this inconvenience.

venkataravi · July 11, 2014, 8:00am

Hi Nayyer Shahbaz,

We have a non-searchable PDF that contains scanned images, and we need to convert it to a searchable PDF.

We are unable to achieve this using the ASPOSE Java API.

We are facing an issue with OCR on the extracted image. We are able to get only the content and fail to get the images.

Please suggest how to complete this requirement. It is very important and urgent.

codewarior · July 13, 2014, 2:18pm

venkataravi:

We are facing an issue with OCR on the extracted image.

Hi Venkataravi,

Thanks for sharing the details.

Can you please share how you are performing OCR on the extracted images, i.e. which API you are using?

venkataravi:

We are able to get only the content and fail to get the images.

Please share whether you are facing an issue while extracting images from the PDF file or while performing OCR on the extracted images (as per your statement above). Please share the details so that we can answer accordingly.

atulkalohatechnology · August 6, 2014, 10:42am

Hi Team,

I used the above code to create a searchable PDF, but it is not working for me: the converted PDF does not allow searching the text. Can I get proper C# code that supports searchable PDF functionality?

Thanks

Atul kadam

codewarior · August 7, 2014, 1:55pm

Hi Atul,

Thanks for contacting support and sorry for the delayed response.

Can you please share the source TIFF image so that we can test the scenario on our end? We are sorry for this inconvenience.
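In the meantime, one thing that may be worth checking is whether the hOCR markup returned by the callback contains any recognized words at all, since an empty OCR result would leave the PDF without searchable text. The sketch below is only a diagnostic idea, not part of any Aspose API: it assumes that Tesseract marks each recognized word with the hOCR class `ocrx_word`, and the class name `HocrWordCount` is hypothetical. It is written in Java for illustration; the same check applies to the string returned by the C# callback.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical diagnostic: counts the recognized words in hOCR markup. */
final class HocrWordCount {

    // Assumption: each recognized word is a span with class "ocrx_word" (single or double quoted).
    private static final Pattern WORD = Pattern.compile("class\\s*=\\s*['\"]ocrx_word['\"]");

    /** Returns how many word elements the hOCR markup contains. */
    static int countWords(String hocr) {
        Matcher matcher = WORD.matcher(hocr);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sample =
                "<span class='ocrx_word' id='word_1_1' title='bbox 10 10 60 30'>Hello</span>"
              + "<span class='ocrx_word' id='word_1_2' title='bbox 70 10 130 30'>World</span>";
        System.out.println(countWords(sample)); // prints 2
    }
}
```

If the count is zero for a page that clearly contains text, the problem is more likely in the OCR step (image quality, Tesseract installation, output file name) than in the PDF conversion itself.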