Searchable PDF conversion using Aspose

venkataravi · July 1, 2014, 1:50am

Hi Team,

We purchased a licence for the Aspose Java API. Using the API, we are unable to convert a non-searchable PDF into a searchable PDF. After searching online, we learned that Aspose does not support converting a non-searchable PDF into a searchable PDF directly. Please refer to the link.

The above link mentions that we can convert non-searchable PDF into an image and then convert the image to searchable PDF. We are able to convert non-searchable PDF into an image, but while converting that image into searchable PDF, we are facing the following error:

“data.xml file is not available in resources.”

venkataravi · July 1, 2014, 9:07pm

Hi, we purchased a licence for the Aspose Java API. We are facing issues while converting a non-searchable PDF into a searchable PDF.

tilal.ahmad · July 2, 2014, 12:43am

Hi venkataravi,

Thanks for your inquiry. We would appreciate it if you could share your source code and input/output files, so that we can look into the issue and provide more information accordingly.

We are sorry for the inconvenience caused.

Best Regards,

venkataravikumar · July 2, 2014, 8:57am

Hi Tilal Ahmad,

Thank you for responding.

We did not find any API to convert a non-searchable PDF into a searchable PDF.

As an alternative, we chose to convert the non-searchable PDF to an image and then the image to a searchable PDF.

In this process we are able to convert the non-searchable PDF into an image, but we face an issue while converting the image into a searchable PDF.

Please find my source code and input files.

Please suggest a way to meet this requirement.

codewarior · July 3, 2014, 4:59am

Hi Venkataravi,

Thanks for contacting support.

As per my understanding, you are able to convert a non-searchable PDF file to image format using Aspose.Pdf for Java but are facing an issue while creating a searchable PDF from the image file. Can you please share whether the problem occurs while performing OCR on the extracted image or while converting the image files to PDF format? (I think the latter won’t help, because converting images back to PDF will produce a non-searchable PDF file.)

Besides this, when using Aspose.Pdf for .NET, you can use Aspose.Pdf together with another OCR application that supports the hOCR standard, for example the free Google Tesseract OCR. In this way, a non-searchable PDF can be converted to a searchable PDF document. You can install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list. After installation, you will have the tesseract.exe console application.

Below is a usage example:

C#

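using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

public class SearchablePdfExample
{
    public static void Main()
    {
        // Load the scanned (non-searchable) PDF.
        Document doc = new Document("Input.pdf");
        // Convert passes each image to the callback and expects hOCR markup back.
        doc.Convert(CallBackGetHocr);
        doc.Save("output.pdf");
    }

    private static string CallBackGetHocr(System.Drawing.Image img)
    {
        string dir = @"c:\PdfTest\";
        string imagePath = Path.Combine(dir, "test.jpg");
        string outputBase = Path.Combine(dir, "out");
        img.Save(imagePath);

        // Run: tesseract <image> <outputBase> hocr
        ProcessStartInfo info = new ProcessStartInfo("tesseract");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
        }

        // This example expects Tesseract to write the hOCR result to out.html.
        return File.ReadAllText(outputBase + ".html");
    }
}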

However, to use this solution in Java, you can try jtesseract. For more information, you may consider looking over the following links:
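Independently of any wrapper library, the Tesseract step from the C# example can be sketched in plain Java by launching the tesseract console application. This is only a sketch under assumptions: the class and method names (TesseractHocr, buildCommand, findHocrFile, runOcr) are hypothetical, tesseract is assumed to be on the PATH, and the output extension (.hocr or .html) is assumed to depend on the installed Tesseract version. It produces the hOCR text only; it does not add a text layer to a PDF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any Aspose API.
class TesseractHocr {

    // Same command line as the C# example: tesseract <image> <outputBase> hocr
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "hocr");
    }

    // Assumption: the result is written next to outputBase as either
    // <outputBase>.hocr or <outputBase>.html, so both are checked.
    static Path findHocrFile(Path outputBase) throws IOException {
        for (String ext : new String[] {".hocr", ".html"}) {
            Path candidate = outputBase.resolveSibling(outputBase.getFileName() + ext);
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        throw new IOException("No hOCR output found for " + outputBase);
    }

    // Runs Tesseract on one image and returns the hOCR markup as a string.
    static String runOcr(Path image, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO() // show Tesseract's own messages on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        byte[] bytes = Files.readAllBytes(findHocrFile(outputBase));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

With Tesseract installed, a call such as TesseractHocr.runOcr(Paths.get("c:/pdftest/test.jpg"), Paths.get("c:/pdftest/out")) would return the hOCR markup for that page image.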

Besides this, we have an API named Aspose.OCR for Java, which performs OCR over images on the Java platform. So, as per your requirement, you can first convert the pages of the PDF file to image format using the instructions specified in Convert PDF pages to JPEG Image. Once the image files are generated, you can perform OCR using Aspose.OCR for Java. For further details, please visit Performing OCR on an Image. However, when using this approach, the output is saved in a plain text file.

I am afraid the current release of Aspose.OCR for Java does not support performing OCR on an image and saving the output in HTML format, but I have asked a colleague to look further into this requirement. We will update you with the required information soon. We are sorry for this inconvenience.
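To illustrate why HTML (hOCR) output matters here compared with plain text: hOCR carries the position of each recognized word, which is what allows text to be placed over the scanned image. Below is a minimal Java sketch, assuming Tesseract-style markup in which each word is a span with class ocrx_word whose title starts with "bbox x0 y0 x1 y1". The class HocrWords and its members are hypothetical names, and the regular expression is a simplification, not a full HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts words and bounding boxes from hOCR markup.
class HocrWords {

    // One recognized word with its bounding box in image pixels.
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }
    }

    // Assumes the class attribute precedes the title attribute, as in
    // <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
                    + "[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop inline tags such as <strong> or <em>; HTML entities are not decoded.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }
}
```

A plain text OCR result has no such coordinates, so it can be searched on its own but cannot be aligned with the page image.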

venkataravi · July 11, 2014, 8:00am

Hi Nayyer Shahbaz,

We have a non-searchable PDF that contains scanned images, and we need to convert it to a searchable PDF.

We are unable to achieve this using the Aspose Java API.

We are facing an issue with OCR on the extracted image. We are able to get only the content but fail to get the images.

Please suggest how we can meet this requirement. It is very important and urgent.

codewarior · July 13, 2014, 2:18pm

venkataravi:

We are facing an issue with OCR on the extracted image.

Hi Venkataravi

Thanks for sharing the details.

Can you please share how you are performing OCR over the extracted images, i.e. which API you are using?

venkataravi:

We are able to get only the content but fail to get the images.

Please clarify whether you are facing an issue while extracting images from the PDF file or while performing OCR over the extracted images (as per your statement above). Please share the details so that we can answer accordingly.

atulkalohatechnology · August 6, 2014, 10:42am

Hi Team,

I used the above code to create a searchable PDF, but it is not working for me: the text in the converted PDF cannot be searched. Can I get working C# code that produces a searchable PDF?

Thanks

Atul kadam

codewarior · August 7, 2014, 1:55pm

Hi Atul,

Thanks for contacting support and sorry for the delayed response.

Can you please share the source TIFF image so that we can test the scenario at our end? We are sorry for this inconvenience.