Convert Non-Searchable PDF to Searchable PDF using Aspose.PDF

Hari2506 · December 17, 2015, 5:10pm

Hi Team,

I want to convert a non-searchable PDF to a searchable PDF. Please let us know how we can achieve this.
We already have Aspose with us.

Please help us with this ASAP.

tilal.ahmad · December 18, 2015, 12:29am

Hi there,

Thanks for your inquiry. I’m afraid creating a searchable PDF is currently not supported by Aspose components alone, as Aspose.Ocr is not yet mature. We are facing some issues with text recognition accuracy and text coordinates. Our development team is working hard to fix these issues and is investigating some new algorithms for the purpose.

As a workaround, you can create a searchable PDF document from an image using Aspose.Pdf in combination with another OCR application that supports the hOCR standard. You can use the free Google Tesseract OCR for this purpose. Please check the following documentation link for details.

Convert Non-Searchable PDF to Searchable PDF

Please feel free to contact us for any further assistance.

Best Regards,

Hari2506 · December 18, 2015, 1:13am

Hi Tilal,

Thanks for your reply. Can you explain what this class is? Do I need to put that piece of code in the main method of my Java class?

Please show how to fit that piece of code into our Java class.

Hari2506 · December 18, 2015, 6:39am

Below is my code, please have a look. I don't see CallBackGetHocr being called. What is going wrong? Please let me know what to correct.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import javax.imageio.ImageIO;
import com.aspose.pdf.Document;
import com.aspose.pdf.Document.CallBackGetHocr;

public class asadad {
    public static void main(String args[]) {
        Document doc = new Document("C:\\Users\\harikrishna\\Desktop\\Datacap9.0\\NSC.pdf");
        final String myDir = "C:\\Users\\harikrishna\\Desktop\\Datacap9.0\\";

        // convert() hands an image to the callback and expects hOCR markup back
        doc.convert(new CallBackGetHocr() {
            @Override
            public String invoke(java.awt.image.BufferedImage img) {
                // save the image so that tesseract can read it from disk
                File outputfile = new File(myDir + "AA.jpg");
                try {
                    ImageIO.write(img, "jpg", outputfile);
                } catch (IOException e1) {
                    e1.printStackTrace();
                }

                // run tesseract with the hocr config; it writes out.html into myDir
                try {
                    System.out.println("tesseract" + " " + myDir + "AA.jpg" + " " + myDir + "out hocr");
                    java.lang.Process process = Runtime.getRuntime()
                            .exec("tesseract" + " " + myDir + "AA.jpg" + " " + myDir + "out hocr");
                    process.waitFor();
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }

                // reading out.html to string
                File file = new File(myDir + "out.html");
                StringBuilder fileContents = new StringBuilder((int) file.length());
                Scanner scanner = null;
                try {
                    scanner = new Scanner(file);
                    String lineSeparator = System.getProperty("line.separator");
                    while (scanner.hasNextLine()) {
                        fileContents.append(scanner.nextLine() + lineSeparator);
                    }
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                } finally {
                    if (scanner != null)
                        scanner.close();
                }

                // deleting temp files
                File fileOut = new File(myDir + "out.html");
                if (fileOut.exists()) {
                    fileOut.delete();
                }
                // the image was written as AA.jpg above, so delete that file (not test.jpg)
                File fileTest = new File(myDir + "AA.jpg");
                if (fileTest.exists()) {
                    fileTest.delete();
                }

                return fileContents.toString();
            }
        });

        doc.save("c:/pdftest/aaa.pdf");
    }
}
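For anyone adapting the code above: the Tesseract call is built by string concatenation, which can break when a path contains spaces. The following is only a sketch of one alternative, assuming the `tesseract <image> <outputbase> hocr` command line used in the post and assuming the hOCR file is UTF-8 encoded. The class name `TesseractRunner` and its method names are hypothetical and not part of Aspose.Pdf or Tesseract.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.Pdf or Tesseract.
class TesseractRunner {

    // Builds "tesseract <image> <outputBase> hocr" as separate arguments,
    // so a path that contains spaces stays a single argument.
    static List<String> buildCommand(File image, File outputBase) {
        return Arrays.asList("tesseract", image.getPath(), outputBase.getPath(), "hocr");
    }

    // Starts Tesseract, lets it write to this process's console, and
    // returns its exit code. Requires tesseract on the PATH.
    static int run(File image, File outputBase) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(image, outputBase));
        pb.inheritIO(); // avoids blocking on an unread output pipe
        return pb.start().waitFor();
    }

    // Reads the whole hOCR file into a string (assumption: UTF-8 encoding).
    static String readHocr(File hocrFile) throws IOException {
        return new String(Files.readAllBytes(hocrFile.toPath()), StandardCharsets.UTF_8);
    }
}
```

Inside `invoke`, the `Runtime.exec` call and the `Scanner` loop could then be replaced by `TesseractRunner.run(...)` followed by `TesseractRunner.readHocr(new File(myDir + "out.html"))`. A non-zero exit code from `run` is worth logging before the result is returned to Aspose.Pdf.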

tilal.ahmad · December 21, 2015, 1:13am

Hi there,

Thanks for your feedback. I have tested the scenario again at my end, and it is working fine.

Please double-check whether Tesseract is installed on your system. You can check this by entering “tesseract help” at the command prompt.
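The same check can be made from Java, which also confirms that the JVM (and not only the command prompt) can find the executable. This is a minimal sketch; the class name `ToolCheck` is hypothetical, and the `-v` flag is an assumption about the installed Tesseract build. The result depends only on whether the process could be started, not on the flag being accepted.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for checking that an external tool can be launched.
class ToolCheck {

    // Returns true if the operating system could start the command and its
    // output could be read; false if it is missing or the wait was interrupted.
    static boolean canStart(String... command) {
        try {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            // Drain the output so the child process cannot block on a full pipe.
            InputStream out = p.getInputStream();
            byte[] buf = new byte[4096];
            while (out.read(buf) != -1) {
                // discard
            }
            p.waitFor();
            return true;
        } catch (IOException e) {
            return false; // typically: executable not found on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract found: " + canStart("tesseract", "-v"));
    }
}
```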

We are sorry for the inconvenience caused.

Best Regards,

Hari2506 · December 21, 2015, 2:45am

Thanks…

I sorted out the issue and am able to convert a simple PDF.
However, I have another PDF with images and am getting the exception below.

Exception in thread “main” class com.aspose.pdf.internal.p386.z352: unexpected end of file in an attribute value Line 14, position 1027.
com.aspose.pdf.internal.p386.z554.m16(Unknown Source)
com.aspose.pdf.internal.p386.z554.m14(Unknown Source)
com.aspose.pdf.internal.p386.z554.m9(Unknown Source)
com.aspose.pdf.internal.p386.z554.m84(Unknown Source)
com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
com.aspose.pdf.internal.p386.z554.m8(Unknown Source)
com.aspose.pdf.internal.p386.z553.m8(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m2(Unknown Source)
com.aspose.pdf.internal.p386.z320.m1(Unknown Source)
com.aspose.pdf.internal.p386.z320.m17(Unknown Source)
com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
com.aspose.pdf.ADocument.convert(Unknown Source)
com.aspose.pdf.Document.convert(Unknown Source)
sg.gov.sla.ers.util.OCRTesting.main(OCRTesting.java:87)
at com.aspose.pdf.internal.p386.z554.m16(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m14(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m9(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m84(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m8(Unknown Source)
at com.aspose.pdf.internal.p386.z553.m8(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m2(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m1(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m17(Unknown Source)
at com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
at com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
at com.aspose.pdf.ADocument.convert(Unknown Source)
at com.aspose.pdf.Document.convert(Unknown Source)

Please let us know if this is a limitation of Aspose.
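The message "unexpected end of file in an attribute value" reads like an XML parsing error. One possible reading, which is an assumption and not something confirmed in this thread, is that the hOCR string returned from the callback is truncated or not well-formed. The sketch below checks a string for well-formedness before it is handed back to `convert`. The class name `HocrCheck` is hypothetical, and the feature URI is specific to the Xerces-based parser bundled with the JDK; it stops the parser from downloading the XHTML DTD that hOCR files usually reference.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper, not part of Aspose.Pdf.
class HocrCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String hocr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: JDK's bundled Xerces parser. Skip fetching the external DTD.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(hocr)));
            return true;
        } catch (Exception e) {
            return false; // parse error, empty input, or unsupported parser feature
        }
    }
}
```

A `false` result is only a hint: with the DTD skipped, named HTML entities such as `&nbsp;` would also be reported as errors. If the check fails for one page, saving that page's hOCR file and image makes the case easier to reproduce.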

tilal.ahmad · December 21, 2015, 3:01am

Hi there,

Thanks for your feedback. It is good to know that you are able to test the feature.

Furthermore, we would appreciate it if you could share the problematic PDF document here. We will look into it and guide you accordingly.

We are sorry for the inconvenience caused.

Best Regards,

Hari2506 · December 21, 2015, 3:08am

As requested

tilal.ahmad · December 21, 2015, 10:27pm

Hi there,

Thanks for sharing the source document. I have tested the scenario with Aspose.Pdf for Java 11.0.0 and was unable to reproduce the reported exception. It seems to be an OCR-related issue, as incorrect data is extracted. Please find the resultant documents attached.

We are sorry for the inconvenience caused.

Best Regards,

Hari2506 · December 22, 2015, 1:27am

Hi,
Can you please advise which version we need to use in order to achieve this?

Do you mean that only Java 11.0 supports this and not the earlier versions? I am using the same OCR that you suggested.

Can you please provide the exact OCR version, Aspose version, and Java version to be used for this?

Please reply ASAP.

Hari2506 · December 22, 2015, 5:33am

Hi Tilal,

We are close but not close enough.

I am attaching my java file for your reference.

After your suggestion, I downloaded the following JARs and used them.

1. aspose-ocr-3.0.0.jar

2. aspose.pdf-11.0.0.jar

Can you please try it with my attached Java code?

I am using the same Tesseract that you suggested. Is there any particular version of it that should be used? Kindly look into this ASAP.

Regards

Hari

Hari2506 · December 22, 2015, 5:45am

To add on, the Tesseract version that I am using is tesseract-ocr-setup-3.02.02, which is the one from the link you sent.

One more thing: I commented out the piece of code that deletes the HTML and JPEG files.

But the JPEG has only one page.

Please solve this ASAP.

Hari2506 · December 22, 2015, 5:59am

Hi ,

I created a dummy file and it worked, but the one I shared with you earlier didn't work for me.

Please take a look.

Hari2506 · December 22, 2015, 6:01am

Also, the output PDF that you shared is still non-searchable.

tilal.ahmad · December 22, 2015, 10:55am

Hi there,

I am sorry for any confusion. As I stated above, Aspose.OCR is not mature enough, so we recommended using another OCR application that supports the hOCR standard, e.g. tesseract-OCR. In post (678157) I meant that tesseract-OCR is not reading the text from the image in your source PDF correctly, and that is causing the issue. You may try some other OCR application for the purpose and use the CallBackGetHocr() method for converting a non-searchable PDF to a searchable PDF.

Furthermore I am using Aspose.Pdf for Java 11.0.0 and tesseract-OCR V3.02 as well.

We are sorry for the inconvenience caused.

P.S.: We have also logged an investigation ticket, PDFNEWJAVA-35395, and we will update you if we can suggest some other solution.

Best Regards,

Hari2506 · December 22, 2015, 7:30pm

But the exception is thrown in inputPDF.convert(cbgh);

So can you please explain why Tesseract is the cause?

tilal.ahmad · December 22, 2015, 11:34pm

Hi there,

Thanks for your feedback. I am afraid I am unable to reproduce the reported exception; I only get incorrect output due to wrong OCR text.

Moreover, as stated above, I am using Aspose.Pdf for Java 11.0.0 and tesseract-OCR V3.02 with JDK 1.7. I would appreciate it if you could share some details that would let me replicate the exception at my end for further investigation.

We are sorry for the inconvenience caused.

Best Regards,

Hari2506 · December 29, 2015, 11:09pm

Hi Team,

It's been a while now. Can you please tell us whether you were able to convert the PDF I shared with you into a searchable one? The output you shared with me is still not searchable.

Please act on this ASAP.

We need to know whether it is possible or not. We are using the same specifications that you are using.

codewarior · December 31, 2015, 1:18am

Hi Harikrishna,

Thanks for sharing the details.

As shared earlier by Tilal in 678410, we were unable to reproduce any exception when using Aspose.Pdf for Java 11.0.0 and tesseract-OCR V3.02 with JDK 1.7. However, as per our observations, tesseract-OCR is extracting incorrect text from the image, and we have already logged an investigation ticket, PDFNEWJAVA-35395, in our issue tracking system. We will look further into the details of this problem and will keep you posted on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.

Hari2506 · February 2, 2016, 7:51pm

Hi Team,

It's been some time now.

Can you please let us know the status?

Regards

Hari