Convert Non-Searchable PDF to Searchable PDF using Aspose.PDF

Hi Team,

I want to convert a non-searchable PDF to a searchable PDF. Please let us know how we can achieve this.
We already have Aspose.

Please help us with this as soon as possible.

Hi there,


Thanks for your inquiry. I'm afraid creating a searchable PDF is currently not supported by Aspose components alone, as Aspose.OCR is not yet mature enough. We are facing some issues with text recognition accuracy and text coordinates. Our development team is working hard to fix these issues and is investigating some new algorithms for the purpose.

As a workaround, you can create a searchable PDF document from an image using Aspose.Pdf in combination with another OCR application that supports the hOCR standard. You can use the free Google Tesseract OCR for this purpose. Please check the following documentation link for details.


Please feel free to contact us for any further assistance.

Best Regards,

Hi Tilal,


Thanks for your reply. Can you explain what this class is? Do I need to put that piece of code in the main method of my Java class?
Please show how to fit that piece of code into our Java class.


Below is my code, please have a look. I don't see CallBackGetHocr being called. What is going wrong? Please let me know what to correct.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;

import javax.imageio.ImageIO;

import com.aspose.pdf.Document;
import com.aspose.pdf.Document.CallBackGetHocr;

public class asadad {


public static void main (String args[]){
Document doc = new Document("C:\\Users\\harikrishna\\Desktop\\Datacap9.0\\NSC.pdf");
final String myDir = "C:\\Users\\harikrishna\\Desktop\\Datacap9.0\\";
doc.convert(new CallBackGetHocr()
{
@Override
public String invoke(java.awt.image.BufferedImage img)
{
File outputfile = new File(myDir + "AA.jpg");
try
{
ImageIO.write(img, "jpg", outputfile);
} catch (IOException e1)
{
e1.printStackTrace();
}

try
{
System.out.println("tesseract" + " " + myDir + "AA.jpg" + " " + myDir + "out hocr");
java.lang.Process process = Runtime.getRuntime()
.exec("tesseract" + " " + myDir + "AA.jpg" + " " + myDir + "out hocr");
process.waitFor();

} catch (IOException e)
{
e.printStackTrace();
} catch (InterruptedException e)
{
e.printStackTrace();
}

// reading out.html to string
File file = new File(myDir + "out.html");
StringBuilder fileContents = new StringBuilder((int) file.length());
Scanner scanner = null;
try
{
scanner = new Scanner(file);
String lineSeparator = System.getProperty("line.separator");

while (scanner.hasNextLine())
{
fileContents.append(scanner.nextLine() + lineSeparator);
}
} catch (FileNotFoundException e)
{
e.printStackTrace();
} finally
{
if (scanner != null)
scanner.close();
}

// deleting temp files
File fileOut = new File(myDir + "out.html");
if (fileOut.exists())
{
fileOut.delete();
}
File fileTest = new File(myDir + "AA.jpg"); // the temporary image written above
if (fileTest.exists())
{
fileTest.delete();
}


return fileContents.toString();
}
});
doc.save("c:/pdftest/aaa.pdf");
}
}
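A possible variant of the external Tesseract call in the code above, offered only as a sketch: it passes each argument separately through ProcessBuilder, so paths containing spaces are not split, and it lets the child process inherit the console so its output is visible and cannot fill an unread pipe. The class name TesseractCommand and its methods are hypothetical helpers, not part of Aspose.Pdf or Tesseract.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.Pdf or Tesseract.
final class TesseractCommand {

    // Builds the argument list for: tesseract <image> <outputBase> hocr
    // Each argument is its own list element, so spaces in paths are preserved.
    static List<String> build(File image, File outputBase) {
        return Arrays.asList("tesseract", image.getPath(), outputBase.getPath(), "hocr");
    }

    // Starts Tesseract, waits for it to finish, and returns its exit code.
    // Assumes the "tesseract" executable can be found on the PATH.
    static int run(File image, File outputBase) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(build(image, outputBase));
        pb.inheritIO(); // show Tesseract's own messages on this console
        return pb.start().waitFor();
    }
}
```

Inside the callback, TesseractCommand.run(new File(myDir + "AA.jpg"), new File(myDir + "out")) would take the place of the Runtime.exec call. A non-zero exit code is a hint that Tesseract failed before producing any hOCR output. Tesseract 3.02, the version used in this thread, writes out.html; later versions may use a different extension for the hOCR file, so the file name read afterwards may need adjusting.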

Hi there,


Thanks for your feedback. I have tested the scenario again at my end, and it is working fine.

Please double-check whether Tesseract is installed on your system. You can check this by entering "tesseract help" at the command prompt.
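The same check can also be done from Java before calling convert. The following is a minimal sketch that scans the directories listed in the PATH environment variable for an executable named tesseract; the class PathCheck and the method isOnPath are hypothetical names used for illustration only.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of Aspose.Pdf or Tesseract.
final class PathCheck {

    // Returns true if an executable file with the given name exists in a PATH directory.
    static boolean isOnPath(String name) {
        String path = System.getenv("PATH");
        if (path == null) {
            return false;
        }
        boolean windows = System.getProperty("os.name", "").toLowerCase().startsWith("windows");
        // On Windows the executable normally carries an extension such as .exe.
        String[] suffixes = windows ? new String[] {".exe", ".cmd", ".bat", ""} : new String[] {""};
        for (String dir : path.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String suffix : suffixes) {
                File candidate = new File(dir, name + suffix);
                if (candidate.isFile() && candidate.canExecute()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("tesseract on PATH: " + isOnPath("tesseract"));
    }
}
```

If this reports false, Runtime.exec("tesseract ...") in the callback would be expected to fail with an IOException, and no hOCR output file would be produced.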

We are sorry for the inconvenience caused.

Best Regards,

Thanks…

I sorted out the issue and am able to convert a simple PDF.
However, I have another PDF with images, and I am getting the exception below.

Exception in thread "main" class com.aspose.pdf.internal.p386.z352: unexpected end of file in an attribute value Line 14, position 1027.
com.aspose.pdf.internal.p386.z554.m16(Unknown Source)
com.aspose.pdf.internal.p386.z554.m14(Unknown Source)
com.aspose.pdf.internal.p386.z554.m9(Unknown Source)
com.aspose.pdf.internal.p386.z554.m84(Unknown Source)
com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
com.aspose.pdf.internal.p386.z554.m8(Unknown Source)
com.aspose.pdf.internal.p386.z553.m8(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
com.aspose.pdf.internal.p386.z320.m2(Unknown Source)
com.aspose.pdf.internal.p386.z320.m1(Unknown Source)
com.aspose.pdf.internal.p386.z320.m17(Unknown Source)
com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
com.aspose.pdf.ADocument.convert(Unknown Source)
com.aspose.pdf.Document.convert(Unknown Source)
sg.gov.sla.ers.util.OCRTesting.main(OCRTesting.java:87)
at com.aspose.pdf.internal.p386.z554.m16(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m14(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m9(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m84(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m82(Unknown Source)
at com.aspose.pdf.internal.p386.z554.m8(Unknown Source)
at com.aspose.pdf.internal.p386.z553.m8(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m4(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m2(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m1(Unknown Source)
at com.aspose.pdf.internal.p386.z320.m17(Unknown Source)
at com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
at com.aspose.pdf.internal.p626.z5.m1(Unknown Source)
at com.aspose.pdf.ADocument.convert(Unknown Source)
at com.aspose.pdf.Document.convert(Unknown Source)

Please let us know if this is a limitation of Aspose.
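One possible reading of the message "unexpected end of file in an attribute value" is that the hOCR text returned from the callback was not well-formed XML, for example because it was truncated. This is an assumption and is not confirmed anywhere in this thread. Under that assumption, a small check like the sketch below could be run on the string before it is returned to convert. The class HocrCheck and the method findXmlError are hypothetical names; only standard JDK XML classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration; not part of Aspose.Pdf.
final class HocrCheck {

    // Returns null if the text parses as well-formed XML,
    // otherwise a short description of the first parse error.
    static String findXmlError(String hocr) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve any external DTD to an empty document so that no network access is attempted.
            builder.setEntityResolver(new EntityResolver() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    return new InputSource(new StringReader(""));
                }
            });
            builder.parse(new InputSource(new StringReader(hocr)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }
}
```

Calling HocrCheck.findXmlError(fileContents.toString()) just before the return statement of the callback would show whether the problem is already present in the file written by the OCR tool or only appears later inside convert.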

Hi there,


Thanks for your feedback. It is good to know that you are able to test the feature.

Furthermore, we would appreciate it if you could share your problematic PDF document here. We will look into it and guide you accordingly.

We are sorry for the inconvenience caused.

Best Regards,

As requested

Hi there,


Thanks for sharing the source document. I have tested the scenario with Aspose.Pdf for Java 11.0.0 and was unable to reproduce the reported exception. It seems to be an OCR-related issue, as incorrect data is extracted. Please find the resultant documents attached.

We are sorry for the inconvenience caused.

Best Regards,

Hi,
Can you please advise which version we need to use in order to achieve this?

Do you mean that only Java 11.0 supports this, and not the earlier versions? I am using the same OCR that you suggested.

Can you please provide the exact OCR version, Aspose version, and Java version to be used for this?

Please reply as soon as possible.

Hi Tilal,


We are close but not close enough.

I am attaching my java file for your reference.

Following your suggestion, I downloaded the following JARs and used them.

1. aspose-ocr-3.0.0.jar
2. aspose.pdf-11.0.0.jar

Can you please incorporate my attached Java code and give it a try?
I am using the same Tesseract that you suggested. Is there any particular version of it that should be used? Kindly look into this as soon as possible.

Regards
Hari

To add to this, the Tesseract version that I am using is tesseract-ocr-setup-3.02.02, which is the one from the link you sent.


Also, one more thing: I commented out the piece of code that deletes the HTML and JPEG files.
But the JPEG contains only one page.

Please solve this as soon as possible.
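If the convert callback runs once per page image (an assumption based on the single-JPEG observation above, not something confirmed in this thread), a fixed file name such as AA.jpg would be overwritten on every call, which would leave only one image on disk. A minimal sketch of giving each call its own temporary file follows; PageTempFiles and its methods are hypothetical helper names.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of Aspose.Pdf or Tesseract.
final class PageTempFiles {

    // Creates a new, uniquely named .jpg file in the given directory.
    static File newPageImage(File dir) {
        try {
            return File.createTempFile("page-", ".jpg", dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Tesseract takes an output base name without extension;
    // this strips the extension from the image file name.
    static File outputBaseFor(File image) {
        String name = image.getName();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return new File(image.getParentFile(), base);
    }
}
```

With this, each invocation would write its image to newPageImage(new File(myDir)) and pass outputBaseFor(image) to Tesseract, so the images and hOCR files of all pages remain available for inspection when the deletion code is commented out.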

Hi ,


I created a dummy file and it worked, but the one I shared with you earlier did not work for me.

Please take a look.

Also, the output PDF that you shared is still non-searchable.

Hi there,


I am sorry for any confusion. As I have already stated above, Aspose.OCR is not mature enough, so we recommended using another OCR application that supports the hOCR standard, e.g. Tesseract OCR. In post (678157) I meant that Tesseract OCR is not reading the text from the image in your source PDF correctly, and that is causing the issue. You may try another OCR application for the purpose and use the CallBackGetHocr() method for converting a non-searchable PDF to a searchable PDF.

Furthermore, I am using Aspose.Pdf for Java 11.0.0 and Tesseract OCR V3.02 as well.

We are sorry for the inconvenience caused.

P.S.: We have also logged an investigation ticket, PDFNEWJAVA-35395, and we will update you if we can suggest another solution.

Best Regards,

But the exception is in inputPDF.convert(cbgh);

So can you please explain why Tesseract is the cause?

Hi there,


Thanks for your feedback. I am afraid I am unable to reproduce the reported exception; I only get incorrect output due to wrong OCR text.

Moreover, as stated above, I am using Aspose.Pdf for Java 11.0.0 and Tesseract OCR V3.02 with JDK 1.7. I would appreciate it if you could share some details to replicate the exception at my end for further investigation.

We are sorry for the inconvenience caused.

Best Regards,

Hi Team,


It has been a while now. Can you please tell us whether you were able to convert the PDF I shared with you into a searchable PDF? The output you shared with me is still not searchable.

Please act on this as soon as possible.

We need to know whether this is possible or not. We are using the same specifications that you are using.

Hi Harikrishna,


Thanks for sharing the details.

As shared earlier by Tilal in 678410, we were unable to reproduce any exception when using Aspose.Pdf for Java 11.0.0 and Tesseract OCR V3.02 with JDK 1.7. However, as per our observations, Tesseract OCR is extracting incorrect text from the image, and we have already logged an investigation ticket, PDFNEWJAVA-35395, in our issue tracking system. We will look further into the details of this problem and will keep you posted on the status of the correction. Please be patient and allow us a little time. We are sorry for this inconvenience.

Hi Team,


It has been some time now.
Can you please let us know the status?

Regards
Hari