Convert Non-Searchable PDF to Searchable PDF using Aspose.PDF

codewarior · February 2, 2016, 11:20pm

Hari2506:

Hi Team,

It's been some time now.

Can you please let us know the status?

Hi Harikrishna,

Thanks for your patience.

The issue reported earlier is still pending review, as the team has been busy fixing other previously reported issues. Nevertheless, the team will start investigating this problem as per their schedule, and as soon as we have further updates, we will let you know.

Hari2506 · February 9, 2016, 7:24pm

OK, noted. Please keep us updated.

tilal.ahmad · February 9, 2016, 9:54pm

Hi Hari,

Thanks for your inquiry. I am afraid the issue you reported above is still pending investigation, as the product team is busy resolving issues that were reported earlier in the queue. We will notify you as soon as we make significant progress towards its resolution.

Thanks for your patience and cooperation.

Best Regards,

Hari2506 · March 18, 2016, 3:10am

hi tilal,

Any update for this issue?

Thanks & Regards
Hari

codewarior · March 20, 2016, 10:09am

Hi Hari,

Thanks for your patience.

I am afraid the issue reported earlier is not yet resolved. However, your concerns have been shared with the product team, and as soon as we have some definite updates, we will let you know.

Hari2506 · April 3, 2016, 10:32pm

Hi Support Team,

My name is Victor, and I’m the Project Manager.

The problem with converting files to searchable PDFs has been logged since 17 Dec 2015.

It has been more than 3 months, and it is still not resolved.

This issue has been escalated as a production issue by a Government agency in Singapore. We hope you can expedite the resolution by this week.

Regards.

tilal.ahmad · April 4, 2016, 10:42pm

Hi Victor,

We are sorry for the inconvenience. Please note that our product team schedules issue investigation and resolution on a first come, first served basis, and we feel this is the fairest and most appropriate way to satisfy the needs of the majority of our customers.

However, we have recorded your concern, raised the priority of the issue, and requested our product team to complete the investigation and share a solution/ETA at their earliest. We will keep you updated about the issue resolution progress.

Thanks for your patience and cooperation.

Best Regards,

Hari2506 · April 5, 2016, 12:00am

Hi,

We are avid supporters of ASPOSE on pdf conversion software, and have been recommending your product to our clients.

This issue has been open for more than 3 months, and there is no resolution.

Can I escalate the case to priority 1? Please let me know whether the issue can be resolved by tomorrow, as we are meeting to discuss its progress.

Regards.

Victor Ang.

tilal.ahmad · April 5, 2016, 10:46pm

Hi Victor,

Hari2506:
We are avid supporters of ASPOSE on pdf conversion software, and have been recommending your product to our clients.

Thanks for your support and for recommending Aspose.Pdf to your clients.

Hari2506:

This issue has been open for more than 3 months, and there is no resolution.

Can I escalate the case to priority 1? Please let me know whether the issue can be resolved by tomorrow, as we are meeting to discuss its progress.

We have raised the priority of your issue within free support and requested the product team to review the issue and share their findings/ETA as soon as possible.

Furthermore, to escalate the issue priority further, you may purchase paid support. Aspose offers three different types of enhanced support. Please check the following links for the details, prices, and enhanced support FAQs.

Priority Support
Please note that purchasing paid support does not guarantee immediate resolution of the issue, but it gives the issue precedence over normal support, and the development team starts investigating it on a priority basis. Purchasing paid support will definitely raise your issue priority, but a resolution or hotfix is subject to the complexity and priority of the issue.

Furthermore, if you have any query related to enhanced support, please post it here. We will be more than happy to help you.

We are sorry for the inconvenience caused.

Best Regards,

Hari2506 · April 5, 2016, 11:50pm

Hi ASPOSE Support,

I have attached 4 documents for which the text was extracted from the non-searchable pdf by Google Tesseract (out.html).

However, during the creation of the searchable pdf, we encountered an ASPOSE exception (ASPOSEexception.txt).

I have attached the files for your reference.

Please comment on these specific instances of the issues.

Thanks.

tilal.ahmad · April 6, 2016, 11:17pm

Hi there,

Thanks for your inquiry. I have tested the scenario with Aspose.Pdf for Java 11.3.0 and JDK 1.7 on Win 7 64 bit using the following code snippet, but I am afraid I am unable to replicate the exception. Please find the attached searchable PDF output for reference. Please share some more details so that we can replicate the issue.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import javax.imageio.ImageIO;
import com.aspose.pdf.Document;
import com.aspose.pdf.Document.CallBackGetHocr;

// myDir (used below) is the working folder path, defined elsewhere and ending with a path separator.
Document doc = new Document("C:\\Users\\Home\\Downloads\\docs_withexception\\docs_withexception\\docs\\IMG2_150dpi\\IMG2_150dpi.pdf");

// Create the callback that recognizes text in the PDF images. Any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR) can be used.
// We have used the free Google Tesseract OCR (http://en.wikipedia.org/wiki/Tesseract_%28software%29).
CallBackGetHocr cbgh = new CallBackGetHocr()
{
    @Override
    public String invoke(java.awt.image.BufferedImage img)
    {
        // Save the page image so that Tesseract can read it from disk
        File outputfile = new File(myDir + "test.jpg");
        try
        {
            ImageIO.write(img, "jpg", outputfile);
        } catch (IOException e1)
        {
            e1.printStackTrace();
        }
        // Run Tesseract in hOCR mode; it writes its result to out.html
        try
        {
            java.lang.Process process = Runtime.getRuntime()
                    .exec("C:/Program Files (x86)/Tesseract-OCR/tesseract" + " " + myDir + "test.jpg" + " " + myDir + "out hocr");
            System.out.println("tesseract" + " " + myDir + "test.jpg" + " " + myDir + "out hocr");
            process.waitFor();
        } catch (IOException e)
        {
            e.printStackTrace();
        } catch (InterruptedException e)
        {
            e.printStackTrace();
        }
        // Read out.html into a string
        File file = new File(myDir + "out.html");
        StringBuilder fileContents = new StringBuilder((int) file.length());
        Scanner scanner = null;
        try
        {
            scanner = new Scanner(file);
            String lineSeparator = System.getProperty("line.separator");
            while (scanner.hasNextLine())
            {
                fileContents.append(scanner.nextLine() + lineSeparator);
            }
        } catch (FileNotFoundException e)
        {
            e.printStackTrace();
        } finally
        {
            if (scanner != null)
                scanner.close();
        }
        return fileContents.toString();
    }
};
// End callback

doc.convert(cbgh);
doc.save(myDir + "IMG2_dpi_output.pdf");
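The callback above launches Tesseract through Runtime.exec with a single command string. The following is only a minimal sketch of an alternative, assuming a hypothetical helper class TesseractRunner (it is not part of Aspose.Pdf, and its name and parameters are made up for illustration). It passes each argument separately through ProcessBuilder, so paths containing spaces need no quoting, and it reports a failed Tesseract run instead of silently reading a stale or missing output file. It also assumes that the hOCR file is UTF-8 encoded.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Aspose.Pdf.
class TesseractRunner {

    // Builds the argument list. Every path is its own list element, so a path
    // such as "C:/Program Files (x86)/..." is passed as a single argument.
    static List<String> buildCommand(String tesseractExe, File image, File outputBase) {
        List<String> command = new ArrayList<String>();
        command.add(tesseractExe);
        command.add(image.getPath());
        command.add(outputBase.getPath());
        command.add("hocr"); // same "hocr" config argument as in the thread's snippet
        return command;
    }

    // Runs Tesseract and returns the content of hocrFile, i.e. the file that
    // Tesseract is expected to produce (out.html in the thread).
    // Assumption: the file is UTF-8 encoded.
    static String runAndReadHocr(String tesseractExe, File image, File outputBase, File hocrFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(tesseractExe, image, outputBase));
        // Forward Tesseract's messages to this console so the child process
        // cannot block on an unread output pipe.
        builder.inheritIO();
        int exitCode = builder.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("Tesseract exited with code " + exitCode);
        }
        return new String(Files.readAllBytes(hocrFile.toPath()), StandardCharsets.UTF_8);
    }
}
```

Inside invoke, the Tesseract call and the Scanner loop could then be replaced by a single call such as TesseractRunner.runAndReadHocr("C:/Program Files (x86)/Tesseract-OCR/tesseract", new File(myDir + "test.jpg"), new File(myDir + "out"), new File(myDir + "out.html")), with the checked exceptions handled as before.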

We are sorry for the inconvenience caused.

Best Regards,

Hari2506 · April 13, 2016, 1:14am

Hi Tilal,

We are using the same configuration:

JDK 1.7

Windows 7, 64 bit

Aspose PDF 11.3.0

Same snippet.

For your reference, all the configuration details are attached here, but the same exception still occurs.

Thanks & Regards

Hari

tilal.ahmad · April 14, 2016, 1:04am

Hi Hari,

Thanks for sharing your sample code. I have tested the code with a minor amendment, changing the Google Tesseract OCR path as follows, and I am unable to reproduce the reported exception. Can you please check your Google Tesseract OCR version? I am using Tesseract-OCR V3.02, so the version might be the issue.

…
java.lang.Process process = Runtime.getRuntime().exec(
        "C:/Program Files (x86)/Tesseract-OCR/tesseract" + " " + myDirOcr
        + "test.jpg" + " " + myDirOcr + "out hocr");
....

We are truly sorry for the inconvenience caused.

Best Regards,

Hari2506 · April 14, 2016, 1:57am

Hi Tilal,

We are using the same Tesseract version; it is attached here for your reference.

Thanks

Hari.

tilal.ahmad · April 15, 2016, 12:33am

Hi Hari,

Thanks for your feedback. It is difficult to suggest anything without replicating the issue at our end. However, I have logged an investigation ticket, PDFNEWJAVA-35737, in our issue tracking system and requested our product team to look into it and suggest accordingly. We will update you as soon as we get feedback.

Meanwhile, we would appreciate it if you could test the scenario on some other machine and share the results.

We are sorry for the inconvenience caused.

Best Regards,

codewarior · April 17, 2016, 4:11am

Hi Hari,

Thanks for your patience.

We are pleased to share that the issue PDFNEWJAVA-35395 reported earlier is resolved in the latest hotfix, Aspose.Pdf for Java 11.4.1. However, please note that the text extracted by Tesseract OCR can be incorrect due to the quality of the image or the specific fonts used.

The text can be reviewed and edited before the invoke method returns. Also, we have implemented an additional method that makes the searchable text visible, which can be useful for checking the text for errors. The following method has been added to com.aspose.pdf.Document:

public boolean convert(Document.CallBackGetHocr callback,
boolean isTestVisible)
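As an illustration of reviewing the text before invoke returns, here is a minimal sketch that assumes a hypothetical helper class HocrReview and a made-up corrections map (neither is part of Aspose.Pdf or Tesseract). It replaces known OCR misreadings only in the text between tags, so hOCR attributes such as the bbox coordinates stay untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Aspose.Pdf.
class HocrReview {

    // Applies each correction (wrong text -> right text) to the text between
    // tags and copies the tags themselves, including attributes, unchanged.
    static String correctText(String hocr, Map<String, String> corrections) {
        StringBuilder result = new StringBuilder(hocr.length());
        int pos = 0;
        while (pos < hocr.length()) {
            int tagStart = hocr.indexOf('<', pos);
            if (tagStart < 0) {
                tagStart = hocr.length(); // no more tags: the rest is plain text
            }
            String text = hocr.substring(pos, tagStart);
            for (Map.Entry<String, String> entry : corrections.entrySet()) {
                text = text.replace(entry.getKey(), entry.getValue());
            }
            result.append(text);
            if (tagStart == hocr.length()) {
                break;
            }
            int tagEnd = hocr.indexOf('>', tagStart);
            if (tagEnd < 0) {
                tagEnd = hocr.length() - 1; // unterminated tag: copy it as-is
            }
            result.append(hocr, tagStart, tagEnd + 1);
            pos = tagEnd + 1;
        }
        return result.toString();
    }

    // Example corrections; the entries are invented for this sketch.
    static Map<String, String> sampleCorrections() {
        Map<String, String> corrections = new LinkedHashMap<String, String>();
        corrections.put("lnvoice", "Invoice");
        return corrections;
    }
}
```

In the callback, the final line could then read return HocrReview.correctText(fileContents.toString(), HocrReview.sampleCorrections());. Going by the signature quoted above, calling doc.convert(cbgh, true) should make the recognized text visible so the corrections can be checked in the output.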

For your reference, I have also attached the output generated over my end.

Hari2506 · April 17, 2016, 10:29pm

Hi Nayyer Shahbaz,

I tried to download the latest 11.4.1 version, but it is not available on the download page. Please find the attachment.

Thanks

Hari

tilal.ahmad · April 17, 2016, 10:59pm

Hi Hari,

Thanks for your inquiry. Please note that the Aspose.Pdf for Java 11.4.1 hotfix is not published in the download section, as it includes some specific fixes. Please use the Dropbox link shared in the post above to download the hotfix.

Best Regards,

Hari2506 · April 19, 2016, 12:29am

Hi,

I tried the Dropbox link, but the site never opens. Please find the attachment.

Regards

Hari

tilal.ahmad · April 19, 2016, 12:15pm

Hi Hari,

Thanks for your inquiry. We have tested the shared Dropbox link again; it is working as expected, and some other customers have downloaded the hotfix successfully as well. Please try the download again; it should work. Please also double check your internet settings, and you may try switching to another connection.

We are sorry for the inconvenience caused.

Best Regards,