Convert Non-Searchable PDF to Searchable PDF using Aspose.PDF

Hari2506:
Hi Team,

It's been some time now.
Can you please let us know the status?
Hi Harikrishna,

Thanks for your patience.

The issue reported earlier is still pending review, as the team has been busy fixing other previously reported issues. The team will start investigating this problem as per their schedule, and we will let you know as soon as we have further updates.

OK, noted. Please keep us updated.

Hi Hari,


Thanks for your inquiry. I am afraid the issue you reported above is still pending investigation, as the product team is busy resolving issues reported earlier in the queue. We will notify you as soon as we make significant progress towards resolving it.

Thanks for your patience and cooperation.

Best Regards,

Hi Tilal,

Any update on this issue?

Thanks & Regards
Hari

Hi Hari,


Thanks for your patience.

I am afraid the issue reported earlier is not yet resolved. However, your concerns have been shared with the product team, and as soon as we have some definite updates, we will let you know.

Hi Support Team,

My name is Victor, and I’m the Project Manager.
The problem with converting files to searchable PDFs has been logged since 17 Dec 2015.

It has been more than 3 months, and it is still not resolved.

This issue has been escalated as a production issue by a Government agency in Singapore. We hope you can expedite the resolution by this week.

Regards.

Hi Victor,


We are sorry for the inconvenience. Please note that our product team schedules the investigation and resolution of issues on a first come, first served basis, and we feel this is the fairest and most appropriate way to satisfy the needs of the majority of our customers.

However, we have recorded your concern, raised the priority of the issue, and requested our product team to complete the investigation and share a solution/ETA at their earliest. We will keep you updated on the progress of the issue resolution.

Thanks for your patience and cooperation.

Best Regards,

Hi,

We are avid supporters of ASPOSE PDF conversion software, and have been recommending your product to our clients.

This issue has been open for more than 3 months, and there is no resolution.

Can I escalate the case to priority 1? Please let me know whether the issue can be resolved by tomorrow, as we are meeting to review progress on this.

Regards.

Victor Ang.
Hi Victor,

Hari2506:
We are avid supporters of ASPOSE PDF conversion software, and have been recommending your product to our clients.


Thanks for your support and for recommending Aspose.Pdf to your clients.

Hari2506:

This issue has been open for more than 3 months, and there is no resolution.

Can I escalate the case to priority 1? Please let me know whether the issue can be resolved by tomorrow, as we are meeting to review progress on this.


We have raised the priority of your issue within free support and requested the product team to review it and share their findings/ETA as soon as possible.

Furthermore, to escalate the issue priority further, you may purchase paid support. Aspose offers three different types of enhanced support. Please check the following links for details, prices, and the enhanced support FAQs.

  • Priority Support

Please note that purchasing paid support does not guarantee immediate resolution of the issue, but it gives the issue precedence over normal support, and the development team starts investigating it on a priority basis. Resolution or a hotfix remains subject to the complexity and priority of the issue.

Furthermore, if you have any query related to enhanced support, please post here. We will be more than happy to help you.

We are sorry for the inconvenience caused.

Best Regards,

Hi ASPOSE Support,

I have attached 4 documents where the text was extracted from the non-searchable PDF by Google Tesseract (out.html).

However, during the creation of the searchable PDF, we encountered an ASPOSE exception (ASPOSEexception.txt).

I have attached the files for your reference.

Please comment on these specific instances of the issue.



Thanks.

Hi there,


Thanks for your inquiry. I have tested the scenario with Aspose.Pdf for Java 11.3.0 and JDK 1.7 on Windows 7 64-bit using the following code snippet, but I am afraid I am unable to replicate the exception. Please find attached the searchable PDF output for reference. Please share some more details so that we can replicate the issue.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import javax.imageio.ImageIO;
import com.aspose.pdf.Document;
import com.aspose.pdf.Document.CallBackGetHocr;

// myDir is the working directory path (ending with a separator); it is defined elsewhere.
Document doc = new Document("C:\\Users\\Home\\Downloads\\docs_withexception\\docs_withexception\\docs\\IMG2_150dpi\\IMG2_150dpi.pdf");

// Create the callback that recognizes text in the PDF images. Use an external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR).
// We have used the free Google Tesseract OCR (http://en.wikipedia.org/wiki/Tesseract_%28software%29).
CallBackGetHocr cbgh = new CallBackGetHocr()
{
    @Override
    public String invoke(java.awt.image.BufferedImage img)
    {
        // Save the page image to disk so that Tesseract can read it.
        File outputfile = new File(myDir + "test.jpg");
        try
        {
            ImageIO.write(img, "jpg", outputfile);
        } catch (IOException e1)
        {
            e1.printStackTrace();
        }

        // Run Tesseract in hOCR mode and wait for it to finish.
        try
        {
            java.lang.Process process = Runtime.getRuntime()
                    .exec("C:/Program Files (x86)/Tesseract-OCR/tesseract" + " " + myDir + "test.jpg" + " " + myDir + "out hocr");
            System.out.println("tesseract" + " " + myDir + "test.jpg" + " " + myDir + "out hocr");
            process.waitFor();
        } catch (IOException e)
        {
            e.printStackTrace();
        } catch (InterruptedException e)
        {
            e.printStackTrace();
        }

        // Read out.html into a string.
        File file = new File(myDir + "out.html");
        StringBuilder fileContents = new StringBuilder((int) file.length());
        Scanner scanner = null;
        try
        {
            scanner = new Scanner(file);
            String lineSeparator = System.getProperty("line.separator");
            while (scanner.hasNextLine())
            {
                fileContents.append(scanner.nextLine() + lineSeparator);
            }
        } catch (FileNotFoundException e)
        {
            e.printStackTrace();
        } finally
        {
            if (scanner != null)
                scanner.close();
        }
        return fileContents.toString();
    }
};
// End of callback

doc.convert(cbgh);
doc.save(myDir + "IMG2_dpi_output.pdf");
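The step in the callback that reads out.html could also be written more compactly. The following is only a minimal sketch, assuming Java 7 or later and assuming the hOCR file is UTF-8 encoded; HocrFileReader and readHocr are hypothetical names used for illustration and are not part of Aspose.Pdf.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not an Aspose.Pdf class): loads the hOCR output that
// the OCR engine wrote to disk so it can be returned from invoke().
final class HocrFileReader {
    private HocrFileReader() {}

    // Returns the file content as a string, or "" when the file does not
    // exist (for example, if the OCR step failed to produce any output).
    // Assumption: the file is UTF-8 encoded.
    static String readHocr(Path hocrFile) throws IOException {
        if (!Files.isRegularFile(hocrFile)) {
            return "";
        }
        return new String(Files.readAllBytes(hocrFile), StandardCharsets.UTF_8);
    }
}
```

Unlike the Scanner loop, this keeps the line endings exactly as they appear in the file and makes the encoding explicit.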


We are sorry for the inconvenience caused.

Best Regards,

Hi Tilal,


We are using the same configuration:

JDK 1.7
Windows 7, 64 bit
Aspose PDF 11.3.0
Same snippet.

For your reference, all the configuration details are attached here, but the same exception still occurs.

Thanks & Regards
Hari

Hi Hari,


Thanks for sharing your sample code. I have tested the code with a minor amendment, changing the Google Tesseract OCR path as follows, and I was unable to reproduce the reported exception. Can you please check your Google Tesseract OCR version? I am using Tesseract-OCR v3.02; a version difference might be the issue.

…
java.lang.Process process = Runtime.getRuntime().exec(
        "C:/Program Files (x86)/Tesseract-OCR/tesseract" + " " + myDirOcr
        + "test.jpg" + " " + myDirOcr + "out hocr");
....
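As a side note, Runtime.exec(String) splits the command string on whitespace, which can be fragile when the executable path or the working directory contains spaces. One possible alternative is to pass each argument separately through ProcessBuilder. This is only a sketch under assumptions: TesseractRunner, buildCommand and run are hypothetical names, and the argument order simply mirrors the "tesseract <image> <output base> hocr" form used in the snippet above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Aspose.Pdf or Tesseract).
final class TesseractRunner {
    private TesseractRunner() {}

    // Builds the argument list; each element is passed to the process as one
    // argument, so spaces inside a path do not split it.
    static List<String> buildCommand(String tesseractExe, String imagePath, String outputBase) {
        return Arrays.asList(tesseractExe, imagePath, outputBase, "hocr");
    }

    // Starts the OCR process, waits for it to finish, and returns its exit code.
    static int run(String tesseractExe, String imagePath, String outputBase)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(tesseractExe, imagePath, outputBase));
        builder.inheritIO(); // forward the tool's console output so the process cannot block on a full pipe
        Process process = builder.start();
        return process.waitFor();
    }
}
```

A call such as TesseractRunner.run("C:/Program Files (x86)/Tesseract-OCR/tesseract", myDirOcr + "test.jpg", myDirOcr + "out") would then replace the exec(...) and waitFor() lines; checking the returned exit code also shows whether the OCR step itself failed.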


We are truly sorry for the inconvenience caused.

Best Regards,

Hi Tilal,


We are using the same Tesseract version; the details are attached here for your reference.

Thanks
Hari.

Hi Hari,


Thanks for your feedback. It is difficult to suggest anything without replicating the issue at our end. However, I have logged an investigation ticket, PDFNEWJAVA-35737, in our issue tracking system and requested our product team to look into it and advise accordingly. We will update you as soon as we get feedback.

Meanwhile, we would appreciate it if you could test the scenario on another machine and share the results.

We are sorry for the inconvenience caused.

Best Regards,

Hi Hari,


Thanks for your patience.

We are pleased to share that the issue PDFNEWJAVA-35395 reported earlier is resolved in the latest hotfix, Aspose.Pdf for Java 11.4.1. However, please note that the text extracted by Tesseract-OCR can be incorrect due to the quality of the image or the specific fonts used.

The text can be reviewed and edited before the invoke method returns. Also, we implemented an additional method that makes the searchable text visible. This can be useful for checking the text for errors. The following method was implemented in com.aspose.pdf.Document:

public boolean convert(Document.CallBackGetHocr callback,
boolean isTestVisible)
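To illustrate the point about editing the text before invoke returns, here is a minimal sketch of a post-processing step applied to the hOCR string. HocrCorrections and apply are hypothetical names, not part of Aspose.Pdf, and the corrections map is something the caller would have to build from their own review of the OCR output.

```java
import java.util.Map;

// Hypothetical helper (not an Aspose.Pdf class): replaces known OCR
// misreadings in the hOCR text before it is handed back to the converter.
final class HocrCorrections {
    private HocrCorrections() {}

    // Applies each "wrong -> right" pair as a literal replacement.
    // Caveat: this is plain string replacement, so keys should be specific
    // enough not to match hOCR markup or attribute values.
    static String apply(String hocr, Map<String, String> corrections) {
        String result = hocr;
        for (Map.Entry<String, String> entry : corrections.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Inside the callback, the last line could then become something like return HocrCorrections.apply(fileContents.toString(), corrections); where corrections is a map supplied by the caller.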

For your reference, I have also attached the output generated at my end.

Hi Nayyer Shahbaz,


I tried to download the latest 11.4.1 version, but it is not available on the download page. Please find the attachment.

Thanks
Hari

Hi Hari,


Thanks for your inquiry. Please note that the Aspose.Pdf for Java 11.4.1 hotfix is not published in the download section, as it includes some specific fixes. Please use the Dropbox link shared in the post above to download the hotfix.

Best Regards,

Hi,


I tried the Dropbox link, but the site never opens. Please find the attachment.

Regards
Hari


Hi Hari,


Thanks for your inquiry. We have tested the shared Dropbox link again; it is working as expected, and some other customers have downloaded the hotfix successfully as well. Please try to download it again. If it still fails, please double-check your internet settings or try switching to a different connection.

We are sorry for the inconvenience caused.

Best Regards,