PDF text extraction failure

Hi


I am using Aspose.Pdf for Java 9.3.0. On one of our machines, text extraction from PDF files fails. The stack trace is at the end of this post.

We are using Java version "1.7.0_60", running on Ubuntu: "Linux xxxx 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux"

However, the same code works on Mac OS X and CentOS. How can we debug this problem?

Exception in thread "main" java.lang.NullPointerException
at com.aspose.pdf.internal.p426.z12.m1(Unknown Source)
at com.aspose.pdf.internal.p426.z12.m7(Unknown Source)
at com.aspose.pdf.internal.p426.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p426.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p426.z14.m6(Unknown Source)
at com.aspose.pdf.internal.p426.z14.(Unknown Source)
at com.aspose.pdf.internal.p426.z14.(Unknown Source)
at com.aspose.pdf.TextAbsorber.visit(Unknown Source)
at com.aspose.pdf.facades.PdfExtractor.extractTextInternal(Unknown Source)
at com.aspose.pdf.facades.PdfExtractor.extractText(Unknown Source)

Hi Surendra,


Thanks for your inquiry. We would appreciate it if you could share your sample PDF document and code here. We will test the scenario on an Ubuntu server at our end and update you with our findings.

We are sorry for the inconvenience caused.

Best Regards,

Below is the piece of code that extracts the text. The actual PDF is failing. I tried three more PDFs and all of them fail as well.


PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(inputStream);
extractor.extractText(Charset.forName("UTF-8"));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
extractor.getText(baos);
return new String(baos.toByteArray(), "UTF-8");

Hi Surendra,


Thanks for sharing the additional information. We are trying to replicate your issue on an Ubuntu server VM and will update you with our findings soon.

Best Regards,

Hi Surendra,


Thanks for your patience. We have tested the scenario on Ubuntu 13.10 server with Aspose.Pdf for Java 9.3.1 and were unable to replicate the issue. Please download and try the latest version, Aspose.Pdf for Java 9.3.1; hopefully it will resolve the issue. If the issue persists, please share your Ubuntu version so that we can try to replicate the issue on that specific version.

We are sorry for the inconvenience caused.

Best Regards,

Hi
Thanks for the update. Our Ubuntu version is 12.04.

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise

In the meantime, we will try 9.3.1.

Thanks
Suren

Hi

I tried 9.3.1 and it’s also failing with the same error.

Exception in thread "main" java.lang.NullPointerException
at com.aspose.pdf.internal.p428.z12.m1(Unknown Source)
at com.aspose.pdf.internal.p428.z12.m7(Unknown Source)
at com.aspose.pdf.internal.p428.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p428.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p428.z14.m6(Unknown Source)
at com.aspose.pdf.internal.p428.z14.(Unknown Source)
at com.aspose.pdf.internal.p428.z14.(Unknown Source)
at com.aspose.pdf.TextAbsorber.visit(Unknown Source)
at com.aspose.pdf.facades.PdfExtractor.extractTextInternal(Unknown Source)
at com.aspose.pdf.facades.PdfExtractor.extractText(Unknown Source)

Thanks
Suren

Hi Suren,


Thanks for your feedback. I am preparing the required platform to simulate your environment. As soon as everything is set up, I will test the issue on my end and advise you accordingly.

We are sorry for the inconvenience faced.

Best Regards,

Hi

Thanks for the update. Is there any way I can provide debug logs? We are stuck on this issue and unable to proceed, and our deadlines are fast approaching.

Thanks
Suren

Hi Suren,


Thanks for your inquiry. Yes, you can share the debug logs in this thread: either paste them into the body of a message or attach them. We will look into them and guide you accordingly.

We are sorry for the inconvenience caused.

Best Regards,

Hi


My question about debug logs in the previous post was vague. My actual question is: how do I capture debug logs for this problem?

Thanks
Suren

Hi Suren,


Thanks for your feedback, and sorry for the confusion. The Aspose.Pdf for Java jar is obfuscated, so you cannot get any further logs from it. We are investigating the issue and will keep you updated.

Best Regards,
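
Since the library itself cannot produce more detailed logs, one thing that can still be attached to a report like this is a snapshot of the JVM environment, so that the failing Ubuntu machine can be compared with the working Mac OS X and CentOS ones. The following is only a minimal sketch using standard JDK calls; the class name `EnvReport` and the choice of properties are assumptions for illustration, not something Aspose asks for.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: collects JVM/OS settings worth comparing between
// a machine where extraction works and one where it fails.
final class EnvReport {

    // System properties that commonly differ between hosts.
    private static final String[] KEYS = {
        "java.version", "java.vendor", "os.name",
        "os.version", "os.arch", "java.awt.headless"
    };

    static Map<String, String> collect() {
        Map<String, String> info = new LinkedHashMap<String, String>();
        for (String key : KEYS) {
            // String.valueOf turns an unset property into the text "null".
            info.put(key, String.valueOf(System.getProperty(key)));
        }
        info.put("default.locale", Locale.getDefault().toString());
        info.put("default.charset", Charset.defaultCharset().name());
        return info;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : collect().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running this on both the failing and the working machine and diffing the output gives a quick view of locale, encoding, and platform differences.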

Hi

Were you able to reproduce the problem? Kindly let us know the progress and the expected fix date. This will be a critical factor in our decision to buy the Aspose license.

Thanks
Suren

Hi Suren,


We are sorry for the inconvenience. I am afraid the required VM is not ready yet; we are working on it and will update you soon.

Meanwhile, could you please install the Microsoft fonts and set the default locale to English before instantiating any Aspose.Pdf object, as follows, and share the results? We have recently noticed some issues related to font and locale settings.

sudo apt-get install ttf-mscorefonts-installer

Locale.setDefault(Locale.ENGLISH);

Best Regards,
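
Note that `Locale.setDefault` changes the default for the whole JVM, not just for the calling code. If that is a concern, the change can be limited to the duration of the extraction call and then undone. The sketch below shows one way to do that with plain JDK classes; `DefaultLocaleScope` and `callWith` are hypothetical names, and the task body is a placeholder where the PDF extraction would go.

```java
import java.util.Locale;
import java.util.concurrent.Callable;

// Hypothetical helper: runs a task with a temporary JVM default locale
// and restores the previous default afterwards.
final class DefaultLocaleScope {

    static <T> T callWith(Locale locale, Callable<T> task) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(locale);
        try {
            return task.call();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            // Wrap checked exceptions so callers need no throws clause.
            throw new RuntimeException(e);
        } finally {
            // Always restore, even if the task failed.
            Locale.setDefault(previous);
        }
    }

    public static void main(String[] args) {
        String seen = callWith(Locale.ENGLISH, new Callable<String>() {
            @Override
            public String call() {
                // Placeholder: the PDF text extraction would run here.
                return Locale.getDefault().toString();
            }
        });
        System.out.println("locale during task: " + seen);
    }
}
```

Other threads still see the temporary default while the task runs, so this is only suitable when no other locale-sensitive work happens concurrently.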

Hi

I tried this and it’s not working. The same exception is thrown.

Thanks
Suren

Hi Suren,


Thanks for your feedback. We are working to replicate the issue and will get back to you soon.

We are sorry for the inconvenience caused.

Best Regards,

Hi suviswan,


I had exactly the same problem. It had to do with the Microsoft TrueType fonts not being installed on our Linux server. We are running the JBoss application server on SUSE Linux. However, it looks like not all of the Microsoft TrueType fonts are needed. I ended up putting just one font file (copied from C:\windows\fonts) in a newly created fonts directory under our JBoss conf directory. That seems to be enough for the PDF conversion to start. I have tested several PDF files and they are all converted correctly.

The extra folder was added with the following code:
Document.addLocalFontPath(System.getProperty("jboss.server.home.dir") + File.separator + "conf" + File.separator + "fonts");

Regards,
Joop

Hi Joop

Thanks for the response! Do you remember which font it was? I have installed the fonts package suggested by “Tilal Ahmad”, but it didn’t solve the problem. If you can remember the specific font, I can try the solution you suggested.

Thanks
Suren

Hi Suren,


Thanks for your inquiry. You can reference the installed Microsoft fonts folder on Linux as follows. Hopefully it will help you accomplish the task.

// Set font folder path… folder path of mscorefonts

String path = "/usr/share/fonts/truetype/msttcorefonts/";

// Adding a single font directory

com.aspose.pdf.Document.addLocalFontPath(path);

// setting the user list for standard font directories

//java.util.List list = com.aspose.pdf.Document.getLocalFontPaths();

//list.add(path);

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(inputStream);

PdfExtractor extractor = new PdfExtractor();

extractor.bindPdf(pdfDocument);

extractor.extractText(Charset.forName("UTF-8"));

ByteArrayOutputStream baos = new ByteArrayOutputStream();

extractor.getText(baos);

return new String(baos.toByteArray(), "UTF-8");


Please feel free to contact us for any further assistance.


Best Regards,
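
Before passing a folder to `Document.addLocalFontPath`, it can help to confirm that the folder exists on the machine and actually contains font files, since a missing or empty folder would leave the original problem in place. The following is a rough sketch using only the JDK; `FontDirCheck` and `listTrueTypeFonts` are hypothetical names, and checking for a `.ttf` extension is an assumption about what counts as a usable font file.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: lists the .ttf files directly inside a folder.
final class FontDirCheck {

    // Returns sorted file names; empty if the folder is missing or unreadable.
    static List<String> listTrueTypeFonts(String dir) {
        List<String> names = new ArrayList<String>();
        Path path = Paths.get(dir);
        if (!Files.isDirectory(path)) {
            return names;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(path)) {
            for (Path entry : stream) {
                String name = entry.getFileName().toString();
                // Extension match is case-insensitive (Arial.TTF, arial.ttf).
                if (Files.isRegularFile(entry)
                        && name.toLowerCase(Locale.ROOT).endsWith(".ttf")) {
                    names.add(name);
                }
            }
        } catch (IOException | DirectoryIteratorException e) {
            // Treat an unreadable folder the same as one with no fonts.
            names.clear();
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Default is the mscorefonts folder mentioned in this thread.
        String dir = args.length > 0
                ? args[0] : "/usr/share/fonts/truetype/msttcorefonts/";
        List<String> fonts = listTrueTypeFonts(dir);
        System.out.println(dir + ": " + fonts.size() + " TrueType font file(s)");
        for (String font : fonts) {
            System.out.println("  " + font);
        }
    }
}
```

If the list comes back empty, the font package is probably not installed in that location, and the path given to `addLocalFontPath` should be adjusted.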

Hi

Thanks ! This solved the problem. I will do more testing on our use cases and get back. Thanks again for the help.

Cheers
Suren