NullPointerException when attempting to get all text from PDF

Hello,


We recently updated aspose-pdf to the latest version in the Aspose Maven repository (9.5.0). I’m attempting to retrieve all text from a given PDF file. When running the code below, I receive a NullPointerException somewhere in Aspose’s code.

The code I’ve written is as follows:

// Create a new Aspose PDF Document object
Document document = new Document(new FileInputStream(file));

// Instantiate Aspose’s TextAbsorber which will allow us to read the PDF’s text
TextAbsorber textAbsorber = new TextAbsorber();

/*
* Assign all pages of the document to the TextAbsorber.
*
* We have the option to assign only a subset of pages if we want to via the
* document.getPages().get_Item(PAGE_NUMBER).accept(TEXT_ABSORBER) method chain.
*/
document.getPages().accept(textAbsorber);

// Grab the content
String content = textAbsorber.getText();

The error occurs on the call to:

document.getPages().accept(textAbsorber);

The exception:

java.lang.NullPointerException: null
at com.aspose.pdf.internal.p435.z12.m2(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.internal.p435.z12.m7(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.internal.p435.z14.m1(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.internal.p435.z14.m1(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.internal.p435.z14.m6(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.internal.p435.z14.<init>(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.internal.p435.z14.<init>(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.TextAbsorber.visit(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.Page.accept(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]
at com.aspose.pdf.PageCollection.accept(Unknown Source) ~[aspose-pdf-9.5.0-jdk16.jar:9.5.0]

We’ve confirmed that the file is not null when passing it to the aspose-pdf services. This issue ONLY occurs when running unit tests in Bamboo. When I run the unit tests locally on my Mac, the issue does not occur. Could this be related to font files missing on my build agent?
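
One way to test the missing-fonts theory is to count the font files that the build agent actually has. The sketch below uses only the Java standard library and is not part of Aspose. The class name FontDirCheck and the candidate directories are assumptions (typical Linux locations) and should be adjusted to the agent in question.

```java
import java.io.File;
import java.util.Locale;

public class FontDirCheck {

    // Counts .ttf and .otf files under dir, descending into subdirectories.
    // Returns 0 when dir is missing, unreadable, or not a directory.
    static int countFontFiles(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return 0;
        }
        int count = 0;
        for (File entry : entries) {
            if (entry.isDirectory()) {
                count += countFontFiles(entry);
            } else {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (name.endsWith(".ttf") || name.endsWith(".otf")) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed font locations; these are common on Linux but not guaranteed.
        String[] candidates = {
            "/usr/share/fonts",
            "/usr/local/share/fonts",
            System.getProperty("user.home") + "/.fonts"
        };
        for (String path : candidates) {
            System.out.println(path + ": " + countFontFiles(new File(path)) + " font file(s)");
        }
    }
}
```

If every directory reports zero font files on the build agent but not on the local machine, that would support the font explanation, although it would not prove it.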

I’ve attached the specific file that our unit test uses. Our bamboo build agents run on either Ubuntu 14.04.1 or 13.10.

Thank you!

Hi Adam,


Thanks for contacting support.

In my initial attempt, I tested the scenario with Aspose.Pdf for Java 9.5.0 in an Eclipse Juno project on Windows 7 (x64) with JDK 1.7, and I was unable to reproduce the issue. However, we are preparing the required environment (Bamboo on Ubuntu) and will keep you posted on our findings. We are sorry for the inconvenience.

I can confirm this issue. It happens when DOCX files are converted to PDF with Aspose.Words and then read with Aspose.Pdf.
It happens with almost every document I put in, so it may be related to font embedding.
The Aspose.Words saving code is nothing special:

// update document fields
document.updateFields();
document.updateListLabels();
document.updateTableLayout();
document.updatePageLayout();
document.updateWordCount();
FontSettings.setDefaultFontName("Droid Sans Fallback");
com.aspose.words.PdfSaveOptions opts = new com.aspose.words.PdfSaveOptions();
opts.setWarningCallback(new WordsWarningCallback("",""));
//opts.setEmbedFullFonts(true);
document.save(tempFile.getAbsolutePath(), opts);
Reading the text is nothing special either:
// set text extraction options - set text extraction mode (Raw or Pure)
com.aspose.pdf.TextExtractionOptions textExtOptions = new com.aspose.pdf.TextExtractionOptions(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);
com.aspose.pdf.devices.TextDevice txtDevice = new com.aspose.pdf.devices.TextDevice(textExtOptions);
txtDevice.setEncoding(Charset.forName("UTF-8"));
// extract the text of a particular page and write it to the stream
txtDevice.process(document.getPages().get_Item(convertPage), bos);
Exception is:
java.lang.NullPointerException
at com.aspose.pdf.internal.p435.z12.m2(Unknown Source)
at com.aspose.pdf.internal.p435.z12.m7(Unknown Source)
at com.aspose.pdf.internal.p435.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p435.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p435.z14.m6(Unknown Source)
at com.aspose.pdf.internal.p435.z14.<init>(Unknown Source)
at com.aspose.pdf.internal.p435.z14.<init>(Unknown Source)
at com.aspose.pdf.TextAbsorber.visit(Unknown Source)
at com.aspose.pdf.devices.TextDevice.processInternal(Unknown Source)
at com.aspose.pdf.devices.TextDevice.process(Unknown Source)
java.vm.name: Java HotSpot™ 64-Bit Server VM
java.vm.vendor: Oracle Corporation
java.vm.version: 24.60-b09
java.runtime.name: Java™ SE Runtime Environment
java.runtime.version:1.7.0_60-b19
os.name: Linux
os.arch: amd64
java.io.tmpdir: /tmp
file.encoding: UTF-8
sun.io.unicode.encoding:UnicodeLittle
sun.cpu.endian: little
Available processors (cores):8
Free memory (bytes):786217096
Maximum memory (bytes):8572502016
Total memory available to JVM (bytes):1056309248
wordsVersion: 14.9.0.0
cellsVersion: 8.2.1.0
slidesVersion: 14.7.0.0
pdfVersion: 9.5.0.0

Hi Adam,

Thanks for sharing the details.

I have logged the above-stated problem in our issue tracking system as PDFNEWJAVA-34523. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for your inconvenience.

I have the same problem as well!



Hi Simon,

Thanks for contacting support.

Can you please share the code snippet and the source PDF files so that we can replicate the problem in our environment? We are sorry for the inconvenience.

Hi, as I said, I am using the same method as in the other post:


for (int i = 1; i <= pages.size(); i++)
{
    Page currentPage = pages.get_Item(i);

    com.aspose.pdf.TextFragmentAbsorber textAbsorber = new com.aspose.pdf.TextFragmentAbsorber();

    // The NullPointerException is thrown from this call
    currentPage.accept(textAbsorber);
}


It happens with all PDF files.
Can we have an estimated date for a fix?

Simon


Hi Simon,

While testing with the following code snippet and one of my sample PDF files, I was unable to reproduce the problem. The customer who started this thread did have an issue while extracting text from a PDF file, but the problem depends on the structure and complexity of the input PDF file. So that we can test the scenario and investigate the cause of the problem you are facing, please share your source PDF files.

For your reference, I have attached my sample PDF document.

[Java]

//Create a new Aspose PDF Document object
com.aspose.pdf.Document document = new com.aspose.pdf.Document(new FileInputStream("c:/pdftest/UnderLineFormatting.pdf"));

//text file in which extracted text will be saved
java.io.OutputStream text_stream = new java.io.FileOutputStream("c:/pdftest/ExtractedText.txt", false);

//iterate through all the pages of PDF file
for(com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>)document.getPages()){
    //set text extraction options - set text extraction mode (Raw or Pure)
    com.aspose.pdf.TextExtractionOptions textExtOptions = new com.aspose.pdf.TextExtractionOptions(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);
    com.aspose.pdf.devices.TextDevice txtDevice = new com.aspose.pdf.devices.TextDevice(textExtOptions);
    txtDevice.setEncoding(java.nio.charset.Charset.forName("UTF-8"));
    txtDevice.setExtractionOptions(textExtOptions);

    //alternatively, extract the text of one particular page only
    //txtDevice.process(document.getPages().get_Item(23), "c:/pdftest/ExtractedText.txt");

    //get the text from pages of PDF and save it to OutputStream object
    txtDevice.process(page, text_stream);
}
//close stream object
text_stream.close();

Hi,


Here is one example file.

Can you confirm that you can reproduce it?

Thank you

Simon

Still receiving the same issue in our build environment with your attached test PDF file:


build	11-Nov-2014 16:26:40	java.lang.NullPointerException: null
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z12.m2(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z12.m7(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.m1(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.m1(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.m6(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.<init>(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.<init>(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.TextAbsorber.visit(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.Page.accept(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.PageCollection.accept(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at

Hi Simon,

Thanks for sharing the resource file.

I have tested the scenario using Aspose.Pdf for Java 9.5.2 in Eclipse Juno on Windows 7 (x64) with JDK 1.7 and was unable to reproduce the problem. In another attempt, I ran the following code snippet on Red Hat Enterprise Linux 5.6 with JDK 1.7.0_71 and did not encounter any problem there either. For your reference, I have also attached the resulting file with the contents extracted from the PDF document.

Can you please share some details about your working environment so that we can look into this matter further?

Java

//Create a new Aspose PDF Document object
com.aspose.pdf.Document document = new com.aspose.pdf.Document(new java.io.FileInputStream("c:/pdftest/form1.pdf"));

//Text file in which extracted text will be saved
java.io.OutputStream text_stream = new java.io.FileOutputStream("c:/pdftest/ExtractedText.txt", false);

//Iterate through all the pages of PDF file
for(com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>)document.getPages()){
    //Set text extraction options - set text extraction mode (Raw or Pure)
    com.aspose.pdf.TextExtractionOptions textExtOptions = new com.aspose.pdf.TextExtractionOptions(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);

    com.aspose.pdf.devices.TextDevice txtDevice = new com.aspose.pdf.devices.TextDevice(textExtOptions);
    txtDevice.setEncoding(java.nio.charset.Charset.forName("UTF-8"));
    txtDevice.setExtractionOptions(textExtOptions);

    //Alternatively, extract the text of one particular page only
    //txtDevice.process(document.getPages().get_Item(23), "c:/pdftest/ExtractedText.txt");

    //Get the text from pages of PDF and save it to OutputStream object
    txtDevice.process(page, text_stream);
}

//Close stream object
text_stream.close();

Hi Nayyer Shahbaz,


You will not be able to reproduce it on Windows. I was able to reproduce it only on Linux servers (Ubuntu 10.04 and 12.04 with jre1.7.0_45).

Why are you not using the same snippet as mine?
Could you try the same code?

for (int i = 1; i <= pages.size(); i++)
{
    Page currentPage = pages.get_Item(i);

    com.aspose.pdf.TextFragmentAbsorber textAbsorber = new com.aspose.pdf.TextFragmentAbsorber();

    currentPage.accept(textAbsorber);
}

Simon

Same problem with 9.5.2


java.lang.NullPointerException
at com.aspose.pdf.internal.p440.z12.m2(Unknown Source)
at com.aspose.pdf.internal.p440.z12.m7(Unknown Source)
at com.aspose.pdf.internal.p440.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p440.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p440.z14.m6(Unknown Source)
at com.aspose.pdf.internal.p440.z14.<init>(Unknown Source)
at com.aspose.pdf.internal.p440.z14.<init>(Unknown Source)
at com.aspose.pdf.TextFragmentAbsorber.visit(Unknown Source)
at com.aspose.pdf.Page.accept(Unknown Source)

Hi Simon,


Thanks for your feedback. We are checking the scenario over Ubuntu server and will update you soon.

Best Regards,

Hi Simon,

Thanks for your patience. I have tested the scenario with your sample code on Ubuntu 13.10 using Aspose.Pdf for Java 9.5.2 and was unable to replicate the issue. We have seen a similar issue in the past on non-Windows operating systems, caused by the font folder path setting. Please set the font folder path to your system font folder before starting any processing. For example, on my Ubuntu server I installed the Microsoft fonts and set the font path as follows. You need to set the font folder path according to your own machine. Hopefully this will resolve the issue.

Shell command to install the Microsoft core fonts on Ubuntu:

sudo apt-get install ttf-mscorefonts-installer

Java:

// Set font folder path
String path = "/usr/share/fonts/truetype/msttcorefonts/";

// Option 1: add a single font directory
// com.aspose.pdf.Document.addLocalFontPath(path);

// Option 2: extend the list of
// standard font directories
java.util.List list = com.aspose.pdf.Document.getLocalFontPaths();
list.add(path);

// Register the updated list
// before loading the document
com.aspose.pdf.Document.setLocalFontPaths(list);
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("form1.pdf");

PageCollection pages = pdfDocument.getPages();

for (int i = 1; i <= pages.size(); i++) {
   Page currentPage = pages.get_Item(i);

   com.aspose.pdf.TextFragmentAbsorber textAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
   currentPage.accept(textAbsorber);

   // Loop through the Text fragments
   for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textAbsorber.getTextFragments())
   {
      System.out.println("Text :- " + textFragment.getText());
   }
}

Please feel free to contact us for any further assistance.

Best Regards,

Hi,


Adding the local font path fixed the problem.

Thank you

Simon

Hi Simon,


Thanks for the acknowledgement.

We are glad to hear that your problem is resolved. Please continue using our APIs, and if you have any further queries, please feel free to contact us.

This has fixed our problem. Is there any reason why we need to specify these directories explicitly? Could Aspose handle this better by searching font directories recursively? Could Aspose provide a better error message for future issues with fonts? It took an awfully long time to realize that we just needed to point to the correct MS fonts directory path.
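
Until the library searches recursively on its own, the directory list can be built in application code. The following is a minimal sketch, assuming Java 8 or later. FontPathCollector and collectFontDirs are hypothetical names, and /usr/share/fonts is an assumed root. The resulting list is meant to be passed to com.aspose.pdf.Document.setLocalFontPaths, as in the reply above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;
import java.util.stream.Stream;

public class FontPathCollector {

    // Returns every directory under root (root included) that directly
    // contains at least one .ttf or .otf file, sorted alphabetically.
    static List<String> collectFontDirs(Path root) throws IOException {
        TreeSet<String> dirs = new TreeSet<>();
        if (!Files.isDirectory(root)) {
            return new ArrayList<>(dirs); // nothing to search
        }
        try (Stream<Path> walk = Files.walk(root)) {
            walk.filter(Files::isRegularFile)
                .filter(FontPathCollector::isFontFile)
                .forEach(p -> dirs.add(p.getParent().toString()));
        }
        return new ArrayList<>(dirs);
    }

    private static boolean isFontFile(Path p) {
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf");
    }

    public static void main(String[] args) throws IOException {
        // Assumed root; adjust to the font location on the target machine.
        List<String> fontDirs = collectFontDirs(Paths.get("/usr/share/fonts"));
        for (String dir : fontDirs) {
            System.out.println(dir);
        }
        // With Aspose.Pdf on the classpath, the list could then be registered
        // before any document is loaded, as shown earlier in this thread:
        // com.aspose.pdf.Document.setLocalFontPaths(fontDirs);
    }
}
```

The path in the reply above ends with a slash, while the paths collected here do not. Whether that matters to the library is not confirmed in this thread, so it is worth checking before relying on this approach.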


Thank you!

Hi Simon,


PDF files may contain content that uses fonts other than the standard TrueType fonts, and for the API to recognize the text properly, it needs to load the font glyphs. Your suggestion of showing a proper error message for missing fonts seems like a good idea, and I have logged this enhancement request in our issue tracking system as PDFNEWJAVA-34577. We will look into this requirement further and will keep you posted on its status. Please be patient and allow us a little time.
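
In the meantime, callers can attach a font hint to the failure themselves. This is a rough sketch of an application-side workaround, not of the library's behaviour. FontAwareExtraction and withFontHint are hypothetical names, and the lambda in main is a stand-in for the real extraction call.

```java
import java.util.concurrent.Callable;

public class FontAwareExtraction {

    // Runs the extraction and, if it fails with a NullPointerException,
    // rethrows it with a message that points at the font configuration.
    static <T> T withFontHint(Callable<T> extraction, String fontDirs) throws Exception {
        try {
            return extraction.call();
        } catch (NullPointerException e) {
            throw new IllegalStateException(
                "Text extraction failed with a NullPointerException. On Linux this may mean "
                + "that required fonts were not found; font directories in use: " + fontDirs, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real call, for example a page.accept(textAbsorber)
        // followed by textAbsorber.getText() as in the first post.
        String text = withFontHint(() -> "extracted text",
                "/usr/share/fonts/truetype/msttcorefonts/");
        System.out.println(text);
    }
}
```

The original exception is kept as the cause, so the obfuscated stack trace is still available for a support request.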

The issues you have found earlier (filed as PDFNEWJAVA-34577) have been fixed in Aspose.Pdf for Java 10.8.0.

