Hello,
Hi Adam,
I can confirm this issue. It happens when DOCX files are converted to PDF with Aspose.Words and the resulting PDFs are then read with Aspose.Pdf.
It happens with almost every document I put in, so it may be related to font embedding.
The Aspose.Words saving code is nothing special:
// update document fields
document.updateFields();
document.updateListLabels();
document.updateTableLayout();
document.updatePageLayout();
document.updateWordCount();
FontSettings.setDefaultFontName("Droid Sans Fallback");
com.aspose.words.PdfSaveOptions opts = new com.aspose.words.PdfSaveOptions();
opts.setWarningCallback(new WordsWarningCallback("",""));
//opts.setEmbedFullFonts(true);
document.save(tempFile.getAbsolutePath(), opts);
Reading the text back is nothing special either:
// set text extraction options - set text extraction mode (Raw or Pure)
com.aspose.pdf.TextExtractionOptions textExtOptions = new com.aspose.pdf.TextExtractionOptions(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);
com.aspose.pdf.devices.TextDevice txtDevice = new com.aspose.pdf.devices.TextDevice(textExtOptions);
txtDevice.setEncoding(Charset.forName("UTF-8"));
// extract the text of a particular page and write it to the output stream
txtDevice.process(document.getPages().get_Item(convertPage), bos);
Exception is:
java.lang.NullPointerException
at com.aspose.pdf.internal.p435.z12.m2(Unknown Source)
at com.aspose.pdf.internal.p435.z12.m7(Unknown Source)
at com.aspose.pdf.internal.p435.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p435.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p435.z14.m6(Unknown Source)
at com.aspose.pdf.internal.p435.z14.(Unknown Source)
at com.aspose.pdf.internal.p435.z14.(Unknown Source)
at com.aspose.pdf.TextAbsorber.visit(Unknown Source)
at com.aspose.pdf.devices.TextDevice.processInternal(Unknown Source)
at com.aspose.pdf.devices.TextDevice.process(Unknown Source)
java.vm.name: Java HotSpot™ 64-Bit Server VM
java.vm.vendor: Oracle Corporation
java.vm.version: 24.60-b09
java.runtime.name: Java™ SE Runtime Environment
java.runtime.version: 1.7.0_60-b19
os.name: Linux
os.arch: amd64
java.io.tmpdir: /tmp
file.encoding: UTF-8
sun.io.unicode.encoding: UnicodeLittle
sun.cpu.endian: little
Available processors (cores): 8
Free memory (bytes): 786217096
Maximum memory (bytes): 8572502016
Total memory available to JVM (bytes): 1056309248
wordsVersion: 14.9.0.0
cellsVersion: 8.2.1.0
slidesVersion: 14.7.0.0
pdfVersion: 9.5.0.0
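An environment summary like the one above can be gathered with standard JDK calls only. The following is a minimal sketch, not the code the poster used: the class name EnvironmentReport and the selection of keys are illustrative assumptions, and the Aspose product versions are left out because they would have to come from the libraries themselves.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EnvironmentReport {

    // System properties worth including in a support request (illustrative selection).
    private static final String[] KEYS = {
        "java.vm.name", "java.vm.vendor", "java.vm.version",
        "java.runtime.name", "java.runtime.version",
        "os.name", "os.arch", "java.io.tmpdir", "file.encoding"
    };

    /** Collects JVM and OS details in a stable, printable order. */
    static Map<String, String> collect() {
        Map<String, String> info = new LinkedHashMap<>();
        for (String key : KEYS) {
            // Some properties are vendor specific, so fall back to a marker value.
            info.put(key, System.getProperty(key, "(not set)"));
        }
        Runtime rt = Runtime.getRuntime();
        info.put("Available processors (cores)", String.valueOf(rt.availableProcessors()));
        info.put("Free memory (bytes)", String.valueOf(rt.freeMemory()));
        info.put("Maximum memory (bytes)", String.valueOf(rt.maxMemory()));
        info.put("Total memory available to JVM (bytes)", String.valueOf(rt.totalMemory()));
        return info;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : collect().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Running it on the failing build machine and on a working machine makes differences in OS, JDK and default encoding easy to compare.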
Hi Adam,
Thanks for sharing the details.
I have logged the problem described above in our issue tracking system as PDFNEWJAVA-34523. We will investigate this issue in detail and keep you updated on the status of a correction.
We apologize for the inconvenience.
I have the same problem as well!
Hi, as I said, it is the same method as in the other post:
for (int i = 1; i <= pages.size(); i++)
{
    Page currentPage = pages.get_Item(i);
    com.aspose.pdf.TextFragmentAbsorber textAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
    currentPage.accept(textAbsorber);
}
It does that for all PDF files. Can we have an estimated date for a fix?
Hi Simon,
When testing with the following code snippet and one of my sample PDF files, I was unable to reproduce the problem. The customer who started this thread did have an issue while extracting text from a PDF file, but the problem depends on the structure and complexity of the input PDF file. So that we can test the scenario and investigate the cause of the problem you are facing, please share your source PDF files.
For your reference, I have attached my sample PDF document.
[Java]
//Create a new Aspose PDF Document object
com.aspose.pdf.Document document = new com.aspose.pdf.Document(new java.io.FileInputStream("c:/pdftest/UnderLineFormatting.pdf"));
//text file in which extracted text will be saved
java.io.OutputStream text_stream = new java.io.FileOutputStream("c:/pdftest/ExtractedText.txt", false);
//iterate through all the pages of PDF file
for(com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>)document.getPages()){
//set text extraction options - set text extraction mode (Raw or Pure)
com.aspose.pdf.TextExtractionOptions textExtOptions = new com.aspose.pdf.TextExtractionOptions(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);
com.aspose.pdf.devices.TextDevice txtDevice = new com.aspose.pdf.devices.TextDevice(textExtOptions);
txtDevice.setEncoding(java.nio.charset.Charset.forName("UTF-8"));
txtDevice.setExtractionOptions(textExtOptions);
//alternative: extract the text of one particular page directly to a file
//txtDevice.process(document.getPages().get_Item(23), "c:/pdftest/ExtractedText.txt");
//get the text from pages of PDF and save it to OutputStream object
txtDevice.process(page, text_stream);
}
//close stream object
text_stream.close();
Hi,
Still receiving the same issue in our build environment with your attached test PDF file:
build 11-Nov-2014 16:26:40 java.lang.NullPointerException: null
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z12.m2(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z12.m7(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.m1(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.m1(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.m6(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.internal.p440.z14.(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.TextAbsorber.visit(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.Page.accept(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at com.aspose.pdf.PageCollection.accept(Unknown Source) ~[aspose-pdf-9.5.2-jdk16.jar:9.5.2]
build 11-Nov-2014 16:26:40 at
Hi Simon,
Thanks for sharing the resource file.
I have tested the scenario using Aspose.Pdf for Java 9.5.2 in Eclipse Juno running on Windows 7 (x64) with JDK 1.7, and I was unable to reproduce the problem. In another attempt, I used the following code snippet on Red Hat Enterprise Linux 5.6 with JDK 1.7.0_71 and did not encounter any problem either. For your reference, I have also attached the resultant file with the contents extracted from the PDF document.
Can you please share some details regarding your working environment so that we can look into this matter further?
Java
//Create a new Aspose PDF Document object
com.aspose.pdf.Document document = new com.aspose.pdf.Document(new java.io.FileInputStream("c:/pdftest/form1.pdf"));
//Text file in which extracted text will be saved
java.io.OutputStream text_stream = new java.io.FileOutputStream("c:/pdftest/ExtractedText.txt", false);
//Iterate through all the pages of PDF file
for(com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>)document.getPages()){
//Set text extraction options - set text extraction mode (Raw or Pure)
com.aspose.pdf.TextExtractionOptions textExtOptions = new com.aspose.pdf.TextExtractionOptions(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);
com.aspose.pdf.devices.TextDevice txtDevice = new com.aspose.pdf.devices.TextDevice(textExtOptions);
txtDevice.setEncoding(java.nio.charset.Charset.forName("UTF-8"));
txtDevice.setExtractionOptions(textExtOptions);
//Alternative: extract the text of one particular page directly to a file
//txtDevice.process(document.getPages().get_Item(23), "c:/pdftest/ExtractedText.txt");
//Get the text from pages of PDF and save it to OutputStream object
txtDevice.process(page, text_stream);
}
//Close stream object
text_stream.close();
Hi Nayyer Shahbaz,
Same problem with 9.5.2
Hi Simon,
Thanks for your patience. I have tested the scenario with your sample code on Ubuntu 13.10 using Aspose.Pdf for Java 9.5.2 and was unable to replicate the issue. We found a similar issue in the past on non-Windows operating systems that was caused by the font folder path setting. Please set the font folder path to your system font folder before starting any processing. For example, on my Ubuntu server I have installed the Microsoft fonts and set the font path as follows. You need to set the font folder path according to your machine's settings. Hopefully this will resolve the issue.
sudo apt-get install ttf-mscorefonts-installer
// Set font folder path
String path = "/usr/share/fonts/truetype/msttcorefonts/";
// Alternative: add a single font directory
// com.aspose.pdf.Document.addLocalFontPath(path);
// Get the current list of standard font directories and append the new path
java.util.List list = com.aspose.pdf.Document.getLocalFontPaths();
list.add(path);
// Register the updated list of font directories
com.aspose.pdf.Document.setLocalFontPaths(list);
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("form1.pdf");
com.aspose.pdf.PageCollection pages = pdfDocument.getPages();
// Extract the text fragments page by page (page indexes start at 1)
for (int i = 1; i <= pages.size(); i++) {
    com.aspose.pdf.Page currentPage = pages.get_Item(i);
    com.aspose.pdf.TextFragmentAbsorber textAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
    currentPage.accept(textAbsorber);
    // Loop through the text fragments
    for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textAbsorber.getTextFragments())
    {
        System.out.println("Text :- " + textFragment.getText());
    }
}
Please feel free to contact us for any further assistance.
Best Regards,
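Because a wrong font folder surfaced in this thread only as a NullPointerException deep inside the library, it may help to validate the configured folder before any PDF processing starts. The following is a minimal sketch using only the JDK; the class FontPathCheck and the method describeProblem are hypothetical names, and treating a folder without any .ttf file as an error is an assumption that may not suit every setup.

```java
import java.io.File;
import java.util.Locale;

class FontPathCheck {

    /**
     * Returns null when the directory looks usable as a font folder,
     * otherwise a human-readable description of what is wrong.
     */
    static String describeProblem(File dir) {
        if (!dir.exists()) {
            return "Font directory does not exist: " + dir.getPath();
        }
        if (!dir.isDirectory()) {
            return "Font path is not a directory: " + dir.getPath();
        }
        String[] names = dir.list();
        if (names == null) {
            return "Font directory cannot be read: " + dir.getPath();
        }
        for (String name : names) {
            if (name.toLowerCase(Locale.ROOT).endsWith(".ttf")) {
                return null; // at least one TrueType font is present
            }
        }
        return "Font directory contains no .ttf files: " + dir.getPath();
    }

    public static void main(String[] args) {
        // Default path taken from the reply above; pass another path as the first argument.
        File dir = new File(args.length > 0 ? args[0] : "/usr/share/fonts/truetype/msttcorefonts/");
        String problem = describeProblem(dir);
        if (problem != null) {
            System.err.println(problem);
            System.exit(1);
        }
        System.out.println("Font directory looks usable: " + dir.getPath());
    }
}
```

Calling such a check once at application start-up turns a missing font package into an explicit message instead of a failure during text extraction.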
Hi,
Hi Simon,
This has fixed our problem. Is there any reason why we need to specify these directories explicitly? Could Aspose handle this better by searching font directories recursively? Could Aspose provide a better error message for future font issues? It took an awfully long time to realize that we just needed to point to the correct MS fonts directory path.
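Until the library searches recursively on its own, the recursive lookup asked about here can be done on the application side. The following is a minimal sketch with plain JDK calls (Java 8 or later); FontDirectoryScanner, isFontFile and findFontDirectories are hypothetical names, the default root /usr/share/fonts is an assumption about the Linux distribution, and the hand-off to Aspose is only indicated in a comment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;
import java.util.stream.Stream;

class FontDirectoryScanner {

    /** True for file names that end in a common font file extension. */
    static boolean isFontFile(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".ttf") || lower.endsWith(".otf") || lower.endsWith(".ttc");
    }

    /**
     * Walks the tree below root and returns, in sorted order, every directory
     * that directly contains at least one font file. Returns an empty list
     * when root is not an existing directory.
     */
    static List<String> findFontDirectories(Path root) {
        Path base = root.toAbsolutePath();
        if (!Files.isDirectory(base)) {
            return new ArrayList<>();
        }
        TreeSet<String> dirs = new TreeSet<>();
        try (Stream<Path> paths = Files.walk(base)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> isFontFile(p.getFileName().toString()))
                 .forEach(p -> dirs.add(p.getParent().toString()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(dirs);
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : "/usr/share/fonts");
        List<String> fontDirs = findFontDirectories(root);
        if (fontDirs.isEmpty()) {
            System.err.println("No font files found under " + root);
        }
        for (String dir : fontDirs) {
            System.out.println(dir);
        }
        // The resulting list could then be appended to the list passed to
        // com.aspose.pdf.Document.setLocalFontPaths(...), as in the reply above.
    }
}
```

Printing a warning when nothing is found also gives the clearer error message requested above, at least on the application side.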
Hi Simon,
The issues you have found earlier (filed as PDFNEWJAVA-34577) have been fixed in Aspose.Pdf for Java 10.8.0.
This message was posted using Notification2Forum from Downloads module by Aspose Notifier.