PdfContentEditor.replaceText (and PdfExtractor.extractText)

jephillips34 · August 24, 2009, 10:37am

I’ve been having problems trying to get PdfContentEditor.replaceText() to work. I can get it to work ONCE successfully, but that seems to corrupt something that prevents it or other calls (like PdfExtractor.extractText) from working.

Basically, I want to extract all text from a PDF, do some parsing, and then selectively replace parts of the text with replacement values.

I’ve got the extract part working reliably, at least until I try using replaceText. After that, it generates NullPointerExceptions:

Caused by: java.lang.NullPointerException
at com.aspose.pdf.kit.g8.a(Unknown Source)
at com.aspose.pdf.kit.g8.int(Unknown Source)
at com.aspose.pdf.kit.t.V(Unknown Source)
at com.aspose.pdf.kit.h6.a(Unknown Source)
at com.aspose.pdf.kit.hf.a(Unknown Source)
at com.aspose.pdf.kit.PdfExtractor.extractText(Unknown Source)

For PdfContentEditor.replaceText(), I can easily get it to replace text ONCE and save the PDF. However, I cannot figure out any way to successfully call the replaceText() method a second time for the same PDF. I’ve tried a variety of workarounds, using various combinations of byte array and/or file input/output streams, and binding directly to the file path, all of which fail with slightly different stack traces.

Here is some very basic test code to illustrate my problem. I’m using the Aspose.Pdf.Kit jar version 2.3.0. We have an Aspose.Total Java license, but it works the same in evaluation mode. We’re using Java 1.5.

This test uses the pdftemplate.pdf file from the examples/resources directory in Aspose.Pdf.Kit 2.3.0:

File inputFile = new File("C:\\temp\\pdftemplate.pdf");
File outputFile = new File("C:\\temp\\pdftemplate-out.pdf");

PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(inputFile.getPath());
extractor.extractText();

// output to ByteArray
ByteArrayOutputStream baosPdfText = new ByteArrayOutputStream();
extractor.getText(baosPdfText);
String pdfTextString = baosPdfText.toString();
baosPdfText.close();

// this always works, first time around
System.out.println("1st extract text: " + pdfTextString);

// If you comment out these 5 lines, extractor2 (below) will work
PdfContentEditor editor = new PdfContentEditor();
editor.bindPdf(inputFile.getPath());
editor.replaceText("", "My Image Name");
// a second call to replaceText throws NullPointerException at pdf.kit.g8.a()
// editor.replaceText("", "My Image2 Name");

editor.save(outputFile.getPath());

// Extract text again

PdfExtractor extractor2 = new PdfExtractor();
extractor2.bindPdf(inputFile.getPath());

// extractText() fails if replaceText() method has been called
extractor2.extractText();
ByteArrayOutputStream baosPdfText2 = new ByteArrayOutputStream();
extractor2.getText(baosPdfText2);
String pdfTextString2 = baosPdfText2.toString();
baosPdfText2.close();
System.out.println("2nd extract text: " + pdfTextString2);

I had suspected that there was some problem with files not getting properly closed, and so refactored to use byte array input and output streams, but some of the stack traces seemed to indicate that internally, PdfKit is trying to use a java.io.RandomAccessFile to do its work. I’ve tried many combinations of inputs and outputs, but nothing seems to get around this bug. What I’m REALLY confused about is why the call to replaceText would corrupt any subsequent calls to extractText().

Are there any plans for bug-fixes in the works? I’ve looked through many threads here in the forum, and it seems that PdfKit is one of the lower priority items in the development queue. This has been frustrating, because it seems to ALMOST work, ALMOST do exactly what we need, but not quite.

John Phillips

Developer, Direxxis, Inc.

shahzadlatif · August 24, 2009, 11:37am

Hi John,

Thank you very much for considering Aspose.

First of all, I’m very sorry for the inconvenience caused by this issue. Can you please tell us whether you’re having this problem with a particular file or with any PDF file? Also, please share one of the problematic PDF files so we can test and resolve the issue.

Secondly, our development team is working on the issues with Aspose.Pdf.Kit, and we’ll try to resolve this one as soon as possible.

We appreciate your patience and cooperation.
Regards,

jephillips34 · August 24, 2009, 11:47am

As I said in my first post, the sample code I posted uses the pdftemplate.pdf included in the Aspose.Pdf.Kit 2.3.0 examples/resources directory. I’ll attach the file here, but it’s just the same file used by your sample code.

jephillips34 · August 24, 2009, 12:16pm

I've been looking at the stack traces trying to get a better idea of what, exactly, is going wrong, and I've got an intuition about what may be going wrong here.

It looks likely to me that the Aspose.Pdf.Kit code is, at least partially, based on the Apache incubator project PDFBox. If that's the case, then I think there may be some code that's intending to import the class org.apache.pdfbox.io.RandomAccessFile, but instead is accessing java.io.RandomAccessFile.

That would explain some of my stack traces:

java.io.IOException: The handle is invalid
at java.io.RandomAccessFile.seek(Native Method)
at com.aspose.pdf.kit.ht.read(Unknown Source)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
at java.io.FilterInputStream.read(FilterInputStream.java:66)
at java.io.PushbackInputStream.read(PushbackInputStream.java:120)
at com.aspose.pdf.kit.n8.else(Unknown Source)
at com.aspose.pdf.kit.hs.f(Unknown Source)
at com.aspose.pdf.kit.hs.c(Unknown Source)
at com.aspose.pdf.kit.f4.if(Unknown Source)
at com.aspose.pdf.kit.PdfContentEditor.replaceText(Unknown Source)

shahzadlatif · August 26, 2009, 4:46am

Hi John,

This issue is logged as PDFKITJAVA-10188 in our issue tracking system. Our development team is looking into the matter and you’ll be updated via this forum as the issue is resolved.

We’re sorry for the inconvenience.
Regards,

bendavid · August 30, 2009, 5:01am

Hi,

I seem to have a similar issue with the Java 2.3 version.

I get an exception in the following code:

PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(inputFileName);
extractor.extractText();
ByteArrayOutputStream baos=new ByteArrayOutputStream();
extractor.getText(baos);

java.lang.NullPointerException
at com.aspose.pdf.kit.g8.a(Unknown Source)
at com.aspose.pdf.kit.g8.int(Unknown Source)
at com.aspose.pdf.kit.t.V(Unknown Source)
at com.aspose.pdf.kit.h6.a(Unknown Source)
at com.aspose.pdf.kit.hf.a(Unknown Source)
at com.aspose.pdf.kit.PdfExtractor.extractText(Unknown Source)

Unfortunately, I cannot share the offending PDF file with you.

Do you have enough information to reproduce the bug?

Thanks,

Shay

shahzadlatif · September 1, 2009, 12:47pm

Hi Shay,

This looks similar to the issue logged above. We have reproduced it with another file. You’ll be updated via this forum once the issue is resolved.

However, it would be better if you could share the PDF you’re having the problem with, so we can make sure at our end that the fix works for you as well.

We’re sorry for the inconvenience.
Regards,

jephillips34 · September 3, 2009, 11:50am

I'm happy to find with the 2.4.0 version of pdf.kit that the PdfExtractor.extractText() method is now working reliably, regardless of any calls to PdfContentEditor.replaceText(). As it turns out, that was the biggest hurdle for what I'm trying to accomplish.

PdfContentEditor.replaceText() still logs this stack trace if you try to call replaceText() more than once on an instance of a PdfContentEditor:

java.io.IOException: The handle is invalid
at java.io.RandomAccessFile.seek(Native Method)
at com.aspose.pdf.kit.hx.read(Unknown Source)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
at java.io.FilterInputStream.read(FilterInputStream.java:66)
at java.io.PushbackInputStream.read(PushbackInputStream.java:120)
at com.aspose.pdf.kit.od.else(Unknown Source)
at com.aspose.pdf.kit.hw.f(Unknown Source)
at com.aspose.pdf.kit.hw.c(Unknown Source)
at com.aspose.pdf.kit.f8.if(Unknown Source)
at com.aspose.pdf.kit.PdfContentEditor.replaceText(Unknown Source)

The obfuscated class names are different, but it's basically the same stack trace as the previous version. What I didn't realize before is that replaceText() is logging that stack trace and then eating the exception. Now that PdfExtractor is fixed, there is no negative effect; the second and subsequent calls to replaceText simply do nothing.

However, I can successfully save the output, then re-bind the editor to the output, and then call replaceText() again successfully. I've tried this successfully with both file paths and byte array input/output streams. If I have multiple replaceText() calls to make, I think there's probably a lot less disk I/O using byte array streams. I'm still not sure what's using the java.io.RandomAccessFile in the stack trace, but it doesn't seem to be related to what type of object I'm binding to.
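The save-and-rebind workaround described above can be sketched as a loop that keeps the document in memory as a byte array. This is only a sketch under stated assumptions: the `Editor` interface and the `OneShotEditor` stand-in are hypothetical names invented here so the example runs without the Aspose jar. They mirror the bindPdf / replaceText / save calls used in this thread, treat the "PDF" as plain text, and imitate the behavior reported for 2.4.0, where only the first replaceText() per bind has any effect.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class RebindSketch {
    // One bind / replace / save cycle per replacement, all in memory.
    static byte[] replaceAll(byte[] pdf, Map<String, String> replacements) throws IOException {
        byte[] current = pdf;
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            // Real code would create a PdfContentEditor here (assumption).
            Editor editor = new OneShotEditor();
            editor.bindPdf(new ByteArrayInputStream(current));
            editor.replaceText(e.getKey(), e.getValue());
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            editor.save(out);
            current = out.toByteArray(); // output of this pass feeds the next bind
        }
        return current;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> r = new LinkedHashMap<String, String>();
        r.put("[[Tag1]]", "Replacement1");
        r.put("[[Tag2]]", "Replacement2");
        byte[] in = "[[Tag1]] and [[Tag2]]".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(replaceAll(in, r), StandardCharsets.UTF_8));
    }
}

// Hypothetical interface mirroring the three calls used in this thread.
interface Editor {
    void bindPdf(InputStream in) throws IOException;
    void replaceText(String find, String replacement);
    void save(OutputStream out) throws IOException;
}

// Stand-in that treats the input as plain text and silently ignores
// every replaceText() call after the first one since the last bind.
class OneShotEditor implements Editor {
    private String content = "";
    private boolean used = false;

    public void bindPdf(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        content = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        used = false;
    }

    public void replaceText(String find, String replacement) {
        if (used) {
            return; // second and later calls do nothing
        }
        content = content.replace(find, replacement);
        used = true;
    }

    public void save(OutputStream out) throws IOException {
        out.write(content.getBytes(StandardCharsets.UTF_8));
    }
}
```

Creating a fresh editor per pass keeps the sketch simple; the thread only confirms that re-binding after a save works, not whether reusing one instance or creating a new one is preferable.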

Thanks for version 2.4.0 !

shahzadlatif · September 7, 2009, 5:33am

Hi John,

I have updated our issue tracking system, and we’ll look into the cause of this exception as well while we resolve the above-mentioned issue.

If you find any other issues or have any questions, please do let us know.
Regards,

aspose.notifier · September 30, 2009, 9:49am

The issues you have found earlier (filed as 10188) have been fixed in this update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.

jephillips34 · September 30, 2009, 3:32pm

Thank you for the fixes in version 2.4.1.0. That’s working much better now.

I am still running into an issue replacing text, and I think I can help you narrow down the bug. Occasionally, a replaceText() method call seems to do nothing; if I look at the System.out console, I can see this stack trace logged, though it seems to be caught and swallowed by the replaceText() method:

java.lang.StringIndexOutOfBoundsException: String index out of range: -4
at java.lang.String.substring(String.java:1768)
at java.lang.String.substring(String.java:1735)
at com.aspose.pdf.kit.f9.if(Unknown Source)
at com.aspose.pdf.kit.PdfContentEditor.replaceText(Unknown Source)

I can step through the code with the eclipse debugger (I’m using the JD-Eclipse Plugin, a decompiler), and I think I’ve found the offending line of code:

in the obfuscated class com.aspose.pdf.kit.f9, in the method if(InputStream, OutputStream, String, String), somewhere around line 393:

else if (strToFind.length() - replacelength <= tempBulk.length())
replaceStr = destStr + tempBulk.substring(strToFind.length() - replacedlength);

I’m fairly certain that the replacedlength variable referenced in the substring() parameter should actually be replacelength (WITHOUT the d), and that’s what’s causing the seemingly random failures.
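To show how a guard on one variable and a substring() on another could produce a negative index like the -4 in the trace, here is a small standalone sketch. All values are made up for illustration; only the variable names come from the decompiled snippet. It demonstrates the failure mode and does not establish which variable the library authors intended.

```java
class SubstringIndexSketch {
    // Same shape as the decompiled expression: substring(strToFind.length() - offset).
    static String tail(String tempBulk, int strToFindLength, int offset) {
        return tempBulk.substring(strToFindLength - offset);
    }

    public static void main(String[] args) {
        String tempBulk = "[[Tag2]]";   // made-up chunk contents
        int strToFindLength = 8;        // made-up values from here on
        int replacelength = 5;          // checked by the guard: 8 - 5 <= 8
        int replacedlength = 12;        // used by substring: 8 - 12 = -4

        System.out.println("guard passes: "
                + (strToFindLength - replacelength <= tempBulk.length()));
        System.out.println("with replacelength: "
                + tail(tempBulk, strToFindLength, replacelength));
        try {
            tail(tempBulk, strToFindLength, replacedlength);
        } catch (StringIndexOutOfBoundsException e) {
            // The guard never looked at replacedlength, so nothing prevented this.
            System.out.println("with replacedlength: " + e.getClass().getSimpleName());
        }
    }
}
```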

Looking forward to a quick fix.

John Phillips

shahzadlatif · October 2, 2009, 1:18am

Hi John,

I have contacted our development team for the details regarding this issue. You’ll be updated as I receive some response.

We’re sorry for the inconvenience.
Regards,

shahzadlatif · October 2, 2009, 7:36am

Hi John,

Please share the PDF file and the code snippet you’re having the problem with. We need to test the issue at our end. We’ll update you via this forum after further investigation.

We’re sorry for the inconvenience.
Regards,

jephillips34 · October 29, 2009, 11:07am

Sorry for taking so long to respond. My boss did not want me to upload the specific document I was having troubles with (it contained confidential client information), so I tried to set up a sample PDF that would exhibit the exact same problem (StringIndexOutOfBoundsException), which I couldn’t do. However, I think I’ve got a much more useful sample for you.

I created the stringReplaceBug2.pdf using Word: first I entered all the text, and then I selected the last 3 characters of the 2nd line, changed to Bold, then changed back to normal again. Then I saved as PDF. This made the resulting PDF split the second line into two separate chunks, which is a similar problem to my original document.

Though this doesn’t produce the StringIndexOutOfBoundsException, the result of replaceText() on that 2nd line is clearly wrong (see stringReplaceBug2-out-bad.pdf). Stepping through the code, I was able to identify the code block responsible, and create an expression that fixes the problem code (see stringReplaceBug2-out-good.pdf). More on that below.

Here’s the sample code used to generate the output files:

public void testPdfKitReplaceTextBug() throws Exception {
    AsposeLicenseManager.setAsposePdfkitLicense();

    File inputFile = new File("C:\\direxxis\\work\\2009-10-29\\stringReplaceBug2.pdf");
    File outputFile = new File("C:\\direxxis\\work\\2009-10-29\\stringReplaceBug2-out.pdf");
    ByteArrayInputStream bais = null;
    ByteArrayOutputStream baos = null;

    PdfContentEditor editor = new PdfContentEditor();
    byte[] baPdfFile = FileUtils.readFileToByteArray(inputFile);
    bais = new ByteArrayInputStream(baPdfFile);
    editor.bindPdf(bais);
    editor.replaceText("[[Tag1]]", "Replacement1");
    editor.replaceText("[[Tag2]]", "Replacement2");
    editor.replaceText("[[Tag3]]", "Replacement3");
    baos = new ByteArrayOutputStream();
    editor.save(baos);
    baPdfFile = baos.toByteArray();
    bais.close();
    baos.close();

    FileUtils.writeByteArrayToFile(outputFile, baPdfFile);
}

Here’s the problem code from the obfuscated f9 class, somewhere around lines 384-399:

String replaceStr = "";
if (destStr.length() >= tempBulk.length()) {
    replaceStr = destStr.substring(0, tempBulk.length());
    if ((strToFind.length() < message.length()) && (strToFind.length() - replacedlength <= tempBulk.length()))
        replaceStr = destStr;
}
else if (strToFind.length() - replacelength <= tempBulk.length()) {
    replaceStr = destStr + tempBulk.substring(strToFind.length() - replacedlength);
}
else {
    replaceStr = destStr;
}
q.a(previous, replaceIndex, replacelength, replaceStr);

Here’s the replacement expression I used to produce the good output:

String tempStr = "";
if (destStr.length() >= replacelength) {
    tempStr = destStr.substring(0, replacelength);
    if ((strToFind.length() < message.length()) && (strToFind.length() - replacedlength <= replacelength))
        tempStr = destStr;
}
else if (strToFind.length() - replacedlength <= replacelength)
    tempStr = destStr + tempBulk.substring(strToFind.length() - replacedlength);
else
    tempStr = destStr;
return tempStr;

I put a breakpoint on the q.a() method, and used the debug tools in Eclipse to evaluate my replacement expression (using the state of the calling f9 method) to recalculate the 4th parameter (replaceStr), which then produced the correct results (well, kind of correct; the chunks don’t align with the replacement text).

Edit 10/30/09: I’ve color-coded my code changes above.

Thanks,

John Phillips

shahzadlatif · October 30, 2009, 5:35am

Hi John,

Thank you very much for sharing the details. We’re looking into the matter and you’ll be updated as soon as possible. Moreover, can you please share the original Word document and the details of how you converted it to PDF? That might be of some help while working on this issue.

We’re sorry for the inconvenience.
Regards,

jephillips34 · October 30, 2009, 9:42am

Shahzad:

I’m attaching the Word docx file I used to generate my test PDF. I converted it to PDF using Microsoft Word 2007, simply saving as PDF (the save dialog says ‘Publish as PDF or XPS’), using whatever default settings were selected (probably Optimize for: Minimum size (publishing online)). If you open the Word file with the aspose.words document explorer, you’ll see that the second line is broken into two runs, which correspond to the separate chunks (or whatever you call them) of text in the generated PDF.

In the ideal world, there would be a way to consolidate the separate chunks of text in a PDF, much as aspose.words can consolidate runs using the Document.joinRunsWithSameFormatting() method. For now, though, I’ll settle for getting this particular block of code corrected.
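As a rough illustration of what such consolidation would buy, here is a sketch that merges adjacent chunks with the same formatting before searching for a tag. The `Chunk` class, its `format` key, and the sample text are all hypothetical; nothing here reflects how Aspose.Pdf.Kit or a PDF content stream actually represents text.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkMergeSketch {
    // Hypothetical model of a text chunk: its text plus a formatting key.
    static final class Chunk {
        final String text;
        final String format;

        Chunk(String text, String format) {
            this.text = text;
            this.format = format;
        }
    }

    // Merge neighboring chunks that share the same formatting key.
    static List<Chunk> merge(List<Chunk> chunks) {
        List<Chunk> out = new ArrayList<Chunk>();
        for (Chunk c : chunks) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).format.equals(c.format)) {
                out.set(last, new Chunk(out.get(last).text + c.text, c.format));
            } else {
                out.add(c);
            }
        }
        return out;
    }

    // True only if some single chunk contains the whole tag.
    static boolean anyChunkContains(List<Chunk> chunks, String tag) {
        for (Chunk c : chunks) {
            if (c.text.contains(tag)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A tag split across two same-format chunks, similar to the bold toggle case.
        List<Chunk> chunks = new ArrayList<Chunk>();
        chunks.add(new Chunk("Second line [[Ta", "normal"));
        chunks.add(new Chunk("g2]]", "normal"));

        System.out.println("before merge: " + anyChunkContains(chunks, "[[Tag2]]"));
        System.out.println("after merge:  " + anyChunkContains(merge(chunks), "[[Tag2]]"));
    }
}
```

A per-chunk search misses the split tag, while the merged list finds it. Chunks with different formatting would stay separate, so a tag that genuinely spans a formatting change would still need other handling.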

Edit: I’ve color coded my code changes in my previous post.

John Phillips

shahzadlatif · October 31, 2009, 11:37am

Hi John,

We have logged this issue as PDFKITJAVA-11402 in our issue tracking system. Our team is looking into the matter. You’ll be updated via this forum as the issue is resolved.

We appreciate your patience.
Regards,

aspose.notifier · December 7, 2009, 1:55pm

The issues you have found earlier (filed as 11402) have been fixed in this update.
