Finding text in PDF using TextFragmentAbsorber

JanezZ · September 24, 2019, 2:25pm

Hello,

we are trying to find text in a PDF that is delimited by two character sequences. The example code for regex-based search using TextFragmentAbsorber worked well when the PDF was generated with one tool, but when the PDF is generated with a different tool, Aspose is unable to find any text in the document.

Here is a snippet of the code we’re using and an example document. This code always prints out 0 matching text fragments.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("::1::.*::1::", new TextSearchOptions(true));
textFragmentAbsorber.getTextSearchOptions().setLogTextExtractionErrors(true);

pdfDocument.getPages().accept(textFragmentAbsorber);
	
System.out.println("Extracted text: " + textFragmentAbsorber.getText());

if (textFragmentAbsorber.hasErrors()) {
	    for (TextExtractionError error : textFragmentAbsorber.getErrors()) {
	      // TextExtractionError object contains information about the
	      // text extraction error found during processing concrete 
	      // text fragment.
	      System.out.println(error);
	      System.out.println(String.format("Extracted text: '%s'",
	      error.getExtractedText()));
	    }
}

System.out.println(String.format("Found %s matching text fragments for text %s",
    textFragmentAbsorber.getTextFragments().size(), valueToSearch));

test_document.pdf (45.1 KB)

Can you please have a look at the code and the attached document and tell us what could be wrong?

We’re using aspose-pdf for Java 18.9, but we have also tried versions 18.6 and 19.8. None of them worked. We’re running the program on JVM 1.8.

Thanks
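
For reference, here is a minimal sketch of how a delimiter pattern such as ::1::.*::1:: behaves in plain java.util.regex. It is an illustration only: it assumes that the regex mode enabled by TextSearchOptions(true) treats greedy and reluctant quantifiers the same way java.util.regex does, which nothing in this thread confirms. The class name, the helper method and the sample line are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimitedFieldRegexDemo {

    // Returns every substring of text matched by regex, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical line of text with two fields that share the ::1:: delimiter.
        String line = "Name: ::1::Alice::1:: and again ::1::Bob::1::";

        // Greedy .* : a single match running from the first delimiter to the last one.
        System.out.println(findAll("::1::.*::1::", line));

        // Reluctant .*? : one match per delimited field.
        System.out.println(findAll("::1::.*?::1::", line));
    }
}
```

If the same delimiter pair can occur more than once in a run of text, the reluctant form is the one that yields one match per field under these assumptions.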

Farhan.Raza · September 24, 2019, 9:43pm

@JanezZ

Thank you for contacting support.

Would you please also share the corresponding document that works fine, i.e. the one in which the text is found? That way we can compare the files generated by the two tools and investigate further.

JanezZ · October 1, 2019, 12:26pm

Hello and thanks for your response. Sorry it took a while to get back to you.

Please find attached a working document:

test_document_working.pdf (12.0 KB)

This works fine with both Aspose.PDF for Java 18.6 and 19.9.

asad.ali · October 1, 2019, 10:09pm

@JanezZ

We have tested the scenario with Aspose.PDF for Java 19.9 and both of your PDF documents, and we were unable to reproduce the issue. The API extracted the same text from both PDFs. A screenshot of the console output is attached for your reference. ExtractedText.png (1.8 KB)

Would you kindly try again using Aspose.PDF for Java 19.9? If you still face any issue, please share your complete environment details, i.e. OS name and version, application type, etc. We will test the scenario again in our environment and address it accordingly.

JanezZ · October 2, 2019, 11:18am

Hello and thanks for your time.

Upon closer inspection, we found a problem in our code for reading the file contents, so Aspose is now finding the text correctly. However, we still have one problem.

We are calling pdfDocument.getPages().accept(textFragmentAbsorber) for each piece of delimited text we’re trying to find. It works fine the first time we call accept(), and it finds the text correctly.

The next time we call accept(textFragmentAbsorber) with a different phrase (or a different instance of TextFragmentAbsorber), we get a NullPointerException in one of the underlying Aspose methods:

at com.aspose.pdf.Operator.lI(Unknown Source)
at com.aspose.pdf.operators.ShowText.toString(Unknown Source)
at com.aspose.pdf.l16h.lI(Unknown Source)
at com.aspose.pdf.l16h.lI(Unknown Source)
at com.aspose.pdf.l16h.lI(Unknown Source)
at com.aspose.pdf.InternalHelper.lI(Unknown Source)
at com.aspose.pdf.internal.l5l.l0t.lI(Unknown Source)
at com.aspose.pdf.internal.l5l.l0t.le(Unknown Source)
at com.aspose.pdf.internal.l5l.l0t.<init>(Unknown Source)
at com.aspose.pdf.internal.l5l.l0t.<init>(Unknown Source)
at com.aspose.pdf.TextFragmentAbsorber.visit(Unknown Source)
at com.aspose.pdf.Page.accept(Unknown Source)
at com.aspose.pdf.PageCollection.accept(Unknown Source)

Can you please confirm whether searching for multiple phrases in the same document is supported in any way? We tried both reusing the textFragmentAbsorber instance and creating a new one each time, with the same result.

EDIT: forgot to mention, version is Aspose.PDF for Java 19.9

EDIT2: Just ran a test case. It seems that calling accept() multiple times is not a problem by itself; however, we are also calling:

fragment.setText(text)

on the found fragment, and this is causing the NPE. Maybe we should rewrite the program so we first find all the text fragments we’re interested in and then call setText()…

Thanks!

asad.ali · October 2, 2019, 8:16pm

@JanezZ

Every time you set the text of a found TextFragment, the fragment changes and no longer matches what is stored in the TextFragmentCollection of the TextFragmentAbsorber instance. We need a complete code snippet from you to determine where the issue lies. Would you please share a sample code snippet that reproduces the issue with your sample PDF document? We will then proceed to assist you accordingly.

JanezZ · October 3, 2019, 2:18pm

Hello and thank you for your reply.

We managed to work around this exception by changing the text after finding all the fragments we’re interested in.

However, we still have one small problem. We’re trying to set the color of the found text fragments, but the text stays black, as in the original…

Here’s a minimal code example to reproduce this issue:

package test.demo;


import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import com.aspose.pdf.Color;
import com.aspose.pdf.TextExtractionError;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextSearchOptions;
import com.aspose.pdf.TextState;

public class TextFragmentAbsorberTest {
	
	private TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
	private TextSearchOptions textSearchOptions = new TextSearchOptions(true);
	
	public static void main(String[] args) {
		TextFragmentAbsorberTest test = new TextFragmentAbsorberTest();
		try {
			test.findFragmentsMultipleTest();
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
	
	public void findFragmentsMultipleTest() throws Exception {

		File originalPdf = getTestDocumentFile("test_document.pdf");
		

		List<String> valuesToSearch = Arrays.asList("::1::.*::1::", "::2::.*::2::");
		
		com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(new FileInputStream(originalPdf));
		textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
		
		List<TextFragment> foundFragments = new ArrayList<TextFragment>();
		
		for (String valueToSearch : valuesToSearch) {
			textFragmentAbsorber.setPhrase(valueToSearch);
			textFragmentAbsorber.getTextSearchOptions().setLogTextExtractionErrors(true);
			pdfDocument.getPages().accept(textFragmentAbsorber);
			
			if (textFragmentAbsorber.hasErrors()) {
			    // Information about found errors and locations is stored in 
			    // Errors collection.
			    for (TextExtractionError error : textFragmentAbsorber.getErrors()) {
			      // TextExtractionError object contains information about the
			      // text extraction error found during processing concrete 
			      // text fragment.
			      System.err.println(error);
			      System.err.println(String.format("Extracted text: '%s'",
			      error.getExtractedText()));
			    }
			}
			
			System.out.println(String.format("Found %s matching text fragments for text %s",
					textFragmentAbsorber.getTextFragments().size(), valueToSearch));
			
			Iterator<TextFragment> txFragmentIterator = textFragmentAbsorber.getTextFragments().iterator();
			while (txFragmentIterator.hasNext()) {
				TextFragment fragment = (TextFragment) txFragmentIterator.next();
				System.out.println(fragment.getText());
				foundFragments.add(fragment);
			}
		}
		
		for(TextFragment foundFragment: foundFragments) {
			TextState state = foundFragment.getTextState();
			state.setForegroundColor(Color.fromArgb(0, 255, 255, 255));
			// also tried
			// state.setForegroundColor(Color.getWhite());
			foundFragment.setText(foundFragment.getText());
			state.setForegroundColor(Color.fromArgb(0, 255, 255, 255));
		}
		
		pdfDocument.save(new FileOutputStream("out.pdf"));
	}
	
	private File getTestDocumentFile(String docName) throws URISyntaxException {
		System.out.println("Getting data for document name: " + docName);
		URL res = TextFragmentAbsorberTest.class.getResource(docName);
		System.out.println("Getting data for document: " + res.getPath());
		File file = new File(res.toURI());
		return file;
	}

}

EDIT: the problem actually isn’t apparent in the test document, because the color of the first delimited field is changed correctly. But with multiple delimited fields in the document, at some point the color setting breaks down and random pieces of text are changed. I have a feeling this is because changing the text of the first fragment invalidates the positions of the other fragments?

Do you have any idea how we could achieve the following:

find all delimited fragments
change their text and color

Thanks for your help!
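
The suspicion above, that changing one fragment invalidates the recorded positions of the others, can be illustrated with plain strings. The sketch below is only an analogy built on java.util.regex and StringBuilder; it does not use Aspose and makes no claim about how TextFragment positions are tracked internally. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StaleOffsetDemo {

    // A match position recorded before any edit is made.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Records the start and end offsets of every regex match in text.
    static List<Span> findSpans(String regex, String text) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            spans.add(new Span(matcher.start(), matcher.end()));
        }
        return spans;
    }

    // Applies the replacements first-to-last using the recorded offsets.
    // Any replacement that changes the length shifts the text after it,
    // so the later spans no longer point at the fields they were found on.
    static String replaceFrontToBack(String text, List<Span> spans, String replacement) {
        StringBuilder sb = new StringBuilder(text);
        for (Span span : spans) {
            // Clamp so that a stale start offset cannot run past the end of the buffer.
            int start = Math.min(span.start, sb.length());
            sb.replace(start, span.end, replacement);
        }
        return sb.toString();
    }

    // Applies the replacements last-to-first. An edit only shifts the text
    // that follows it, so the offsets of the earlier spans stay valid.
    static String replaceBackToFront(String text, List<Span> spans, String replacement) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = spans.size() - 1; i >= 0; i--) {
            Span span = spans.get(i);
            sb.replace(span.start, span.end, replacement);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A ::1::x::1:: B ::1::y::1:: C";
        List<Span> spans = findSpans("::1::.*?::1::", text);

        // Stale offsets: the second replacement lands on the wrong characters.
        System.out.println(replaceFrontToBack(text, spans, "REPLACED-VALUE"));

        // Prints "A REPLACED-VALUE B REPLACED-VALUE C"
        System.out.println(replaceBackToFront(text, spans, "REPLACED-VALUE"));
    }
}
```

In this string analogy, editing from the last match to the first, or searching again after every edit, keeps each position valid. Whether either approach carries over to TextFragment objects is an assumption that would need testing against the actual document.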

asad.ali · October 3, 2019, 8:39pm

@JanezZ

Would you kindly share a sample PDF document that contains more of the delimited text whose color you are changing? We will test the scenario with that document and address it accordingly.

JanezZ · October 4, 2019, 3:14pm

Hi,

this document combined with the previous code reproduces the issue:

test_document_multiple.pdf (38.2 KB)

We managed to work around it by reloading the document after each replacement. It’s slower but it works without issues.

Thank you for your time!

asad.ali · October 4, 2019, 9:43pm

@JanezZ

We are checking the scenario and will get back to you shortly.