Chinese keyword is not recognized by Aspose.PDF for Java

ciicorp · August 13, 2020, 5:35am

The keyword is Chinese: 甲方：（签章） (roughly "Party A: (signature and seal)"). We need to locate the keyword and return its coordinates, but Aspose.PDF for Java does not recognize it.
javacodeAndtestFile.zip (26.6 KB)
The attachment includes two files: the Java demo code and a PDF for testing.
We tested V19.12 and the new release V20.7, but the keyword is not found in either version.

asad.ali · August 13, 2020, 4:39pm

@ciicorp

We tested with the following code snippet and found that a line break is present after each character in the PDF:

import com.aspose.pdf.*;

// Load the test PDF (dataDir is the folder that contains it)
Document pdfDoc = new Document(dataDir + "aspseTestfile.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
// Accept the absorber for all pages of the document
pdfDoc.getPages().accept(textFragmentAbsorber);
// Get the extracted text fragments as a collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// Loop through the text fragments and print each one
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
   System.out.println(textFragment.getText());
}

Output

甲
方
：
（
签
章
）

We therefore tried the following regular expression to extract the text, but without success:

Document pdfDoc = new Document(dataDir + "aspseTestfile.pdf");
// Allow optional whitespace (including line breaks) after each character of the keyword
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("甲\\s*" +
                "方\\s*" +
                "：\\s*" +
                "（\\s*" +
                "签\\s*" +
                "章\\s*" +
                "）\\b");
// Treat the search phrase as a regular expression
TextSearchOptions textSearchOptions = textFragmentAbsorber.getTextSearchOptions();
textSearchOptions.setRegularExpressionUsed(true);
// Accept the absorber for all pages of the document
pdfDoc.getPages().accept(textFragmentAbsorber);
// Get the extracted text fragments as a collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// Loop through the matched text fragments and print each one
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
   System.out.println(textFragment.getText());
}
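
One detail of the pattern above may be worth checking on its own: it ends with `\\b` directly after the full-width parenthesis `）`. The sketch below uses only the JDK's `java.util.regex`, on the assumption (not confirmed in this thread) that Aspose.PDF evaluates `\b` in a comparable way. The class name `WordBoundaryCheck` and the helper `found` are hypothetical names for illustration. In `java.util.regex`, `）` is not a word character, so a trailing `\b` only matches when a word character follows it.

```java
import java.util.regex.Pattern;

public class WordBoundaryCheck {
    // Hypothetical helper: true if the regex matches anywhere in the text
    // (java.util.regex semantics, not necessarily Aspose.PDF's engine).
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String keyword = "甲方：（签章）";
        String withBoundary = "甲方：（签章）\\b";

        // Keyword at the very end of the text: nothing follows "）"
        System.out.println("end of text:    " + found(withBoundary, keyword));
        // Keyword followed by whitespace
        System.out.println("before a space: " + found(withBoundary, keyword + " "));
        // Keyword followed by an ASCII letter
        System.out.println("before 'A':     " + found(withBoundary, keyword + "A"));
        // Same keyword without the trailing \b
        System.out.println("no \\b:          " + found("甲方：（签章）", keyword));
    }
}
```

Under these assumptions, the pattern with the trailing `\b` does not match when the keyword ends a line or is followed by whitespace, which is one reason to try the pattern without it.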

We have logged this issue as PDFJAVA-39676 in our issue tracking system for further investigation. We will look into the details and keep you informed about its resolution status. Please be patient and allow us some time.

We are sorry for the inconvenience.

asad.ali · October 15, 2020, 9:59pm

@ciicorp

We have investigated the logged ticket and found that an extra invisible character occurs only once in the text. The text can be found with any of the following regular expressions:

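// Use only one of the three alternatives below.

// Alternative 1: at most one whitespace character between "：" and "（"
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("甲方：\\s?（签章）");

// Alternative 2: any amount of whitespace between "：" and "（"
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("甲方：\\s*（签章）");

// Alternative 3: at most one whitespace character after every character
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(
                "甲\\s?" +
                "方\\s?" +
                "：\\s?" +
                "（\\s?" +
                "签\\s?" +
                "章\\s?" +
                "）\\s?");

// As in the earlier snippet, enable regular-expression search before accepting the absorber
textFragmentAbsorber.getTextSearchOptions().setRegularExpressionUsed(true);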