No TextFragment If Text Lies Across Multiple Pages

ashu_agrawal_sirionlabs_com · February 22, 2024, 12:55pm

Hi Team,

I am trying to get a TextFragment with the PageCollection.accept() method and a regex, for text that starts on one page and ends on another. With the code below, the returned TextFragmentCollection is empty.

Code:

import com.aspose.pdf.Document;
import com.aspose.pdf.PageCollection;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.aspose.pdf.TextSearchOptions;

// Note: AsposeUtils is our own licensing helper and log is our logger field.
public static void main(String[] args) throws Exception {
    AsposeUtils asposeUtils = new AsposeUtils();
    asposeUtils.applyALicense();
    Document document = new Document("66382.pdf");
    String text = "corporate separation rules both require themaintenance of detailed books and records";
    Integer pageNumber = 2;
    log.info("Fetching coordinates from page");
    var coordinateInfo = fetchFragments(text, document, pageNumber);
    log.info("Fetched coordinates from page {}", coordinateInfo);
}

public static TextFragmentCollection fetchFragments(String text, Document document, Integer pageNumber) {
    try {
        PageCollection pages = document.getPages();
        log.info("Inside fetch coordinates from page at pageNumber {} and text {} ", pageNumber, text);
        // Words of the phrase, each pair separated by an optional run of
        // parentheses, digits, word characters, '|', dots, or whitespace.
        var rgx = "(?i)corporate(?:[\\(\\d|\\w\\.\\s\\)]*)separation(?:[\\(\\d|\\w\\.\\s\\)]*)rules(?:[\\(\\d|\\w\\.\\s\\)]*)both(?:[\\(\\d|\\w\\.\\s\\)]*)require(?:[\\(\\d|\\w\\.\\s\\)]*)the(?:[\\(\\d|\\w\\.\\s\\)]*)maintenance(?:[\\(\\d|\\w\\.\\s\\)]*)of(?:[\\(\\d|\\w\\.\\s\\)]*)detailed(?:[\\(\\d|\\w\\.\\s\\)]*)books(?:[\\(\\d|\\w\\.\\s\\)]*)and(?:[\\(\\d|\\w\\.\\s\\)]*)records";
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(rgx);
        // true = treat the search phrase as a regular expression
        TextSearchOptions textSearchOptions = new TextSearchOptions(true);
        textSearchOptions.setLogTextExtractionErrors(true);
        textSearchOptions.setIgnoreShadowText(true);
        textSearchOptions.setIgnoreResourceFontErrors(true);
        textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
        log.info("Before getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
        // Search all pages of the document, not just pageNumber
        pages.accept(textFragmentAbsorber);
        log.info("After getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
        if (textFragmentCollection.size() == 0) {
            log.info("No fragments found in page at pageNumber {} and text {}", pageNumber, text);
            return null;
        }
        return textFragmentCollection;
    } catch (Exception e) {
        log.error("Exception occurred in cli process ", e);
        return null;
    }
}

fetchFragments method is returning null because textFragmentCollection.size() == 0.
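
For readers adapting this search to other phrases, here is a minimal sketch of how a pattern of this shape could be generated rather than typed by hand. It reuses the separator group that the posted pattern has before `records`, `(?:[\(\d|\w\.\s\)]*)`. The class name `TolerantPatternSketch` and the method `buildRegex` are hypothetical names for illustration and are not part of Aspose.PDF. The sketch assumes every word consists only of ASCII letters and digits, so no regex escaping is needed, and it has only been reasoned about against `java.util.regex`.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Aspose.PDF.
final class TolerantPatternSketch {

    // Optional run of parentheses, digits, '|', word characters, dots, or whitespace.
    private static final String GAP = "(?:[\\(\\d|\\w\\.\\s\\)]*)";

    // Builds "(?i)word1<GAP>word2<GAP>..." from a whitespace-separated phrase.
    public static String buildRegex(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        for (String word : words) {
            // Assumption: plain alphanumeric words only, so nothing needs escaping.
            if (!word.matches("[A-Za-z0-9]+")) {
                throw new IllegalArgumentException("Unsupported word: " + word);
            }
        }
        return Arrays.stream(words).collect(Collectors.joining(GAP, "(?i)", ""));
    }

    public static void main(String[] args) {
        String regex = buildRegex("the maintenance of detailed books and records");
        System.out.println(regex);
        // The gap may be empty, so "themaintenance" still matches.
        System.out.println(Pattern.compile(regex)
                .matcher("require themaintenance of detailed\nbooks and records")
                .find()); // prints true
    }
}
```

Because the separator class includes `\w`, the gap can also swallow whole unrelated words between two search words, so a pattern like this may match more text than intended.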

Document:
66382.pdf (104.8 KB)

Please note: the same code works fine if the text lies on a single page.
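
The difference between the single-page and cross-page cases can be illustrated with plain strings. This is only a sketch of the symptom using `java.util.regex`, not a description of how Aspose.PDF searches internally. The page texts are made-up stand-ins rather than the contents of 66382.pdf, and `CrossPageSearchSketch` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not Aspose.PDF code.
final class CrossPageSearchSketch {

    // True if the pattern matches inside the text of any single page.
    public static boolean matchesWithinOnePage(List<String> pageTexts, Pattern pattern) {
        for (String pageText : pageTexts) {
            if (pattern.matcher(pageText).find()) {
                return true;
            }
        }
        return false;
    }

    // True if the pattern matches once the page texts are joined in page order.
    public static boolean matchesAcrossPages(List<String> pageTexts, Pattern pattern) {
        return pattern.matcher(String.join("\n", pageTexts)).find();
    }

    public static void main(String[] args) {
        // Made-up page texts: the phrase starts on one page and ends on the next.
        List<String> pageTexts = List.of(
                "both require the maintenance of detailed books",
                "and records for each entity");
        Pattern pattern = Pattern.compile("(?i)detailed\\s+books\\s+and\\s+records");

        System.out.println(matchesWithinOnePage(pageTexts, pattern)); // prints false
        System.out.println(matchesAcrossPages(pageTexts, pattern));   // prints true
    }
}
```

A match found in the joined text carries no page coordinates, so a sketch like this can confirm that the phrase spans a page break but cannot replace the TextFragment results.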

asad.ali · February 22, 2024, 8:15pm

@ashu_agrawal_sirionlabs_com

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-43622

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

ashu_agrawal_sirionlabs_com · July 18, 2024, 11:09am

Hi @asad.ali

We have paid support. Please fix this on priority and share the ETA.

asad.ali · July 18, 2024, 7:40pm

@ashu_agrawal_sirionlabs_com

If you have paid support, please log in there and start a topic with a reference to the ticket ID shared here. Your issue will be escalated accordingly.

ashu_agrawal_sirionlabs_com · July 19, 2024, 6:28am

Hi @asad.ali

This account has been upgraded to paid support.
OrderId: 240405060630

asad.ali · July 19, 2024, 3:30pm

@ashu_agrawal_sirionlabs_com

You should now be able to log in to paid support using the same email address that was used to purchase or subscribe to it. Please log in there and create a topic as we requested earlier.

aspose.notifier · January 7, 2025, 8:42pm

The issues you have found earlier (filed as PDFJAVA-43622) have been fixed in Aspose.PDF for Java 24.12.