Hi Team,
I am trying to get a TextFragment using the PageCollection.accept() method with a regex, for text that starts on one page and ends on another. With the code below, the returned TextFragmentCollection is empty.
text’s regex = (?i)corporate(?:[\(\d|\w\.\s\)]*)separation(?:[\(\d|\w\.\s\)]*)rules(?:[\(\d|\w\.\s\)]*)both(?:[\(\d|\w\.\s\)]*)require(?:[\(\d|\w\.\s\)]*)the(?:[\(\d|\w\.\s\)]*)maintenance(?:[\(\d|\w\.\s\)]*)of(?:[\(\d|\w\.\s\)]*)detailed(?:[\(\d|\w\.\s\)]*)books(?:[\(\d|\w\.\s\)]*)and(?:[\(\d|\w\.\s\)]*)records
Code:
import com.aspose.pdf.Document;
import com.aspose.pdf.PageCollection;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.aspose.pdf.TextSearchOptions;

public static void main(String[] args) throws Exception {
    AsposeUtils asposeUtils = new AsposeUtils();
    asposeUtils.applyALicense();
    Document document = new Document("66382.pdf");
    String text = "corporate separation rules both require themaintenance of detailed books and records";
    Integer pageNumber = 2;
    log.info("Fetching coordinates from page");
    var coordinateInfo = fetchFragments(text, document, pageNumber);
    log.info("Fetched coordinates from page {}", coordinateInfo);
}

public static TextFragmentCollection fetchFragments(String text, Document document, Integer pageNumber) {
    try {
        PageCollection pages = document.getPages();
        log.info("Inside fetch coordinates from page at pageNumber {} and text {} ", pageNumber, text);
        // Each word is separated by zero or more of: ( ) | . digits, word characters, whitespace.
        var rgx = "(?i)corporate(?:[\\(\\d|\\w\\.\\s\\)]*)separation(?:[\\(\\d|\\w\\.\\s\\)]*)rules(?:[\\(\\d|\\w\\.\\s\\)]*)both(?:[\\(\\d|\\w\\.\\s\\)]*)require(?:[\\(\\d|\\w\\.\\s\\)]*)the(?:[\\(\\d|\\w\\.\\s\\)]*)maintenance(?:[\\(\\d|\\w\\.\\s\\)]*)of(?:[\\(\\d|\\w\\.\\s\\)]*)detailed(?:[\\(\\d|\\w\\.\\s\\)]*)books(?:[\\(\\d|\\w\\.\\s\\)]*)and(?:[\\(\\d|\\w\\.\\s\\)]*)records";
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(rgx);
        // true = treat the search phrase as a regular expression.
        TextSearchOptions textSearchOptions = new TextSearchOptions(true);
        textSearchOptions.setLogTextExtractionErrors(true);
        textSearchOptions.setIgnoreShadowText(true);
        textSearchOptions.setIgnoreResourceFontErrors(true);
        textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
        log.info("Before getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
        // Search across the whole page collection, not only pageNumber.
        pages.accept(textFragmentAbsorber);
        log.info("After getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
        if (textFragmentCollection.size() == 0) {
            log.info("No fragments found in page at pageNumber {} and text {}", pageNumber, text);
            return null;
        }
        return textFragmentCollection;
    } catch (Exception e) {
        log.error("Exception occurred in cli process ", e);
        return null;
    }
}
The fetchFragments method returns null because textFragmentCollection.size() == 0.
Document:
66382.pdf (104.8 KB)
Please note: the same code works fine if the text lies within a single page.
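To rule out the pattern itself, it can be checked against a plain string with java.util.regex, independent of Aspose. This is only a sketch: the class name RegexCheck, the GAP constant and the sample strings are made up for illustration, the pattern assumes the `*` quantifier after every gap group, and Aspose's matching over PDF text may not behave identically to java.util.regex.

```java
import java.util.regex.Pattern;

public class RegexCheck {
    // Filler allowed between words: zero or more of ( ) | . digits, word characters, whitespace.
    static final String GAP = "(?:[\\(\\d|\\w\\.\\s\\)]*)";

    // Same pattern as in the post, assembled by joining the words with the gap group.
    static final Pattern PATTERN = Pattern.compile("(?i)" + String.join(GAP,
            "corporate", "separation", "rules", "both", "require", "the",
            "maintenance", "of", "detailed", "books", "and", "records"));

    static boolean matches(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        // The sentence on a single line.
        System.out.println(matches(
                "corporate separation rules both require the maintenance of detailed books and records"));
        // The same sentence with a line break standing in for the page break.
        System.out.println(matches(
                "Corporate separation rules both require the\nmaintenance of detailed books and records"));
    }
}
```

If both calls report a match, the pattern is sound for contiguous text, which would point to how the text is absorbed across the page boundary rather than to the regex.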