Hi team,
I am trying to get the text fragments from a page in a PDF using a regex.
I have checked that the regex returns a match, but page.accept(textFragmentAbsorber) does not return any text fragments.
regex = "(?i)Departmental(?:[\(\d|\w\.\s\)])Business(?:[\(\d|\w\.\s\)])Continuity(?:[\(\d|\w\.\s\)])Plans(?:[\(\d|\w\.\s\)])will(?:[\(\d|\w\.\s\)])be(?:[\(\d|\w\.\s\)])maintained(?:[\(\d|\w\.\s\)])on(?:[\(\d|\w\.\s\)])a(?:[\(\d|\w\.\s\)])continuous(?:[\(\d|\w\.\s\)])basis(?:[\(\d|\w\.\s\)])and(?:[\(\d|\w\.\s\)])modified(?:[\(\d|\w\.\s\)])as(?:[\(\d|\w\.\s\)])risk(?:[\(\d|\w\.\s\)])assessments(?:[\(\d|\w\.\s\)])require(?:[\(\d|\w\.\s\)])or(?:[\(\d|\w\.\s\)])as(?:[\(\d|\w\.\s\)])other(?:[\(\d|\w\.\s\)])business(?:[\(\d|\w\.\s\)])factors(?:[\(\d|\w\.\s\)])may(?:[\(\d|\w\.\s\)])dictate.(?:[\(\d|\w\.\s\)])Plan(?:[\(\d|\w\.\s\)])audits(?:[\(\d|\w\.\s\)])will(?:[\(\d|\w\.\s\)])be(?:[\(\d|\w\.\s\)])conducted(?:[\(\d|\w\.\s\)])by(?:[\(\d|\w\.\s\)])Corporate(?:[\(\d|\w\.\s\)])Security(?:[\(\d|\w\.\s\)])to(?:[\(\d|\w\.\s\)])help(?:[\(\d|\w\.\s\)])ensure(?:[\(\d|\w\.\s\)])conformance(?:[\(\d|\w\.\s\)])to(?:[\(\d|\w\.\s\)])the(?:[\(\d|\w\.\s\)])BCM(?:[\(\d|\w\.\s\)])framework".
Code:
public static void main(String[] args) throws Exception {
AsposeUtils asposeUtils = new AsposeUtils();
asposeUtils.applyALicense();
Document document = new Document("66382.pdf");
String text = "Departmental Business Continuity Plans will be maintained on a continuous basis and modified as risk assessments require or as other business factors may dictate. Plan audits will be conducted by Corporate Security to help ensure conformance to the BCM framework. ";
Integer pageNumber = 30;
log.info("Fetching coordinates from page");
var coordinateInfo = fetchCoordinates(text, document,pageNumber);
log.info("Fetched coordinates from page {}", coordinateInfo);
}
public static TextFragmentCollection fetchCoordinates(String text, Document document, Integer pageNumber) {
try{
PageCollection pages = document.getPages();
Page page = pages.get_Item(pageNumber);
log.info("Inside fetch coordinates from page at pageNumber {} and text {} ", pageNumber, text);
var rgx = "?i)Departmental(?:[\\(\\d|\\w\\.\\s\\)])Business(?:[\\(\\d|\\w\\.\\s\\)])Continuity(?:[\\(\\d|\\w\\.\\s\\)])Plans(?:[\\(\\d|\\w\\.\\s\\)])will(?:[\\(\\d|\\w\\.\\s\\)])be(?:[\\(\\d|\\w\\.\\s\\)])maintained(?:[\\(\\d|\\w\\.\\s\\)])on(?:[\\(\\d|\\w\\.\\s\\)])a(?:[\\(\\d|\\w\\.\\s\\)])continuous(?:[\\(\\d|\\w\\.\\s\\)])basis(?:[\\(\\d|\\w\\.\\s\\)])and(?:[\\(\\d|\\w\\.\\s\\)])modified(?:[\\(\\d|\\w\\.\\s\\)])as(?:[\\(\\d|\\w\\.\\s\\)])risk(?:[\\(\\d|\\w\\.\\s\\)])assessments(?:[\\(\\d|\\w\\.\\s\\)])require(?:[\\(\\d|\\w\\.\\s\\)])or(?:[\\(\\d|\\w\\.\\s\\)])as(?:[\\(\\d|\\w\\.\\s\\)])other(?:[\\(\\d|\\w\\.\\s\\)])business(?:[\\(\\d|\\w\\.\\s\\)])factors(?:[\\(\\d|\\w\\.\\s\\)])may(?:[\\(\\d|\\w\\.\\s\\)])dictate.(?:[\\(\\d|\\w\\.\\s\\)])Plan(?:[\\(\\d|\\w\\.\\s\\)])audits(?:[\\(\\d|\\w\\.\\s\\)])will(?:[\\(\\d|\\w\\.\\s\\)])be(?:[\\(\\d|\\w\\.\\s\\)])conducted(?:[\\(\\d|\\w\\.\\s\\)])by(?:[\\(\\d|\\w\\.\\s\\)])Corporate(?:[\\(\\d|\\w\\.\\s\\)])Security(?:[\\(\\d|\\w\\.\\s\\)])to(?:[\\(\\d|\\w\\.\\s\\)])help(?:[\\(\\d|\\w\\.\\s\\)])ensure(?:[\\(\\d|\\w\\.\\s\\)])conformance(?:[\\(\\d|\\w\\.\\s\\)])to(?:[\\(\\d|\\w\\.\\s\\)])the(?:[\\(\\d|\\w\\.\\s\\)])BCM(?:[\\(\\d|\\w\\.\\s\\)])framework";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(rgx);
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textSearchOptions.setLogTextExtractionErrors(true);
textSearchOptions.setIgnoreShadowText(true);
textSearchOptions.setIgnoreResourceFontErrors(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
log.info("Before getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
page.accept(textFragmentAbsorber);
log.info("After getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
if (textFragmentCollection.size() == 0) {
log.info("No fragments found in page at pageNumber {} and text {}", pageNumber, text);
return null;
}
return textFragmentCollection;
} catch (Exception e){
log.error("Exception occurred in cli process ", e);
return null;
}
}
The fetchCoordinates method returns null because textFragmentCollection.size() == 0.
Attached document:
66382.pdf (104.8 KB)
Attached regex match: regex_match.png (136.3 KB)
@ashu_agrawal_sirionlabs_com
We tested this code snippet with version 24.1 of the API and got the error below:
?i)Departmental(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Business(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Continuity(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Plans(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])will(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])be(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])maintained(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])on(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])a(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])continuous(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])basis(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])and(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])modified(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])as(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])risk(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])assessments(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])require(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])or(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])as(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])other(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])business(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])factors(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])may(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])dictate.(?:[\(\d|[\pL\pM\p
{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Plan(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])audits(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])will(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])be(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])conducted(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])by(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Corporate(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Security(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])to(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])help(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])ensure(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])conformance(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])to(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])the(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])BCM(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])framework
^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
?i)Departmental(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Business(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Continuity(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Plans(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])will(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])be(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])maintained(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])on(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])a(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])continuous(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])basis(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])and(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])modified(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])as(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])risk(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])assessments(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])require(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])or(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])as(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])other(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])business(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])factors(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])may(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])dictate.(?:[\(\d|[\pL\pM\p
{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Plan(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])audits(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])will(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])be(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])conducted(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])by(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Corporate(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])Security(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])to(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])help(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])ensure(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])conformance(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])to(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])the(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])BCM(?:[\(\d|[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]\.\s\)])framework
^
at java.util.regex.Pattern.error(Pattern.java:1969)
Can you please share the environment details in which you are able to run this code successfully? We will further proceed accordingly.
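For anyone reproducing the error above: the exception does not need Aspose to trigger. The pattern in the first snippet starts with "?i)" instead of "(?i)", and java.util.regex rejects that at compile time. Below is a minimal sketch using only the JDK; the class and method names (InlineFlagCheck, compiles) are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class InlineFlagCheck {

    // Hypothetical helper: true if the regex compiles, false on a syntax error.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Missing opening parenthesis: '?' has nothing to quantify.
        System.out.println(compiles("?i)Departmental"));   // prints false

        // Correct inline case-insensitive flag.
        System.out.println(compiles("(?i)Departmental"));  // prints true

        // With the flag in place, case differences are ignored.
        System.out.println(
            Pattern.compile("(?i)Departmental").matcher("DEPARTMENTAL").matches()); // prints true
    }
}
```

Running a check like this on the pattern string before passing it to TextFragmentAbsorber may help separate plain regex syntax problems from text extraction problems.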
Hi @asad.ali
Working code snippet:
public static void main(String[] args) throws Exception {
AsposeUtils asposeUtils = new AsposeUtils();
asposeUtils.applyALicense();
Document document = new Document("66382.pdf");
String text = "Departmental Business Continuity Plans will be maintained on a continuous basis and modified as risk assessments require or as other business factors may dictate. Plan audits will be conducted by Corporate Security to help ensure conformance to the BCM framework. ";
Integer pageNumber = 30;
log.info("Fetching coordinates from page");
var coordinateInfo = fetchCoordinates(text, document,pageNumber);
log.info("Fetched coordinates from page {}", coordinateInfo);
}
public static TextFragmentCollection fetchCoordinates(String text, Document document, Integer pageNumber) {
try{
PageCollection pages = document.getPages();
Page page = pages.get_Item(pageNumber);
log.info("Inside fetch coordinates from page at pageNumber {} and text {} ", pageNumber, text);
var rgx = "(?i)Departmental(?:[\\(\\d|\\w\\.\\s\\)])Business(?:[\\(\\d|\\w\\.\\s\\)])Continuity(?:[\\(\\d|\\w\\.\\s\\)])Plans(?:[\\(\\d|\\w\\.\\s\\)])will(?:[\\(\\d|\\w\\.\\s\\)])be(?:[\\(\\d|\\w\\.\\s\\)])maintained(?:[\\(\\d|\\w\\.\\s\\)])on(?:[\\(\\d|\\w\\.\\s\\)])a(?:[\\(\\d|\\w\\.\\s\\)])continuous(?:[\\(\\d|\\w\\.\\s\\)])basis(?:[\\(\\d|\\w\\.\\s\\)])and(?:[\\(\\d|\\w\\.\\s\\)])modified(?:[\\(\\d|\\w\\.\\s\\)])as(?:[\\(\\d|\\w\\.\\s\\)])risk(?:[\\(\\d|\\w\\.\\s\\)])assessments(?:[\\(\\d|\\w\\.\\s\\)])require(?:[\\(\\d|\\w\\.\\s\\)])or(?:[\\(\\d|\\w\\.\\s\\)])as(?:[\\(\\d|\\w\\.\\s\\)])other(?:[\\(\\d|\\w\\.\\s\\)])business(?:[\\(\\d|\\w\\.\\s\\)])factors(?:[\\(\\d|\\w\\.\\s\\)])may(?:[\\(\\d|\\w\\.\\s\\)])dictate.(?:[\\(\\d|\\w\\.\\s\\)])Plan(?:[\\(\\d|\\w\\.\\s\\)])audits(?:[\\(\\d|\\w\\.\\s\\)])will(?:[\\(\\d|\\w\\.\\s\\)])be(?:[\\(\\d|\\w\\.\\s\\)])conducted(?:[\\(\\d|\\w\\.\\s\\)])by(?:[\\(\\d|\\w\\.\\s\\)])Corporate(?:[\\(\\d|\\w\\.\\s\\)])Security(?:[\\(\\d|\\w\\.\\s\\)])to(?:[\\(\\d|\\w\\.\\s\\)])help(?:[\\(\\d|\\w\\.\\s\\)])ensure(?:[\\(\\d|\\w\\.\\s\\)])conformance(?:[\\(\\d|\\w\\.\\s\\)])to(?:[\\(\\d|\\w\\.\\s\\)])the(?:[\\(\\d|\\w\\.\\s\\)])BCM(?:[\\(\\d|\\w\\.\\s\\)])framework";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(rgx);
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textSearchOptions.setLogTextExtractionErrors(true);
textSearchOptions.setIgnoreShadowText(true);
textSearchOptions.setIgnoreResourceFontErrors(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
log.info("Before getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
page.accept(textFragmentAbsorber);
log.info("After getting fragments from page at pageNumber {} and text {}", pageNumber, text.substring(0, Math.min(text.length(), 50)));
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
if (textFragmentCollection.size() == 0) {
log.info("No fragments found in page at pageNumber {} and text {}", pageNumber, text);
return null;
}
return textFragmentCollection;
} catch (Exception e){
log.error("Exception occurred in cli process ", e);
return null;
}
}
@ashu_agrawal_sirionlabs_com
We are checking it and will get back to you shortly.
@ashu_agrawal_sirionlabs_com
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFJAVA-43541
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
Hi @asad.ali,
Can we know the priority with which this has been raised internally?
Also, will we be notified on this ticket once it is resolved? I know you do not provide exact timelines or commit to resolutions, but is there an approximate timeline we can expect? Perhaps your next minor release? Any expected ETA will help us plan internally.
@AdityaSirion
The ticket will be prioritized on a first-come, first-served basis, and we will keep you informed of the progress of its resolution. As soon as the ticket is investigated, we will be able to share news about a fix ETA. Please be patient and give us some time.
We are sorry for the inconvenience.
The issues you have found earlier (filed as PDFJAVA-43541) have been fixed in Aspose.PDF for Java 24.3.
Hi @aspose.notifier @asad.ali This is not fixed yet. Please try the same example as above. We have upgraded Aspose to version 24.3 and the issue is still reproducible.
@AdityaSirion
We fixed the problem. Additionally, you need to modify the regular expression slightly because the text in the PDF file has additional space characters at the end of each line. Please see the following code snippet:
Document document = new Document(getInputPdf());
String text = "Departmental Business Continuity Plans will be maintained on a continuous basis and modified as risk assessments require or as other business factors may dictate. Plan audits will be conducted by Corporate Security to help ensure conformance to the BCM framework.";
Page page = document.getPages().get_Item(30);
String rgx = "(?i)" + text.replace(" ", "(?:[\\(\\d|\\w\\.\\s\\)])+?");
//System.out.println(rgx);
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(rgx);
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textSearchOptions.setLogTextExtractionErrors(true);
textSearchOptions.setIgnoreShadowText(true);
textSearchOptions.setIgnoreResourceFontErrors(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
page.accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
System.out.println(textFragmentCollection.size());
System.out.println(textFragmentCollection.get_Item(1).getText());
2024-03-14 21.46.10.png (72.7 KB)
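To illustrate why the "+?" quantifier in the snippet above matters, here is a small JDK-only sketch. It assumes the extracted page text contains a trailing space followed by a line break between two words; the sample string, the class name SeparatorQuantifierDemo, and the helper methods are invented for the demonstration and do not come from the PDF or from Aspose.

```java
import java.util.regex.Pattern;

public class SeparatorQuantifierDemo {

    // Separator class from the thread: matches exactly ONE character.
    public static final String SEP = "(?:[\\(\\d|\\w\\.\\s\\)])";

    // Hypothetical helper: replaces every space in the text with the separator.
    public static String buildRegex(String text, String separator) {
        return "(?i)" + text.replace(" ", separator);
    }

    // Hypothetical helper: true if the regex matches anywhere in the page text.
    public static boolean found(String regex, String pageText) {
        return Pattern.compile(regex).matcher(pageText).find();
    }

    public static void main(String[] args) {
        String text = "continuous basis and modified";

        // Assumed shape of extracted text: a space AND a line break after "basis".
        String pageText = "on a continuous basis \nand modified as risk";

        // One separator character cannot cover both the space and the line break.
        System.out.println(found(buildRegex(text, SEP), pageText));        // prints false

        // "+?" lets the separator cover one or more characters, lazily.
        System.out.println(found(buildRegex(text, SEP + "+?"), pageText)); // prints true
    }
}
```

Note that text.replace(" ", ...) leaves the "." in the search text unescaped, so it matches any character. If strict literal matching is needed, wrapping each word with Pattern.quote would be one option.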