Hello,
I have written code that searches for certain text in PDF files.
For most of them the library works fine, but there are a few in which AsposePDF can't find the text. All files have the same format.
The code that I use is:
final PdfTextExtractor textExtractor = new PdfTextExtractor(reader);
for (int index = 1; index <= reader.getNumberOfPages(); index++) {
    final String text = textExtractor.getTextFromPage(index);
    boolean finFacturaEncontrado = true;
    for (final String literal : textos) {
        if (!StringUtils.contains(text, literal)) {
            finFacturaEncontrado = false;
            break;
        }
    }
    if (finFacturaEncontrado) {
        return index;
    }
}
Basically, we want to search the files for certain words (“Totales”, “Bruto:”, “Base imponible:”, “Impuestos:”, “Total:”), and in some files that contain these words the library does not find them.
I have attached two files. Both of them contain these words. With the file called “OK.pdf” Aspose can extract the words, but with the file called “KO.pdf” it can't.
Can you help me, please?
Thank you
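The page-matching loop in the post above can be tried out separately from any PDF library. Below is a minimal sketch, assuming the text of each page has already been extracted into a list of strings. The class name `InvoiceEndFinder`, the method names, and the whitespace normalization are illustrative assumptions and not part of the original code. Normalizing whitespace is only one possible way to make the comparison more tolerant of how different extractors return spaces and line breaks; the thread does not confirm that this was the cause for “KO.pdf”.

```java
import java.util.List;

final class InvoiceEndFinder {

    // Replace non-breaking spaces, then collapse any run of whitespace
    // (spaces, tabs, line breaks) into a single space.
    static String normalize(String s) {
        return s.replace('\u00A0', ' ').replaceAll("\\s+", " ").trim();
    }

    // Returns the 1-based number of the first page whose text contains
    // every literal, or -1 if no page contains all of them.
    static int firstPageContainingAll(List<String> pageTexts, List<String> literals) {
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = normalize(pageTexts.get(i));
            boolean allFound = true;
            for (String literal : literals) {
                if (!text.contains(normalize(literal))) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical page texts standing in for what an extractor returns.
        List<String> pages = List.of(
                "Factura 1\nBruto: 10",
                "Totales\nBruto: 10\nBase imponible: 10\nImpuestos: 2\nTotal: 12");
        List<String> literals =
                List.of("Totales", "Bruto:", "Base imponible:", "Impuestos:", "Total:");
        System.out.println(firstPageContainingAll(pages, literals));
    }
}
```

Printing the raw text that the extractor returns for a failing page, next to the literals being searched for, is a quick way to see whether the words are present at all or only differ in spacing.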
Hi there,
Thanks for your inquiry. It seems you have shared iText sample code. However, I have tested the scenario with Aspose.Pdf for Java 10.4.0 and was unable to reproduce the issue. Please check the following sample code snippet, which searches for a list of words and highlights them; it finds the words successfully. Please download and try the latest version of Aspose.Pdf for Java; it should resolve the issue.
Document document = new Document(myDir + "KO.pdf");
String[] words = new String[]{"Totales", "Bruto:", "Base imponible:", "Impuestos:", "Total:"};
for (String wrd : words) {
    // create an absorber that searches for the current word
    com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber1 =
            new com.aspose.pdf.TextFragmentAbsorber(wrd, new TextSearchOptions(true));
    // run the search on every page (page numbers start at 1)
    for (int cnt = 1; cnt <= document.getPages().size(); cnt++) {
        Page page = document.getPages().get_Item(cnt);
        page.accept(textFragmentAbsorber1);
    }
    // collect the matches and highlight each matched segment
    TextFragmentCollection textFragmentCollection1 = textFragmentAbsorber1.getTextFragments();
    for (int cnt1 = 1; cnt1 <= textFragmentCollection1.size(); cnt1++) {
        TextFragment textFragment = textFragmentCollection1.get_Item(cnt1);
        for (TextSegment textSegment : (Iterable<TextSegment>) textFragment.getSegments()) {
            textSegment.getTextState().setForegroundColor(com.aspose.pdf.Color.getBlack());
            textSegment.getTextState().setBackgroundColor(com.aspose.pdf.Color.getLightBlue());
            System.out.println(textSegment.getText() + " X: " +
                    (float) textSegment.getPosition().getXIndent() + " Y:" +
                    (float) textSegment.getPosition().getYIndent());
            com.aspose.pdf.Rectangle rect = new com.aspose.pdf.Rectangle(
                    (float) textSegment.getPosition().getXIndent(),
                    (float) textSegment.getPosition().getYIndent(),
                    (float) textSegment.getPosition().getXIndent() +
                            (float) textSegment.getRectangle().getWidth(),
                    (float) textSegment.getPosition().getYIndent() +
                            (float) textSegment.getRectangle().getHeight());
            HighlightAnnotation highlight = new HighlightAnnotation(
                    textFragment.getPage(), rect);
            highlight.setOpacity(.80);
            highlight.setBorder(new Border(highlight));
            highlight.setColor(com.aspose.pdf.Color.getLightBlue());
            textFragment.getPage().getAnnotations().add(highlight);
        }
    }
}
// save updated document - you can set your output file here
document.save(myDir + "KO_highlight.pdf");
Please feel free to contact us for any further assistance.
Best Regards,
OK, thank you.
It works. Sorry for the confusion.
Hello, I found this post while looking for a way to search a PDF for a certain string of text. May I know which part of the code above does that? Could you give me sample code for searching for the string “OR Number” in a PDF?
Hi Junmil,
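In the snippet earlier in this thread, the search itself is the part that creates a TextFragmentAbsorber with the phrase, calls page.accept(...) on each page, and then reads getTextFragments(). Everything inside the segment loop only highlights the matches and prints their positions. Presumably the same pattern applies to “OR Number” by passing that string to the absorber instead of the words in the list. As a library-independent illustration of the matching step, here is a minimal sketch that reports which pages contain a phrase, assuming the text of each page has already been extracted into a list of strings. The class name `PhraseSearch`, the method name, and the case-insensitive comparison are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PhraseSearch {

    // Returns the 1-based numbers of all pages whose text contains the
    // phrase, ignoring upper/lower case.
    static List<Integer> pagesContaining(List<String> pageTexts, String phrase) {
        String needle = phrase.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (pageTexts.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical page texts standing in for extracted PDF pages.
        List<String> pages = List.of(
                "Official Receipt",
                "OR Number: 12345",
                "Please quote your or number when paying");
        System.out.println(pagesContaining(pages, "OR Number"));
    }
}
```

If positions on the page are needed as well (for example, to highlight the match), the absorber-based approach from the earlier snippet is the better fit, since plain extracted text carries no coordinates.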