Thanks for getting back to us.
We have tested both of your methods, using ParagraphAbsorber and TableAbsorber, with Aspose.PDF for Java 18.9 and did not notice any issue. All text from the PDF document was extracted. Note that, with this particular PDF document, the ParagraphAbsorber class is able to extract all of the text, whereas TableAbsorber can only be used to extract table data.
Please check the following code snippet and the extracted text output for your reference, which we used in our environment to test the scenario:
Document doc = new Document(dataDir + "8.pdf");

// Extract all text, paragraph by paragraph
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            // Each line of the paragraph is a list of text fragments
            for (List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                sb.append("\r\n");
            }
            sb.append("\r\n");
            System.out.println(sb);
        }
    }
}
try {
    // Extract table data only, page by page
    TableAbsorber absorber = new TableAbsorber();
    PageCollection pc = doc.getPages();
    for (Page pg : pc) {
        absorber.visit(pg);
        com.aspose.pdf.internal.ms.System.Collections.Generic.IGenericList<AbsorbedTable> l = absorber.getTableList();
        for (AbsorbedTable table : l) {
            for (AbsorbedRow row : table.getRowList()) {
                for (AbsorbedCell cell : row.getCellList()) {
                    // Print the cell bounds, then the text it contains
                    System.out.println(cell.getRectangle());
                    for (TextFragment tf : cell.getTextFragments()) {
                        for (TextSegment ts : tf.getSegments()) {
                            System.out.println(ts.getText());
                        }
                    }
                }
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
outputtext.zip (1.6 KB)
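The inner loops of the ParagraphAbsorber snippet above build each paragraph by concatenating the fragments of every line and ending each line with a line break. As a minimal, library-free sketch of just that joining step, the example below uses plain strings in place of TextFragment objects. The ParagraphJoiner class and joinParagraph method are hypothetical names used for illustration only and are not part of Aspose.PDF. Note that the line break is written with the escape sequence "\r\n".

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphJoiner {

    // Concatenates the fragments of each line, ends every line with CRLF,
    // and appends one extra CRLF to separate the paragraph from the next one.
    // Plain strings stand in for TextFragment.getText() values (assumption).
    static String joinParagraph(List<List<String>> lines) {
        StringBuilder sb = new StringBuilder();
        for (List<String> line : lines) {
            for (String fragment : line) {
                sb.append(fragment);
            }
            sb.append("\r\n");
        }
        sb.append("\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> lines = Arrays.asList(
                Arrays.asList("Hello, ", "world"),
                Arrays.asList("second line"));
        System.out.print(joinParagraph(lines));
    }
}
```

Keeping the joining logic in a small method like this may make it easier to check the extracted text independently of the PDF being processed.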
Furthermore, we always recommend using the latest version of the API, because it contains more fixes and enhancements. Please use the latest version with a valid license. In case you do not have a valid license, you can get a 30-day temporary license from our website. Please feel free to let us know if you face any issue.