Hello Aspose team, I have a PDF file that contains both text and tables. I need to extract only the text from the PDF, without the tables, and so far I have not been able to. I am using Java 8 and the Aspose PDF 19.2 jar, with the code below:
// Load the document and collect its paragraph markup
InputStream fis = new FileInputStream(filename);
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(fis);
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(pdfDoc);
StringBuilder sbUnstructuredText = new StringBuilder();
// Walk pages -> sections -> paragraphs -> text fragments
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            logger.debug("markup paragraph: " + mp.getText());
            for (TextFragment tfragment : mp.getFragments()) {
                sbUnstructuredText.append(tfragment.getText());
                logger.debug("textFragment: " + tfragment.getText());
                // "\n" is the newline escape; "/n" would append a literal slash and 'n'
                sbUnstructuredText.append("\n");
            }
            // Extra newline between paragraphs
            sbUnstructuredText.append("\n");
        }
    }
}
The code above also returns the text inside the tables, which is not what I need. If there is another way to get only the non-table text, please let me know.
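To make the requirement concrete, here is a rough sketch in plain Java of the kind of filtering I have in mind: drop every text fragment whose bounding box lies inside a table's bounding box. `Box`, `Fragment`, and `textOutsideTables` are hypothetical names I made up for illustration, not Aspose types, and the coordinates are invented. I have not confirmed whether 19.2 exposes table and fragment rectangles in a way that would make this possible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableTextFilter {

    /** Hypothetical axis-aligned box in page coordinates (not an Aspose type). */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        /** True when the other box lies entirely inside this one. */
        boolean contains(Box other) {
            return other.llx >= llx && other.lly >= lly
                    && other.urx <= urx && other.ury <= ury;
        }
    }

    /** Hypothetical text fragment: its text plus where it sits on the page. */
    static final class Fragment {
        final String text;
        final Box box;

        Fragment(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Keeps only the text of fragments that are not inside any table box. */
    static List<String> textOutsideTables(List<Fragment> fragments, List<Box> tables) {
        List<String> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            boolean inTable = false;
            for (Box t : tables) {
                if (t.contains(f.box)) {
                    inTable = true;
                    break;
                }
            }
            if (!inTable) {
                kept.add(f.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample data: one table region and three fragments.
        List<Box> tables = Arrays.asList(new Box(50, 300, 550, 500));
        List<Fragment> fragments = Arrays.asList(
                new Fragment("Intro paragraph", new Box(50, 700, 550, 720)),
                new Fragment("Cell A1", new Box(60, 450, 200, 470)),
                new Fragment("Closing paragraph", new Box(50, 100, 550, 120)));
        System.out.println(String.join("\n", textOutsideTables(fragments, tables)));
    }
}
```

With the sample data above, the fragment "Cell A1" falls inside the table box and is dropped, while the two paragraphs are kept. A fragment that only partly overlaps a table would be kept by this check.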
I have attached the sample PDF I used.
attachment_UshurData_new.pdf (265.4 KB)
Regards,
Shashikant.