I am using Aspose.PDF to extract text from a PDF with ParagraphAbsorber. However, it does not appear to return all of the relevant whitespace, such as spaces between words.
The following code is similar to what we use in our project: it walks through a PDF document page by page and uses ParagraphAbsorber to extract the text (and, eventually, the formatting changes) from each page:
InputStream stream = new FileInputStream("test-documents/simpleWithFormattingChanges.pdf");
// Second test file; pass stream_2 to Document instead to reproduce the second result below.
InputStream stream_2 = new FileInputStream("test-documents/multipleParagraphsNoFormattingChanges.pdf");
Document doc = new Document(stream);
PageCollection pages = doc.getPages();
for (Page page : pages) {
    // Run a fresh absorber over each page.
    ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(page);
    List<PageMarkup> markups = paragraphAbsorber.getPageMarkups();
    for (PageMarkup markup : markups) {
        List<MarkupSection> sections = markup.getSections();
        for (MarkupSection section : sections) {
            List<MarkupParagraph> paragraphs = section.getParagraphs();
            for (MarkupParagraph paragraph : paragraphs) {
                List<l0t<TextFragment>> lines = paragraph.getLines();
                for (l0t<TextFragment> line : lines) {
                    // Print the fragments of one line back to back, exactly as returned.
                    for (int i = 0; i < line.size(); i++) {
                        TextFragment fragment = line.get(i);
                        System.out.print(fragment.getText());
                    }
                    System.out.println();
                }
            }
        }
    }
}
This is the result of using the file with formatting changes:
First paragraph is plain.
Second paragraph has different font.
Third paragraphhas different font sizes.
Fourth paragraph has bold, underlineand italics.
Notice the missing spaces after the words “paragraph” and “underline”.
If you use the other file, which has no formatting changes, the result is:
First paragraph is plain.
Second paragraph is plain.
Third paragraphis plain.
Fourth paragraph is plain.
Notice the missing space after the word “paragraph”.
Note that each of these PDF files was created by selecting “Save As PDF” from Microsoft Word.
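One way to check whether the spaces are simply absent from the returned fragments is to compare the horizontal positions of neighbouring fragments on a line. The following is only a minimal sketch of that idea and not something Aspose.PDF provides: `Span`, `join`, and `minGap` are hypothetical names, the coordinates are made up, and it assumes each fragment's left and right x-coordinates can be read from the library (for example from the fragment's bounding rectangle).

```java
import java.util.ArrayList;
import java.util.List;

class FragmentJoiner {

    /** Hypothetical stand-in for a text fragment: its text plus left and right x-coordinates. */
    static final class Span {
        final String text;
        final double left;
        final double right;

        Span(String text, double left, double right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Concatenates the spans of one line in order, inserting a single space wherever the
     * horizontal gap between two neighbours exceeds minGap and no whitespace is already
     * present at the boundary.
     */
    static String join(List<Span> line, double minGap) {
        StringBuilder out = new StringBuilder();
        Span previous = null;
        for (Span current : line) {
            if (previous != null
                    && current.left - previous.right > minGap
                    && !endsWithWhitespace(out)
                    && !startsWithWhitespace(current.text)) {
                out.append(' ');
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    private static boolean endsWithWhitespace(CharSequence s) {
        return s.length() > 0 && Character.isWhitespace(s.charAt(s.length() - 1));
    }

    private static boolean startsWithWhitespace(String s) {
        return !s.isEmpty() && Character.isWhitespace(s.charAt(0));
    }

    public static void main(String[] args) {
        List<Span> line = new ArrayList<>();
        // Coordinates are made up for illustration, not measured from the attached PDFs.
        line.add(new Span("Third paragraph", 72.0, 150.0));
        line.add(new Span("is plain.", 153.0, 195.0));
        System.out.println(join(line, 1.0)); // prints "Third paragraph is plain."
    }
}
```

The threshold `minGap` would need tuning per font size, so this is at best a stopgap rather than a substitute for the absorber returning the whitespace itself.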
Any help would be greatly appreciated.
simpleWithFormattingChanges.pdf (190.9 KB)
multipleParagraphsNoFormattingChanges.pdf (86.9 KB)