I am using the following code to extract text from the attached document:
void extract(Page pageObject) {
    var paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(pageObject);
    for (PageMarkup markup : paragraphAbsorber.getPageMarkups()) {
        for (MarkupSection section : markup.getSections()) {
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                String text = paragraph.getText();
                System.out.println(text);
            }
        }
    }
}
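The extracted text does not match the text in the document: the placeholder values (IDs, ages, years, dates) are either dropped or moved to the wrong position.
For example, the following text in the document: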
Subject 999-999 was a 99-year-old xxxxxxxxxxx, who was diagnosed with atopic dermatitis in 9999 and had a disease duration of 9 years. The subject was randomized to receive placebo subcutaneous once every week starting on 99 XXX 9999 (Week x), as per protocol.
is extracted as:
Subject was a -year-old , who was diagnosed with atopic dermatitis in and had a disease duration of years. The subject was randomized to receive placebo subcutaneous once every week starting on (Week999-999 ), as per protoc99 ol.
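To make the mismatch easier to see, here is a minimal sketch that lists which whitespace-separated tokens of the expected text have no counterpart in the extracted text. It is plain Java and not part of Aspose.PDF; the class name ExtractionDiff and the method missingTokens are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractionDiff {

    // Returns the tokens of `expected` that have no remaining counterpart in
    // `actual`. Tokens are split on whitespace and compared as a multiset,
    // so a token that occurs twice in `expected` must also occur twice in `actual`.
    public static List<String> missingTokens(String expected, String actual) {
        List<String> remaining =
                new ArrayList<>(Arrays.asList(actual.trim().split("\\s+")));
        List<String> missing = new ArrayList<>();
        for (String token : expected.trim().split("\\s+")) {
            // remove(Object) returns false when the token is not present
            if (!remaining.remove(token)) {
                missing.add(token);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Text as it appears in the document
        String expected = "Subject 999-999 was a 99-year-old xxxxxxxxxxx, who was diagnosed "
                + "with atopic dermatitis in 9999 and had a disease duration of 9 years. "
                + "The subject was randomized to receive placebo subcutaneous once every "
                + "week starting on 99 XXX 9999 (Week x), as per protocol.";
        // Text as returned by paragraph.getText()
        String actual = "Subject was a -year-old , who was diagnosed with atopic dermatitis "
                + "in and had a disease duration of years. The subject was randomized to "
                + "receive placebo subcutaneous once every week starting on "
                + "(Week999-999 ), as per protoc99 ol.";
        System.out.println(missingTokens(expected, actual));
    }
}
```

The comparison ignores token order, so it only shows what was lost or altered, not where the displaced fragments ended up.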