Aspose extracted text in wrong location

rye3000 · March 11, 2022, 3:36am

I am using the following code to extract text from the attached document:

void extract(Page pageObject) {
    var paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(pageObject);

    for (PageMarkup markup : paragraphAbsorber.getPageMarkups()) {
        for (MarkupSection section : markup.getSections()) {
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                String text = paragraph.getText();
                System.out.println(text);
            }
        }
    }
}

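The extracted text does not match the text in the document: some values are missing from their original positions and show up elsewhere in the output.

For example, the following text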
Subject 999-999 was a 99-year-old xxxxxxxxxxx, who was diagnosed with atopic dermatitis in 9999 and had a disease duration of 9 years. The subject was randomized to receive placebo subcutaneous once every week starting on 99 XXX 9999 (Week x), as per protocol.

is extracted as

Subject was a -year-old , who was diagnosed with atopic dermatitis in and had a disease duration of years. The subject was randomized to receive placebo subcutaneous once every week starting on (Week999-999 ), as per protoc99 ol.
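To pinpoint which values were dropped or displaced, the expected and extracted strings can be compared token by token. Below is a minimal sketch in plain Java, independent of Aspose; the class name ExtractionDiff and the method missingTokens are hypothetical names chosen for illustration, and the sample strings are shortened from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractionDiff {
    // Returns the whitespace-delimited tokens of `expected` that have no
    // matching token left in `extracted`. Each extracted token is consumed
    // at most once, so repeated words are counted correctly.
    static List<String> missingTokens(String expected, String extracted) {
        List<String> remaining =
                new ArrayList<>(Arrays.asList(extracted.trim().split("\\s+")));
        List<String> missing = new ArrayList<>();
        for (String token : expected.trim().split("\\s+")) {
            // List.remove(Object) returns false when the token is not present.
            if (!remaining.remove(token)) {
                missing.add(token);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String expected = "Subject 999-999 was a 99-year-old xxxxxxxxxxx, who was diagnosed";
        String extracted = "Subject was a -year-old , who was diagnosed";
        System.out.println(missingTokens(expected, extracted));
    }
}
```

This only compares whole tokens regardless of order, so it shows which values went missing but not where they ended up; a value fused into another word (such as "Week999-999") is still reported as missing.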

rye3000 · March 11, 2022, 3:40am

1524.pdf (199.6 KB)

tahir.manzoor · March 11, 2022, 8:34am

@rye3000

We have reproduced the issue on our side and logged it in our issue tracking system as PDFJAVA-41405 so that it can be investigated and fixed. You will be notified via this forum thread once the issue is resolved.

We apologize for the inconvenience.