Page 487.pdf (345.0 KB)
Hello, I'm facing a problem: when I try to extract the contents of a PDF file, I get incorrect results.
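void wrongExtractionCoordinates() throws IOException {
    // Load the attached sample PDF from the test classpath.
    var inputStream = new ClassPathResource("pdf/Page 487.pdf").getInputStream();
    var document = new Document(inputStream);
    // Aspose.PDF page collections are 1-based, so this is the first page.
    var page = document.getPages().get_Item(1);
    // Extract the paragraph structure of the page.
    var paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(page);
    // Eighth paragraph of the fourth section in the first page markup.
    String text = paragraphAbsorber.getPageMarkups().get(0).getSections().get(3).getParagraphs().get(7).getText();
    // Expected start of the paragraph; the extracted text is missing "999-999" and a few other fragments.
    assertTrue(text.startsWith("Subject 999-999 was a 99-year-old xxxxxxxxxxx,"));
}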
As you can see in this test, the problem starts with the line I expect in the assertion: the fragment "999-999" and a few others are missing from it. If you read further through the extracted text, however, they appear two lines below.
What could be causing this issue and is there a known workaround?
Thanks!