ParagraphAbsorber: wrong textFragment coordinates

ebiruk · November 3, 2022, 11:26am

Hello, I’m running into a problem: when I try to extract the contents of a PDF file, I get incorrect results.

	void wrongExtractionCoordinates() throws IOException {
		var inputStream = new ClassPathResource("pdf/Page 487.pdf").getInputStream();
		var document = new Document(inputStream);
		var page = document.getPages().get_Item(1);

		var paragraphAbsorber = new ParagraphAbsorber();
		paragraphAbsorber.visit(page);
		String text = paragraphAbsorber.getPageMarkups().get(0).getSections().get(3).getParagraphs().get(7).getText();
		assertTrue(text.startsWith("Subject 999-999 was a 99-year-old xxxxxxxxxxx,"));
	}

As the test shows, the problem starts at the line I expect in the assert: the fragment "999-999" and a few more after it are missing from that paragraph. Further on, however, they show up two lines below.
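
To pin down where the misplaced fragment actually ends up, a small helper along these lines might be useful. It is only a sketch: `FragmentLocator` and `indexOfParagraphContaining` are hypothetical names, not part of Aspose.PDF, and it assumes the paragraph texts have already been collected into a `List<String>` (for example from the `getText()` calls used in the test above).

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.stream.IntStream;

// Hypothetical helper, not part of Aspose.PDF: locates the paragraph that contains a fragment.
public class FragmentLocator {

    // Returns the index of the first paragraph text containing the fragment,
    // or an empty OptionalInt if no paragraph contains it.
    static OptionalInt indexOfParagraphContaining(List<String> paragraphTexts, String fragment) {
        return IntStream.range(0, paragraphTexts.size())
                .filter(i -> paragraphTexts.get(i).contains(fragment))
                .findFirst();
    }
}
```

Calling `indexOfParagraphContaining(texts, "999-999")` and comparing the result with the expected paragraph index would show how far the fragment has been shifted.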

What could be causing this issue and is there a known workaround?
Thanks!

asad.ali · November 3, 2022, 7:32pm

@ebiruk

We also noticed this issue in our environment while testing with version 22.10 of the API. It has therefore been logged as PDFJAVA-42189 in our issue tracking system. We will look further into the details and keep you posted on the status of the fix. Please be patient and give us some time.

We are sorry for the inconvenience.