PDF extract paragraph

dongsp · November 26, 2020, 6:46am

I am extracting paragraphs in Java with the PDF API, version 20.11. Why is every line returned as a separate paragraph?
Attached are the PDFs in question: 文档翻译测试文档_英到中.pdf (248.3 KB)
文档翻译测试文档_中到英语.pdf (218.7 KB)
文档翻译测试文档_中到英语_长文档.pdf (351.1 KB)

asad.ali · November 26, 2020, 4:42pm

@dongsp

Could you please share the code snippet you used on your side? We will test the scenario accordingly and share our feedback with you.

dongsp · November 27, 2020, 2:58am

code.png (50.8 KB)
Hi, attached is the code snippet.

asad.ali · November 28, 2020, 4:05am

@dongsp

The API extracts text from a PDF in the way it was added to and stored in the file. We observed the same behavior in our environment while extracting paragraphs from your files, and have therefore logged the following tickets in our issue tracking system:

PDFJAVA-39977 (文档翻译测试文档_中到英语_长文档.pdf)
PDFJAVA-39978 (文档翻译测试文档_中到英语.pdf)
PDFJAVA-39979 (文档翻译测试文档_英到中.pdf)

We will look further into the details of the logged tickets and keep you posted on the status of their resolution. Please be patient and give us some time.

We are sorry for the inconvenience.
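
For anyone seeing the same behavior while the tickets are open, here is a minimal post-processing sketch in plain Java. It assumes each extracted line comes back as its own paragraph string, and it rejoins lines using a simple heuristic: a line that does not end in sentence-final punctuation is treated as continuing on the next line. The class `ParagraphMerger`, the method `mergeLines`, and the punctuation list are hypothetical names and choices made for illustration; none of them are part of the PDF API, and the heuristic will misjudge headings, list items, and lines that happen to end mid-sentence on a period.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphMerger {
    // Characters assumed to end a sentence (Latin and CJK). Tune for your documents.
    private static final String SENTENCE_END = ".!?:;。！？：；";

    /**
     * Joins consecutive lines into paragraphs. A blank line, or a line ending
     * in sentence-final punctuation, closes the current paragraph.
     */
    public static List<String> mergeLines(List<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                flush(current, paragraphs);
                continue;
            }
            // Add a joining space only between non-CJK characters.
            if (current.length() > 0
                    && needsSpace(current.charAt(current.length() - 1), line.charAt(0))) {
                current.append(' ');
            }
            current.append(line);
            char last = line.charAt(line.length() - 1);
            if (SENTENCE_END.indexOf(last) >= 0) {
                flush(current, paragraphs);
            }
        }
        flush(current, paragraphs);
        return paragraphs;
    }

    // Rough check: code units below U+2E80 are treated as non-CJK.
    private static boolean needsSpace(char before, char after) {
        return before < 0x2E80 && after < 0x2E80;
    }

    // Moves the accumulated text into the result list, skipping empty buffers.
    private static void flush(StringBuilder current, List<String> paragraphs) {
        if (current.length() > 0) {
            paragraphs.add(current.toString());
            current.setLength(0);
        }
    }
}
```

To use it, collect the text of each paragraph the API returns into a `List<String>` in reading order and pass that list to `ParagraphMerger.mergeLines`. This only works around the symptom in the extracted text; it does not change how the API detects paragraphs.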