Auto detect TextFormattingMode for PDF Java

DmitriiTiukalov · August 3, 2023, 6:45pm

Hi,
I've run into an issue with textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.<X>);. With the Raw option, a multi-column PDF is extracted the way I want (not by physical position but in reading order, column by column), but tables are completely unreadable: each cell starts on a new line. With another option, e.g. Pure, it's the opposite: tables are formatted well, but columns are output by physical position.
image.png (20.0 KB)
So, what I want here (Raw):

Overview
The explosion of unstructured data shows no sign of slowing.
Analysts predict that the growth of data is 40 percent to
60 percent, but for unstructured data in the enterprise, the

Wrong (Pure):

Overview operational and permissions inconsistencies. Identify and
The explosion of unstructured data shows no sign of slowing. control open shares exposure without impacting information
Analysts predict that the growth of data is 40 percent to availability. Flexible query interface enables custom reporting,
60 percent, but for unstructured data in the enterprise, the ad-hoc analysis and third party integration through a web-

And tables:
image.png (11.1 KB)
As I want (Pure):

1 A C 2 3
B D
E
F G I J L
H K M
N
4 5 6 7
O P Q R S T
U V W
X Y
Z
This Is a empty 10
Table inside the
main Table

Wrong (Raw):

1 A
B
C
D
E
2 3
F G
H
I J
K
L
M
N
4 5 6 7
O P Q R S T
U V W
X Y
Z
This Is a
Table inside the
main Table
empty 10

I’m using aspose-pdf Java 23.6

How can I make the Aspose library detect this automatically (given that a PDF document can contain multiple columns and tables at the same time)?
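
To make the two column behaviors described above concrete, here is a minimal, self-contained Java sketch. It does not call the Aspose API and is not a description of how Aspose implements Raw or Pure. It only contrasts "physical position" ordering with "reading order" on the sample text from this post. The names ColumnOrderSketch, Fragment, physicalOrder, readingOrder and samplePage are hypothetical, and the sketch assumes each line already carries a known column and row index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ColumnOrderSketch {

    /** One line of text at an assumed (column, row) position on the page. */
    static final class Fragment {
        final int column;
        final int row;
        final String text;

        Fragment(int column, int row, String text) {
            this.column = column;
            this.row = row;
            this.text = text;
        }
    }

    /** Physical order: walk the page row by row and join fragments that share a row. */
    static List<String> physicalOrder(List<Fragment> fragments) {
        // row -> (column -> text); TreeMap keeps both levels sorted ascending
        Map<Integer, Map<Integer, String>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            rows.computeIfAbsent(f.row, r -> new TreeMap<>()).put(f.column, f.text);
        }
        List<String> lines = new ArrayList<>();
        for (Map<Integer, String> row : rows.values()) {
            lines.add(String.join(" ", row.values()));
        }
        return lines;
    }

    /** Reading order: emit every line of one column before moving to the next column. */
    static List<String> readingOrder(List<Fragment> fragments) {
        // column -> (row -> text)
        Map<Integer, Map<Integer, String>> columns = new TreeMap<>();
        for (Fragment f : fragments) {
            columns.computeIfAbsent(f.column, c -> new TreeMap<>()).put(f.row, f.text);
        }
        List<String> lines = new ArrayList<>();
        for (Map<Integer, String> column : columns.values()) {
            lines.addAll(column.values());
        }
        return lines;
    }

    /** The two-column sample from this post, with hand-assigned positions. */
    static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(0, 0, "Overview"));
        page.add(new Fragment(1, 0, "operational and permissions inconsistencies. Identify and"));
        page.add(new Fragment(0, 1, "The explosion of unstructured data shows no sign of slowing."));
        page.add(new Fragment(1, 1, "control open shares exposure without impacting information"));
        page.add(new Fragment(0, 2, "Analysts predict that the growth of data is 40 percent to"));
        page.add(new Fragment(1, 2, "availability. Flexible query interface enables custom reporting,"));
        page.add(new Fragment(0, 3, "60 percent, but for unstructured data in the enterprise, the"));
        page.add(new Fragment(1, 3, "ad-hoc analysis and third party integration through a web-"));
        return page;
    }

    public static void main(String[] args) {
        System.out.println("Physical order:");
        physicalOrder(samplePage()).forEach(System.out::println);
        System.out.println();
        System.out.println("Reading order:");
        readingOrder(samplePage()).forEach(System.out::println);
    }
}
```

On this sample, physicalOrder produces the interleaved lines shown under "Pure" above, while readingOrder returns the whole left column followed by the whole right column, which is the behavior wanted for multi-column text.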

asad.ali · August 3, 2023, 8:32pm

@DmitriiTiukalov

Can you please share your sample PDF document for our reference so that we can test the scenario in our environment and address it accordingly?

DmitriiTiukalov · August 3, 2023, 8:55pm

my_test.pdf (213.6 KB)
35655031-data-sheet-data-insight.pdf.pdf (457.6 KB)

asad.ali · August 4, 2023, 4:04am

@DmitriiTiukalov

We are checking it and will get back to you shortly.

asad.ali · August 15, 2023, 7:42pm

@DmitriiTiukalov

We have opened the following new ticket(s) in our internal issue tracking system for further investigation on this case. We will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-43035

We will look further into the details and keep you posted on the status of the fix. Please be patient and spare us a little time.

We are sorry for the inconvenience.