Hi,
I ran into an issue with textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.<X>);
With the Raw option, a PDF with N columns is extracted the way I want (not by physical position but in reading order, column by column), but tables are completely unreadable: each cell starts on a new line. With another option, e.g. Pure, it is the opposite: tables are formatted well, but columns are output by physical position.
image.png (20.0 KB)
So, this is what I want here (Raw):
Overview
The explosion of unstructured data shows no sign of slowing.
Analysts predict that the growth of data is 40 percent to
60 percent, but for unstructured data in the enterprise, the
And this is wrong (Pure):
Overview operational and permissions inconsistencies. Identify and
The explosion of unstructured data shows no sign of slowing. control open shares exposure without impacting information
Analysts predict that the growth of data is 40 percent to availability. Flexible query interface enables custom reporting,
60 percent, but for unstructured data in the enterprise, the ad-hoc analysis and third party integration through a web-
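To make the difference between the two outputs above concrete, here is a minimal toy sketch in plain Java. It is not Aspose code and says nothing about how the library works internally; the Fragment class, the x and line values, and the column boundary of 300 are all assumptions made up for illustration. It only shows the two orderings: line by line across the whole page (what I get with Pure) versus column by column (what I get with Raw).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderDemo {
    // Hypothetical model of one text fragment: its text, the x position of
    // its left edge, and the index of the line it sits on.
    static final class Fragment {
        final String text;
        final int x;
        final int line;

        Fragment(String text, int x, int line) {
            this.text = text;
            this.x = x;
            this.line = line;
        }
    }

    // Physical order: top to bottom, then left to right, ignoring columns.
    static List<String> physicalOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.line)
                .thenComparingInt(f -> f.x));
        return texts(sorted);
    }

    // Reading order: everything left of the (assumed) column boundary first,
    // then everything right of it, each column top to bottom.
    static List<String> readingOrder(List<Fragment> fragments, int columnBoundaryX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x < columnBoundaryX ? 0 : 1)
                .thenComparingInt(f -> f.line)
                .thenComparingInt(f -> f.x));
        return texts(sorted);
    }

    private static List<String> texts(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }

    // Two lines of the two-column sample from this post; positions are made up.
    static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("Overview", 50, 0));
        page.add(new Fragment("operational and permissions inconsistencies. Identify and", 320, 0));
        page.add(new Fragment("The explosion of unstructured data shows no sign of slowing.", 50, 1));
        page.add(new Fragment("control open shares exposure without impacting information", 320, 1));
        return page;
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        System.out.println("Physical order:");
        physicalOrder(page).forEach(System.out::println);
        System.out.println("Reading order:");
        readingOrder(page, 300).forEach(System.out::println);
    }
}
```

In this toy model the column boundary is passed in by hand; in a real document it would have to be detected, which is exactly the part I am hoping the library can do for me.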
And tables:
image.png (11.1 KB)
As I want (Pure):
1 A C 2 3
B D
E
F G I J L
H K M
N
4 5 6 7
O P Q R S T
U V W
X Y
Z
This Is a empty 10
Table inside the
main Table
Wrong (Raw):
1 A
B
C
D
E
2 3
F G
H
I J
K
L
M
N
4 5 6 7
O P Q R S T
U V W
X Y
Z
This Is a
Table inside the
main Table
empty 10
I’m using aspose-pdf Java 23.6.
How can I force the Aspose library to detect this automatically (assuming that a PDF document can contain multiple columns and tables simultaneously)?