Why is some "TextFragment data" different when parsed by ParagraphAbsorber vs. by TableAbsorber?

KDSDEV · February 19, 2020, 4:08am

I am trying to translate a PDF document.
Strings need to be combined into sentences for translation,
so I have used ParagraphAbsorber for that.

Now I also want to use TableAbsorber to support tables.

APPROACH

Parts recognized in the TableAbsorber-based translation should be excluded from the ParagraphAbsorber-based translation.
I am trying to detect these exclusions using the common object “TextSegment” that both absorbers produce:
a TextSegment already processed by the TableAbsorber-based translation is not processed again by the ParagraphAbsorber-based translation.

PROBLEM

The unit/size of a TextSegment is sometimes different between TableAbsorber and ParagraphAbsorber.

What options should be used to get the same unit/size of TextSegment in these two absorbers?

SAMPLE DATA and confirmation project (SegmentUnit.zip)

Make a TextSegment list with ParagraphAbsorber, registering each segment in a <Dictionary>.
-> OUTPUT: sample_r_out_Paragraph.txt
Make a TextSegment list with TableAbsorber, checking each segment against the above <Dictionary>.
-> OUTPUT: sample_r_out_Table.txt
-> Find “*** none ***” in sample_r_out_Table.txt; such a TextSegment was not registered by the ParagraphAbsorber-based listing.
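
The two listing passes can be sketched roughly as follows. This is a library-independent illustration, not the code in SegmentUnit.zip: plain tuples stand in for TextSegment data, the record layout (two coordinates plus text) is an assumption, the function names `make_key`, `register_segments` and `check_segments` are hypothetical, and the sample records are invented.

```python
# Library-independent sketch: plain tuples stand in for TextSegment data.
# Each record is (first_coord, second_coord, text); this layout is an assumption.

def make_key(segment, digits=2):
    """Build a dictionary key from rounded coordinates plus the text."""
    a, b, text = segment
    return (round(a, digits), round(b, digits), text)

def register_segments(paragraph_segments):
    """Pass 1: register every paragraph-pass segment under its key."""
    registry = {}
    for number, segment in enumerate(paragraph_segments, start=1):
        registry[make_key(segment)] = number
    return registry

def check_segments(table_segments, registry):
    """Pass 2: look up each table-pass segment; a miss is reported as '*** none ***'."""
    report = []
    for segment in table_segments:
        hit = registry.get(make_key(segment))
        report.append((segment, "*** none ***" if hit is None else str(hit)))
    return report

# Hypothetical records, invented for illustration only.
paragraph_pass = [(100.0, 50.0, "Cell A "), (100.0, 120.0, "Cell B Cell C ")]
table_pass = [(100.0, 50.0, "Cell A "), (100.0, 120.0, "Cell B ")]

registry = register_segments(paragraph_pass)
report = check_segments(table_pass, registry)
for segment, status in report:
    print(segment, status)
```

With an exact key like this, a table-pass segment is only found when the paragraph pass produced a segment with the same position and the same text, so any difference in segment size shows up as a miss.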

For example,
this data from TableAbsorber
[sample_r_out_Table.txt] (378.86,271.40) [26] *** none *** "Total Net Assets "
is analyzed by ParagraphAbsorber as
[sample_r_out_Paragraph.txt] (378.86,180.00) [85] 1 "Total Assets Total Net Assets Net Assets Ratio Net Assets per Share "
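
To make the size difference concrete, here is a small library-independent check on the two records above. It only uses the values printed in the output files; treating each record as (first coordinate, second coordinate, text) is an assumption, and the helper `is_part_of` is hypothetical, not part of Aspose.PDF.

```python
# The two records quoted above, as (first_coord, second_coord, text).
# What each coordinate means is an assumption; only the printed values are used.
table_seg = (378.86, 271.40, "Total Net Assets ")
paragraph_seg = (
    378.86,
    180.00,
    "Total Assets Total Net Assets Net Assets Ratio Net Assets per Share ",
)

def is_part_of(small, large, tolerance=0.01):
    """True if both records share the first coordinate and small's text occurs in large's text."""
    same_first = abs(small[0] - large[0]) <= tolerance
    return same_first and small[2] in large[2]

exact_match = table_seg == paragraph_seg          # what a dictionary lookup needs
contained = is_part_of(table_seg, paragraph_seg)  # the table segment is a piece of the paragraph segment
offset = paragraph_seg[2].find(table_seg[2])      # where the table text starts inside the paragraph text

print(exact_match, contained, offset)
```

The exact comparison fails while the containment check succeeds, which is the situation described in PROBLEM: the TableAbsorber segment covers one cell, and the ParagraphAbsorber segment covers the whole row of cells.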

SegmentUnit.zip contains: sample.pdf, the output files (sample_r_out_Paragraph.txt, sample_r_out_Table.txt), and the sample project.

Version: Aspose.PDF 19.6.0

SegmentUnit.zip (67.4 KB)

asad.ali · February 19, 2020, 12:59pm

@KDSDEV

We have logged an investigation ticket as PDFNET-47707 in our issue tracking system. We will definitely look into the details of the scenario and keep you posted on the status of the ticket resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.