I am trying to translate a PDF document.
Text strings need to be combined into sentences before translation.
I have used ParagraphAbsorber for that.
Now I also want to use TableAbsorber to support tables.
APPROACH
Parts recognized by the TableAbsorber-based translation are excluded from the ParagraphAbsorber-based translation.
I am trying to detect these exclusions using the common object “TextSegment” that both absorbers return:
a TextSegment already processed by the TableAbsorber-based translation is not processed again by the ParagraphAbsorber-based translation.
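
A minimal sketch of this exclusion idea, written in Python rather than taken from the project in SegmentUnit.zip. `Segment`, `key_of` and `paragraph_segments_to_translate` are hypothetical stand-ins for illustration, not Aspose.PDF types or methods; it assumes each TextSegment can be reduced to a position and its text.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """Hypothetical stand-in for a TextSegment: a position plus its text."""
    x: float
    y: float
    text: str


def key_of(seg: Segment) -> Tuple[float, float, str]:
    # Round the coordinates so tiny floating-point differences
    # between the two passes do not break the lookup.
    return (round(seg.x, 2), round(seg.y, 2), seg.text)


def paragraph_segments_to_translate(
    table_segs: List[Segment], paragraph_segs: List[Segment]
) -> List[Segment]:
    """Return the paragraph segments that the table pass has not already handled."""
    handled = {key_of(s) for s in table_segs}
    return [s for s in paragraph_segs if key_of(s) not in handled]
```

This only works when both passes produce segments with identical boundaries, which is exactly the assumption that fails in the problem described below.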
PROBLEM
The unit/size of a TextSegment is sometimes different between TableAbsorber and ParagraphAbsorber.
What options should be used to get the same unit/size of TextSegment in these two absorbers?
SAMPLE DATA and confirmation project (SegmentUnit.zip)
- Make a TextSegment list with ParagraphAbsorber, registering each TextSegment in a <Dictionary>
--> OUTPUT: sample_r_out_Paragraph.txt
- Make a TextSegment list with TableAbsorber, checking each TextSegment against the above <Dictionary>
--> OUTPUT: sample_r_out_Table.txt
--> Find “*** none ***” in sample_r_out_Table.txt; a TextSegment marked this way was not registered by the ParagraphAbsorber-based listing.
For example,
this TextSegment from TableAbsorber
[sample_r_out_Table.txt] (378.86,271.40) [26] *** none *** "Total Net Assets "
is returned by ParagraphAbsorber as part of a longer TextSegment:
[sample_r_out_Paragraph.txt] (378.86,180.00) [85] 1 "Total Assets Total Net Assets Net Assets Ratio Net Assets per Share "
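
To illustrate the mismatch with the two entries above, here is a small Python sketch. It is not an Aspose.PDF option and not the code in SegmentUnit.zip; `Entry`, `exact_lookup` and `find_container` are hypothetical names. It assumes the two numbers printed in each output line are a position (the output format does not say so explicitly) and ignores the bracketed number. An exact lookup misses, while a looser rule (same first coordinate, table text contained in the paragraph text) finds the longer segment.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    """Hypothetical record for one output line: two printed numbers and the text."""
    a: float  # first printed number (assumed to be a coordinate)
    b: float  # second printed number (assumed to be a coordinate)
    text: str


def exact_lookup(table_entry: Entry, paragraph_entries: List[Entry]) -> Optional[Entry]:
    # Mirrors the dictionary check: only an identical entry counts as a match.
    for p in paragraph_entries:
        if p == table_entry:
            return p
    return None


def find_container(
    table_entry: Entry, paragraph_entries: List[Entry], tol: float = 0.01
) -> Optional[Entry]:
    # Heuristic fallback (an assumption, not a library feature): accept a
    # paragraph entry that shares the first coordinate and contains the table text.
    needle = table_entry.text.strip()
    for p in paragraph_entries:
        if abs(p.a - table_entry.a) <= tol and needle in p.text:
            return p
    return None


table_entry = Entry(378.86, 271.40, "Total Net Assets ")
paragraph_entries = [
    Entry(378.86, 180.00,
          "Total Assets Total Net Assets Net Assets Ratio Net Assets per Share "),
]

print(exact_lookup(table_entry, paragraph_entries))    # no identical entry exists
print(find_container(table_entry, paragraph_entries))  # the longer paragraph entry
```

A containment rule like this can give false matches when the same text appears in several places, so it is only a workaround; what I would prefer is a way to make both absorbers return the same segments.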
SegmentUnit.zip contains: sample.pdf, the OUTPUT files (sample_r_out_Paragraph.txt, sample_r_out_Table.txt), and the sample project
Version: Aspose.PDF 19.6.0
SegmentUnit.zip (67.4 KB)