I tried to extract Japanese text from this file with ParagraphAbsorber, but some of the text comes back as a string of NUL characters ("\0\0\0\0\0").
https://support.casio.jp/storage/pdf/004/cfx-9850GCPLUS_J_02.pdf
In the above PDF file, you’ll find these Japanese strings:
“GRAPHメニュー” on page 2, and
“以下の例は、2つのグラフの交点を求めるときを除いて、すべて以下の関数式のグラフを描いてから操作したものとして説明します。” on page 3,
right beneath the page titles at the very top of each page.
With ParagraphAbsorber, these strings come back as “\0” characters.
If I use TextFragmentAbsorber, they are extracted correctly, exactly as they appear visually in the file.
However, I’d like to know whether this is a bug, because I would prefer to use ParagraphAbsorber for extracting the text. It lets me handle the lines and sentences of a paragraph together as a group, rather than working with each textFragment.Text instance individually.
The code snippets I used are below.
ParagraphAbsorber [fail]
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

Document doc = new Document("cfx-9850GCPLUS_J_02.pdf");
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(doc);
foreach (PageMarkup markup in absorber.PageMarkups)
{
    foreach (MarkupSection section in markup.Sections)
    {
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            // Collect the text of every segment in this paragraph.
            StringBuilder paragraphText = new StringBuilder();
            foreach (List<TextFragment> line in paragraph.Lines)
            {
                foreach (TextFragment textFragment in line)
                {
                    foreach (TextSegment textSegment in textFragment.Segments)
                    {
                        // For the strings above, this value consists of "\0" characters.
                        string tst_tr = textSegment.Text;
                        paragraphText.Append(tst_tr);
                    }
                }
            }
        }
    }
}
TextFragmentAbsorber [OK]
Document doc = new Document("cfx-9850GCPLUS_J_02.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
doc.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        string textSegmentText = textSegment.Text;
    }
}
Please step through the code, inspect textSegment.Text at each iteration, and let me know your feedback on this problem. Feel free to point out anything in my explanation above that is hard to understand.
Thank you.