Aspose.Pdf.Text.ParagraphAbsorber returns some text as "\0"

KDSDEV · September 26, 2018, 9:13am

I tried to extract Japanese text from this file, but some of the text comes back as a string of NUL characters ("\0\0\0\0\0") when I use ParagraphAbsorber.
https://support.casio.jp/storage/pdf/004/cfx-9850GCPLUS_J_02.pdf

In the above PDF file, you’ll find these Japanese strings:
“GRAPHメニュー” on page 2, and
“以下の例は、２つのグラフの交点を求めるときを除いて、すべて以下の関数式のグラフを描いてから操作したものとして説明します。” on page 3,
right beneath the page titles at the very top of each page.
These strings turn out to be “\0” with ParagraphAbsorber.

If I use TextFragmentAbsorber, they are extracted correctly, exactly as they appear in the file.
However, I’d like to know whether this is a bug, because I’d prefer to use ParagraphAbsorber: it lets me work with the lines and sentences of a paragraph together as a group, rather than with individual textFragment.Text values one at a time.

The code snippets I used are below.

ParagraphAbsorber [fail]

using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

Document doc = new Document("cfx-9850GCPLUS_J_02.pdf");
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(doc);
foreach (PageMarkup markup in absorber.PageMarkups)
{
    foreach (MarkupSection section in markup.Sections)
    {
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            // Collect the text of every segment in this paragraph.
            StringBuilder paragraphText = new StringBuilder();
            foreach (List<TextFragment> line in paragraph.Lines)
            {
                foreach (TextFragment textFragment in line)
                {
                    foreach (TextSegment textSegment in textFragment.Segments)
                    {
                        // For the strings mentioned above, Text is "\0" here.
                        string tst_tr = textSegment.Text;
                        paragraphText.Append(tst_tr);
                    }
                }
            }
        }
    }
}
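
To pin down which paragraphs are affected, a small check on the accumulated text may help. The following is a minimal sketch; the class NulTextCheck and its methods IsNulOnly and MakeNulVisible are hypothetical helper names of my own, not part of Aspose.PDF.

```csharp
using System;
using System.Linq;

// Hypothetical helpers (not part of Aspose.PDF) for spotting text
// that was extracted as NUL characters.
public static class NulTextCheck
{
    // True when the string is non-empty and every character is U+0000.
    public static bool IsNulOnly(string text)
    {
        return !string.IsNullOrEmpty(text) && text.All(c => c == '\0');
    }

    // Replaces each NUL with the visible symbol U+2400 so that the
    // affected text shows up in logs or a debugger.
    public static string MakeNulVisible(string text)
    {
        return text == null ? string.Empty : text.Replace('\0', '\u2400');
    }
}
```

Calling NulTextCheck.IsNulOnly(paragraphText.ToString()) at the end of the paragraph loop would flag the paragraphs whose text was lost.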

TextFragmentAbsorber [OK]

using Aspose.Pdf;
using Aspose.Pdf.Text;

Document doc = new Document("cfx-9850GCPLUS_J_02.pdf");

// ".+" matches any run of characters; passing true enables regular-expression search.
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;

doc.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        // Here Text contains the expected Japanese strings.
        string textSegmentText = textSegment.Text;
    }
}
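
Since TextFragmentAbsorber returns the correct text, one possible stopgap is to group its fragments into lines by position. The sketch below only illustrates the grouping step and does not reproduce ParagraphAbsorber. PositionedText, LineGrouping, and the 2.0-point tolerance are hypothetical choices of my own. I assume the X and Y values could be filled from textFragment.Position.XIndent and textFragment.Position.YIndent, and that Y increases towards the top of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one fragment: its text and baseline position.
public record PositionedText(string Text, double X, double Y);

public static class LineGrouping
{
    // Groups items whose Y values lie within `tolerance` of the first item
    // of a line, then joins each line's text from left to right.
    public static List<string> GroupIntoLines(
        IEnumerable<PositionedText> items, double tolerance = 2.0)
    {
        var lines = new List<List<PositionedText>>();

        // Assumption: larger Y is higher on the page, so descending
        // order yields lines from top to bottom.
        foreach (var item in items.OrderByDescending(i => i.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - item.Y) <= tolerance)
            {
                current.Add(item);
            }
            else
            {
                lines.Add(new List<PositionedText> { item });
            }
        }

        return lines
            .Select(l => string.Concat(l.OrderBy(i => i.X).Select(i => i.Text)))
            .ToList();
    }
}
```

The tolerance would need tuning per document, and this simple rule does not handle vertical text or multi-column layouts.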

Please step through the code and look closely at textSegment.Text, then let me know what you find. If any part of my explanation above is hard to follow, please point it out.
Thank you.

Farhan.Raza · September 26, 2018, 7:02pm

@KDSSHO

Thank you for contacting support.

We tested the file you shared and were able to reproduce the issue in our environment. A ticket with ID PDFNET-45464 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

KDSDEV · November 13, 2018, 2:55am

Hi.

Any update here? If not, I’ll just wait then.

Farhan.Raza · November 13, 2018, 10:57am

@KDSSHO

Thank you for getting back to us.

We are afraid PDFNET-45464 is still pending investigation. It will be investigated in its due turn, which can take a few more months. We will let you know as soon as a significant update is available. We appreciate your patience and understanding in this regard.

KDSDEV · November 15, 2018, 12:22am

Thank you for your reply. I understand the issue can take a few more months.

Farhan.Raza · November 15, 2018, 8:13am

@KDSSHO

Thank you for understanding.

We will notify you as soon as the ticket is resolved.

aspose.notifier · December 19, 2022, 9:36pm

The issue you reported earlier (filed as PDFNET-45464) has been fixed in Aspose.PDF for .NET 22.12.