Text extracted become separated chars

davidknn · April 26, 2023, 4:04am

I found a text extraction problem in a PDF: the extracted sentences are broken into separate characters.

The Aspose version is 23.4.

When I open the PDF in Acrobat and copy the text, it looks normal.

Please check the attached PDF and code.

sergei.shibanov · April 26, 2023, 2:39pm

@davidknn
Text in a PDF document can be drawn by two operators, Tj and TJ. Tj takes a single string containing the entire text segment, so a TextFragment can contain more than one word. TJ takes an array in which the segment is split into small chunks, with numeric position adjustments between them; for example, [(Hel) -20 (lo)] TJ draws "Hello" in two chunks. In this document, text is represented by TJ operators. This means that a TextFragment you receive from the ParagraphAbsorber is not necessarily a whole word; it may be only a chunk of a word, and this is normal. A solution that might be useful for you is to use a TextFragmentAbsorber with a Regex search.

sergei.shibanov · April 26, 2023, 2:44pm

@davidknn
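Like this:

using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// myDir is the folder that contains the PDF.
var pdf = new Document(myDir + "example.pdf");
var page = pdf.Pages[1]; // page numbering starts at 1

// Match each run of word characters, so every fragment is a whole word.
var absorber = new TextFragmentAbsorber(new Regex(@"\w+"));
absorber.Visit(page);

foreach (TextFragment fragment in absorber.TextFragments)
{
    Console.WriteLine(fragment.Text);
}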

davidknn · April 27, 2023, 3:57am

Thank you for your reply!

I need to extract paragraphs instead of text fragments. Is it possible to use similar regex for ParagraphAbsorber to solve this?

sergei.shibanov · April 27, 2023, 3:10pm

@davidknn
That is more difficult, because TextFragmentAbsorber can only select lines: the text contains no special character that marks the end of a paragraph.
To select lines, use

var absorber = new TextFragmentAbsorber(new Regex(@".*\r\n"));

Since the paragraphs are visibly separated by larger vertical gaps than the lines inside them, you can use Position.YIndent of the found fragments to break them into paragraphs.
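
The grouping idea above could be sketched roughly as follows. This is only an illustration and not part of the Aspose API: ParagraphGrouper, Group and maxLineGap are hypothetical names, and the sketch assumes a single-column page where each line has already been reduced to its text and its Y coordinate (for example fragment.Text and fragment.Position.YIndent). The threshold has to be tuned per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Aspose.PDF.
public static class ParagraphGrouper
{
    // Joins lines into paragraphs. A new paragraph starts whenever the
    // vertical distance to the previous line exceeds maxLineGap
    // (an assumed, document-specific threshold in page units).
    public static List<string> Group(
        IEnumerable<(string Text, double Y)> lines, double maxLineGap)
    {
        var paragraphs = new List<string>();
        var current = new List<string>();
        double? previousY = null;

        // PDF Y coordinates grow upwards, so top-to-bottom reading order
        // means descending Y.
        foreach (var line in lines.OrderByDescending(l => l.Y))
        {
            bool largeGap = previousY.HasValue
                            && previousY.Value - line.Y > maxLineGap;
            if (largeGap && current.Count > 0)
            {
                paragraphs.Add(string.Join(" ", current));
                current.Clear();
            }

            current.Add(line.Text.Trim());
            previousY = line.Y;
        }

        // Flush the last paragraph, if any.
        if (current.Count > 0)
        {
            paragraphs.Add(string.Join(" ", current));
        }

        return paragraphs;
    }
}
```

For instance, three lines at Y = 700, 688 and 660 with maxLineGap = 15 would be grouped into two paragraphs: the first two lines together, then the third on its own.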

sergei.shibanov · April 27, 2023, 6:14pm

@davidknn
I took another look at the ParagraphAbsorber API. With the following code:

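using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// myDir is the folder that contains the PDF.
var pdf = new Document(myDir + "example.pdf");
var page = pdf.Pages[1]; // page numbering starts at 1

var absorber = new ParagraphAbsorber();
absorber.Visit(page);

// Only one page was visited, so there is a single markup at index 0.
var pageMarkup = absorber.PageMarkups[0];

foreach (var section in pageMarkup.Sections)
{
    foreach (var paragraph in section.Paragraphs)
    {
        Console.WriteLine(paragraph.Text);
        Console.WriteLine("-----"); // separator between paragraphs
    }
}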

the desired result is obtained, except that one of the paragraphs is recognized as two. I will create an issue for the development team about this.
result.png (147.0 KB)

sergei.shibanov · April 28, 2023, 10:43am

@davidknn
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54492

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

davidknn · April 30, 2023, 10:00am

Hi Sergei,

Thanks for the detailed explanation.

I have one more question:

Since there are two operators, TJ and Tj, is there a way to tell which one a given PDF document uses?

If so, we could use a different strategy for each case.

sergei.shibanov · May 2, 2023, 2:41pm

@davidknn
I don't think this is necessary. The ParagraphAbsorber should work for a variety of text representations. My earlier answer (Text extracted become separated chars - #3 by sergei.shibanov) came from not fully understanding the issue. Please excuse me.

aspose.notifier · April 18, 2024, 7:06pm

The issue you found earlier (filed as PDFNET-54492) has been fixed in Aspose.PDF for .NET 24.4.