Extracted text is split into separate characters

I found a text extraction problem in a PDF. The sentences extracted from this PDF are broken into individual characters.

The Aspose version is 23.4.

When I open the PDF in Acrobat and copy the text, it looks normal.

Please check the attached PDF and code.

example.pdf (533.0 KB)

AsposeExample.zip (106.2 KB)

@davidknn
Text in a PDF document can be shown by two operators, TJ and Tj. The difference is that Tj takes a single string containing the entire text segment, so a TextFragment can contain more than one word, while TJ takes an array in which the text segment is broken into small chunks (interleaved with position adjustments). In this document, text is represented by TJ operators. This means that a TextFragment you receive from the ParagraphAbsorber is not necessarily a whole word; it may be only a chunk of a word, and this is normal. A solution that might be useful for you is to use a TextFragmentAbsorber with a Regex search.

@davidknn
Like this:

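using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// myDir is the folder that contains example.pdf
var pdf = new Document(myDir + "example.pdf");
var page = pdf.Pages[1]; // page numbering starts at 1

// Match whole words so that chunks of one word come back as a single fragment
var absorber = new TextFragmentAbsorber(new Regex(@"\w+"));
absorber.Visit(page);

foreach (TextFragment fragment in absorber.TextFragments)
{
    Console.WriteLine(fragment.Text);
}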
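
To illustrate the TJ/Tj difference described above, here is a minimal sketch that does not use Aspose at all. It takes the operand of a TJ operator as plain text and joins its chunks back into readable text. The class name TjDemo and the sample operand are made up for illustration, and the parsing is deliberately simplified: escaped or nested parentheses and hex strings are not handled, and the numeric adjustments between chunks are ignored.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TjDemo
{
    // Matches one literal string "( ... )" inside a TJ array.
    // Simplification: strings with escaped or nested parentheses are not supported.
    private static readonly Regex LiteralString = new Regex(@"\(([^()\\]*)\)");

    // Returns the text chunks of a TJ array such as "[(Hel) -20 (lo)] TJ".
    public static string[] Chunks(string tjOperand)
    {
        return LiteralString.Matches(tjOperand)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }

    // Joins the chunks into the text a reader sees on the page.
    // The numbers between the chunks (position adjustments) are ignored here.
    public static string Join(string tjOperand)
    {
        return string.Concat(Chunks(tjOperand));
    }
}
```

For the made-up operand "[(Hel) -20 (lo W) 15 (orld)] TJ", Chunks returns the three pieces "Hel", "lo W" and "orld", and Join returns "Hello World". This is why a fragment taken straight from a TJ operator can be only part of a word.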

Thank you for your reply!

I need to extract paragraphs instead of text fragments. Is it possible to use a similar regex with ParagraphAbsorber to solve this?

@davidknn
That is more difficult, since TextFragmentAbsorber can only select lines from the text; there is no special character that marks the end of a paragraph.
To select lines, use:

var absorber = new TextFragmentAbsorber(new Regex(@".*\r\n"));

Since the paragraphs are visibly separated from each other by larger vertical gaps than the lines within a paragraph, you can use Position.YIndent of the found fragments to break them into paragraphs.
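
A minimal sketch of that grouping idea, independent of Aspose: it takes (text, y) pairs, where y stands in for the Position.YIndent of each line fragment, and starts a new paragraph whenever the vertical gap to the previous line exceeds a threshold. The class name ParagraphGrouping, the tuple shape, and the gapThreshold parameter are assumptions for illustration; the sketch also assumes that y grows upward on the page, and a suitable threshold has to be chosen for the actual document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParagraphGrouping
{
    // lines: (text, y) pairs; y plays the role of Position.YIndent.
    // Assumption: y grows upward, so lines lower on the page have a smaller y.
    // gapThreshold: hypothetical vertical distance above which a new paragraph starts.
    public static List<string> Group(IEnumerable<(string Text, double Y)> lines, double gapThreshold)
    {
        var paragraphs = new List<string>();
        var current = new List<string>();
        double? previousY = null;

        // Read from top to bottom: largest y first.
        foreach (var line in lines.OrderByDescending(l => l.Y))
        {
            bool largeGap = previousY.HasValue && previousY.Value - line.Y > gapThreshold;
            if (largeGap && current.Count > 0)
            {
                // Close the current paragraph before starting a new one.
                paragraphs.Add(string.Join(" ", current));
                current.Clear();
            }
            current.Add(line.Text.Trim());
            previousY = line.Y;
        }

        if (current.Count > 0)
        {
            paragraphs.Add(string.Join(" ", current));
        }
        return paragraphs;
    }
}
```

With the fragments from the absorber, you would pass fragment.Text and fragment.Position.YIndent for each found line. Trim removes the trailing line break that the line regex captures.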

@davidknn
I took another look at the ParagraphAbsorber API. With this code:

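using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// myDir is the folder that contains example.pdf
var pdf = new Document(myDir + "example.pdf");
var page = pdf.Pages[1]; // page numbering starts at 1

var absorber = new ParagraphAbsorber();
absorber.Visit(page);

// Only one page was visited, so its markup is the first entry
var pageMarkup = absorber.PageMarkups[0];

foreach (var section in pageMarkup.Sections)
{
    foreach (var paragraph in section.Paragraphs)
    {
        Console.WriteLine(paragraph.Text);
        Console.WriteLine("-----"); // separator between paragraphs
    }
}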

the desired result is obtained (except that one of the paragraphs is recognized as two; I will create an issue for the development team about this).
result.png (147.0 KB)

@davidknn
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54492

Hi Sergei,

Thanks for the detailed illustration.

I still have one more question:

Since there are two operators, TJ and Tj, is there a way to know which one a PDF document uses?

If so, we could use a different strategy for each case.

@davidknn
I don't think this is necessary. The ParagraphAbsorber should work for a variety of text representations. My earlier answer (post #3 in this thread, by sergei.shibanov) was due to my not fully understanding the issue. Please excuse me.

The issues you have found earlier (filed as PDFNET-54492) have been fixed in Aspose.PDF for .NET 24.4.