Fragments extracted by ParagraphAbsorber contain only one char

davidknn · January 12, 2024, 4:19am

I am using ParagraphAbsorber to extract paragraphs from the attached PDF. In every paragraph, each fragment contains only a single char, and para.Text is missing the spaces between the words.

Please use the following code and example PDF to reproduce:

example.pdf (870.2 KB)

AsposeExample.zip (211.1 KB)

Thank you!

sergei.shibanov · January 12, 2024, 4:04pm

@davidknn

I checked it with the code below, and in my environment there is no such effect that paragraphs consist of only one character.

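using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains example.pdf (not defined in this snippet)
var pdf = new Document(dataDir + "example.pdf");
var page = pdf.Pages[1]; // the Pages collection is 1-based

var para_absorber = new ParagraphAbsorber();
para_absorber.Visit(page);

// Extract sections and paragraphs; PageMarkups holds one entry per visited page
var sections = para_absorber.PageMarkups[0].Sections;

foreach (var s in sections)
{
    foreach (var p in s.Paragraphs)
    {
        // Print the whole paragraph text, followed by a separator line
        Console.WriteLine(p.Text);
        Console.WriteLine("-----------------------------");
    }
}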

I was able to reproduce the missing spaces, and I will create a task for the development team about this.

sergei.shibanov · January 12, 2024, 4:30pm

@davidknn
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56305

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

sergei.shibanov · January 12, 2024, 4:32pm

@davidknn
When creating the task regarding spaces, I added the following note:

As I discovered, if you select this line in Acrobat and paste it through the clipboard, there will also be no spaces. Nevertheless, it would be nice if they were added to the result.

davidknn · January 13, 2024, 7:20am

> there is no such effect that paragraphs consist of only one character.

I mean that it is not every paragraph that consists of one char, but every fragment.

Try printing p.Fragments[0].Text instead of p.Text. You will find that every fragment's Text is a single letter.
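
A minimal sketch of that check, assuming the Aspose.Pdf package is referenced and the path to the PDF is passed as the first argument. The class name FragmentDump and the argument handling are illustrative and are not taken from the attached project.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class FragmentDump
{
    static void Main(string[] args)
    {
        // Assumption: the PDF path comes from the command line, else the working directory.
        string path = args.Length > 0 ? args[0] : "example.pdf";

        var pdf = new Document(path);
        var absorber = new ParagraphAbsorber();
        absorber.Visit(pdf.Pages[1]); // first page; Pages is 1-based

        foreach (var section in absorber.PageMarkups[0].Sections)
        {
            foreach (var paragraph in section.Paragraphs)
            {
                Console.WriteLine("Paragraph with {0} fragment(s):", paragraph.Fragments.Count);

                // Print every fragment on its own line, in brackets, so that
                // single-character fragments are easy to spot.
                foreach (TextFragment fragment in paragraph.Fragments)
                {
                    Console.WriteLine("  [{0}]", fragment.Text);
                }
            }
        }
    }
}
```

If each bracketed entry holds exactly one letter, the fragments are per-character even when the paragraph text as a whole looks complete.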

> if you select this line in Acrobat and paste it through the clipboard, there will also be no spaces.

That is right. I reproduced it in my environment.

So my guess is that the text is stored in the PDF character by character?

sergei.shibanov · January 15, 2024, 6:12am

@davidknn

image.png (252.1 KB)
Yes, in the document you have attached, the text is represented by a set of characters from which words and paragraphs are composed.
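
When text is stored as separate characters like this, one common way to guess where the spaces belong is to compare the horizontal gap between neighbouring characters with the character width. The sketch below shows that heuristic on plain data. It is not Aspose's method and not the fix tracked in PDFNET-56305; the names CharBox, SpaceRebuilder, JoinLine and gapRatio are hypothetical, and the 0.3 threshold is a guess rather than a tuned value.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one single-character fragment: its text plus the
// left and right edges of its bounding box, in page units.
public readonly struct CharBox
{
    public CharBox(string text, double left, double right)
    {
        Text = text;
        Left = left;
        Right = right;
    }

    public string Text { get; }
    public double Left { get; }
    public double Right { get; }
}

public static class SpaceRebuilder
{
    // Joins the fragments of one line, assumed to be sorted left to right.
    // A space is inserted wherever the gap to the previous fragment is larger
    // than gapRatio times the width of that previous fragment.
    public static string JoinLine(IReadOnlyList<CharBox> boxes, double gapRatio = 0.3)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < boxes.Count; i++)
        {
            if (i > 0)
            {
                CharBox prev = boxes[i - 1];
                double gap = boxes[i].Left - prev.Right;
                double width = prev.Right - prev.Left;
                if (gap > gapRatio * width)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(boxes[i].Text);
        }
        return sb.ToString();
    }
}
```

With Aspose.Pdf, the edges could plausibly be taken from each fragment's Rectangle (LLX for the left edge, URX for the right), assuming those boxes sit closely enough around the glyphs for the gaps to be meaningful. That has not been checked against example.pdf.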