Garbled text extracted using ParagraphAbsorber

I ran into garbled text with unexpected spaces when extracting text with Aspose.PDF for .NET. The details are shown in the attached code:

AsposeExample.zip (430.7 KB)

The sample PDF:
example.pdf (71.0 KB)

Please check it out. Thank you!

@davidknn
When I try to reproduce the issue, I get an exception. The paras variable has only one Text element and it doesn’t match any of the three options.
image.png (42.5 KB)
And how does it work for you?

This is not possible. I checked again with the PDF attachment I uploaded.

In my screenshot the paras variable has 40 elements, and the first element contains 6 Chinese characters, which form the top-left header of the PDF page.

vs_screenshot.png (59.1 KB)

I’m using Aspose.PDF 23.7.

Is there any chance that your Aspose license is not activated properly, so the absorber only extracts a few words from the beginning?

Otherwise it shouldn’t output only 1 paragraph in the page.

@davidknn
Perhaps this is because the fonts used in the document are missing on my system.
image.png (26.2 KB)
Could you attach them?

@davidknn
By the way, what font do you use in Visual Studio? It looks very nice.

It is a Chinese font from the Song family. I have attached several common Song fonts below. The missing font in your screenshot is AdobeSongStd-Light.

The archive is too big to attach here, so I uploaded it to Dropbox.

The font in my VS is actually the default font of the Chinese version, which is a variant of the Song font family.

@davidknn
Thank you.
Your post raises several questions and possible errors. I will reproduce them one by one and create tasks for the development team.
I have reproduced the part about the unexpected gaps in the ParagraphAbsorber output and will now create a task for the development team.

Thank you for your hard work :slight_smile:

This issue may be complicated, so I tried to show it in every possible way. Hope your team can solve it soon!

@davidknn
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55165, PDFNET-55166, PDFNET-55171

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@davidknn
I created three tasks for the development team:
PDFNET-55165: ParagraphAbsorber adds unexpected spaces
PDFNET-55166: Extract text with TextAbsorber in specified rect has unexpected result
PDFNET-55171: For a TextAbsorber search after changing the TextAbsorber.TextSearchOptions.Rectangle option, the search result is unexpected.

@davidknn
I told the development team that the explanation of TextExtractionOptions.TextFormattingMode.Flatten is not very clear (the explanation in the code was not clear to me either) and asked for clarification.
Briefly: in a PDF document, text is represented as a set of substrings (sometimes even single characters), each with given coordinates for its location on the page. Usually these substrings are stored in the document in the same order in which they follow each other on the page, and the normal modes of TextAbsorber assume this.
image.png (130.7 KB)

TextExtractionOptions.TextFormattingMode.Flatten was made to work with fragment coordinates, and whitespace handling has been improved in it. In most cases this mode should give the best result, but it may be slower, and the option is new.
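
To make the difference concrete, here is a small Python sketch of the general idea only. It is not Aspose.PDF code and not how the library is implemented: the `Fragment` type, both function names, and the tolerance values are made up for illustration. It compares joining fragments in the order they are stored with ordering them by their coordinates and inserting a space only where the horizontal gap is large enough.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical stand-in for a positioned piece of text on a page."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline; in PDF coordinates a larger y is higher on the page
    width: float  # horizontal extent of the fragment


def extract_in_stored_order(fragments):
    """Join fragments exactly as they are stored, ignoring coordinates."""
    return "".join(f.text for f in fragments)


def extract_by_coordinates(fragments, line_tol=2.0, gap_tol=1.0):
    """Order fragments by position and add spaces based on horizontal gaps.

    line_tol and gap_tol are illustrative thresholds, not library defaults.
    """
    # Top of the page first, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))

    # Group fragments whose baselines are close enough into one line.
    lines = []
    for frag in ordered:
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_tol:
                parts.append(" ")  # a visible gap becomes one space
            parts.append(cur.text)
        result.append("".join(parts))
    return "\n".join(result)


# Fragments deliberately stored out of reading order.
sample = [
    Fragment("world", 60.0, 700.0, 30.0),
    Fragment("Hello", 10.0, 700.0, 30.0),
    Fragment("ond", 28.5, 680.0, 18.0),
    Fragment("line", 52.0, 680.0, 20.0),
    Fragment("Sec", 10.0, 680.0, 18.0),
]

print(extract_in_stored_order(sample))   # worldHelloondlineSec
print(extract_by_coordinates(sample))    # Hello world / Second line (two lines)
```

In this sketch "Sec" and "ond" are only 0.5 units apart, so they are joined without a space, while the larger gaps become a single space. A threshold that is too small for a given font would produce the kind of unexpected spaces described at the start of this thread, which is why such thresholds are hard to get right for every document.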