How can we get the text from page.Contents?

CreativeCleaner · April 22, 2024, 9:30am

Due to certain special requirements, we need to operate on the page contents directly.
When we read text through the ShowText operators, it seems that some extra work is needed to convert the raw operator text into the real text.

The code:

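using System;
using System.Linq;
using Aspose.Pdf;
using Aspose.Pdf.Operators;

var pdfDocument = new Document(@"sample.pdf");

// Print the raw text operand of every ShowText operator on each page.
foreach (var page in pdfDocument.Pages)
{
    foreach (var item in page.Contents.OfType<ShowText>())
    {
        Console.WriteLine(item.Text);
    }
}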

The file:
sample.pdf (73.1 KB)

The output currently looks like this:
图片.png (1.0 KB)

But the text in the PDF is:
图片.png (3.7 KB)

How can we get the real text?

Thanks.

sergei.shibanov · April 22, 2024, 4:52pm

@CreativeCleaner
It looks like the wrong encoding is being used for the string in this case.
I’ll create a task for the development team.
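
For readers following along, here is a rough sketch of why the raw operand of a show-text operator can differ from the visible text. In a PDF, the operand holds character codes that are interpreted through the font's encoding, so decoding the bytes with the wrong encoding gives garbage. The sketch assumes a font with 2-byte codes and a ToUnicode-style lookup table. The class name GlyphCodeDemo, the Decode method, and the table contents are all made up for illustration; this is not Aspose.PDF API and not necessarily what happens in sample.pdf.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class GlyphCodeDemo
{
    // Hypothetical ToUnicode-style table: 2-byte character code -> Unicode text.
    // In a real PDF this mapping comes from the font, not from a hard-coded table.
    public static readonly Dictionary<int, string> ToUnicode = new Dictionary<int, string>
    {
        { 0x0001, "H" },
        { 0x0002, "i" },
    };

    // Reads the operand as big-endian 2-byte codes and maps each code through the table.
    // Unknown codes become U+FFFD (the Unicode replacement character).
    public static string Decode(byte[] operand)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < operand.Length; i += 2)
        {
            int code = (operand[i] << 8) | operand[i + 1];
            sb.Append(ToUnicode.TryGetValue(code, out var text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }

    // Treats every byte as one Latin-1 character, ignoring the font's encoding.
    // This is the kind of mismatch that produces unreadable output.
    public static string DecodeNaively(byte[] operand)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(operand);
    }
}
```

With this made-up table, GlyphCodeDemo.Decode(new byte[] { 0x00, 0x01, 0x00, 0x02 }) gives "Hi", while DecodeNaively on the same bytes gives four control characters.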

sergei.shibanov · April 22, 2024, 4:56pm

@CreativeCleaner
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-57087

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

CreativeCleaner · April 23, 2024, 3:05am

Thanks for your reply. We will wait for the result of the ticket.

sergei.shibanov · April 23, 2024, 5:49am

@CreativeCleaner
I will post here if there is any news on this issue.