Reading garbage text from PDF file

Modified Doc.PDF (15.9 KB)
Hi,

I am trying to read the attached file using Aspose.PDF in my C# project. When I do, Aspose returns garbage/encoded text. If I open the file in Adobe Acrobat Reader, it looks fine. But when I copy the text from the PDF and paste it into Notepad, I get the same garbage/encoded text.

What is the issue?

Thanks!

@sanchin,
The font “Helvetica” is embedded in the PDF document. Kindly install the required font on your system; you will then be able to copy and paste the text. Without the font, you cannot copy and paste the text. You can also connect a custom font directory to the Aspose.Pdf for .NET API before extracting text from the PDF document.

[C#]

using Aspose.Pdf.Text;

// Connect a custom font directory so Aspose.Pdf can find fonts stored there
FolderFontSource fs = new FolderFontSource(@"path\to\my\folder");
FontRepository.Sources.Add(fs);
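
For context, the extraction step that would follow the font registration might look like the minimal sketch below. It assumes the commonly documented `TextAbsorber` pattern in Aspose.Pdf for .NET; both paths are placeholders, and the class name `ExtractText` is only illustrative.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractText
{
    static void Main()
    {
        // Register the custom font folder first (placeholder path).
        FontRepository.Sources.Add(new FolderFontSource(@"path\to\my\folder"));

        // Load the document (placeholder path) and visit every page.
        using (Document doc = new Document(@"path\to\input.pdf"))
        {
            TextAbsorber absorber = new TextAbsorber();
            doc.Pages.Accept(absorber);

            // The absorber holds the text collected from all visited pages.
            Console.WriteLine(absorber.Text);
        }
    }
}
```

If the printed text matches what you get when copying from a PDF viewer, the garbling most likely comes from the file rather than from the extraction code.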

Best Regards,
Imran Rafique


Thank you for that information Imran!

A couple more follow-up questions.

  1. Where can I get the Helvetica font? I searched online and it looks like I would have to buy it, which is fine with me, but I am confused by the variety of Helvetica fonts available.
    Ex: Linotype has a ton of Helvetica fonts http://www.myfonts.com/fonts/linotype/helvetica/ . Will all variants of the Helvetica font work with Aspose?
  2. If my system does not have Helvetica, how is Adobe Acrobat Reader able to display the text correctly?

Thanks!

Hi Imran,

The issue might not be with the Helvetica font. The PDF I uploaded was part of a larger PDF document that contained some sensitive information. So I extracted one page from it using CutePDF and removed the sensitive data using Foxit PhantomPDF, which added the Helvetica font to it. Please look at the attached screenshots, which show the Fonts tab in the properties of the original PDF.

There are fonts with names like Z@RC6B8.tmp, with TrueType type and Built-In encoding. The original doc does not have Helvetica. What are my options for reading this PDF with Aspose.PDF?

We have a license for Aspose.PDF (which I am trying to track down with colleagues), so I might soon open a paid support ticket.

Thanks!

Font1.PNG (5.4 KB)
Font2.PNG (4.8 KB)
Font3.PNG (4.8 KB)
Font4.PNG (4.6 KB)

Hi Imran,

Please look at this new document. It does not have the Helvetica font but has the same issue.

Modified Doc without Helvetica.pdf (16.1 KB)

Thanks!

@sanchin,
You are right that the issue is not with the Helvetica font; it was applied only to the empty text fragments. We can see three fonts named Z@R2AD6.tmp, Z@R2A86.tmp and Z@R2B07.tmp, and these fonts do not appear in the font screenshots of the original PDF. We are unable to identify these fonts, and we are also retrieving garbage text with the Aspose.Pdf for .NET API. We have logged an investigation under the ticket ID PDFNET-42983 to retrieve the actual text as displayed in Acrobat Reader, or the closest possible match. We have linked your post to this ticket and will keep you informed of any updates.
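
To see which fonts a document uses, something like the sketch below may help. It assumes that the installed Aspose.Pdf for .NET version exposes `Document.FontUtilities.GetAllFonts()` and the `FontName` and `IsEmbedded` properties on `Aspose.Pdf.Text.Font`; please check these against your version. The path and the class name `ListFonts` are placeholders.

```csharp
using System;
using Aspose.Pdf;

class ListFonts
{
    static void Main()
    {
        // Placeholder path to the PDF under investigation.
        using (Document doc = new Document(@"path\to\input.pdf"))
        {
            // Assumption: FontUtilities.GetAllFonts() returns every font used in the document.
            foreach (Aspose.Pdf.Text.Font font in doc.FontUtilities.GetAllFonts())
            {
                Console.WriteLine($"{font.FontName} (embedded: {font.IsEmbedded})");
            }
        }
    }
}
```

Temporary-looking names such as Z@R2AD6.tmp in this list usually point to fonts written by the producing application rather than standard system fonts.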

Best Regards,
Imran Rafique

Hi Imran,

We have the option to open a paid support ticket. Would that make any difference in how fast this issue is resolved?

Thanks!

@sanchin,

Thanks for contacting support.

Priority Support and Enterprise Support do not guarantee an immediate solution; they only expedite the investigation process. If you still want to raise the priority of this issue, please let us know.

Hi @codewarior,

Yes, I would like to create a ticket under Priority or Enterprise support.

I have my organization's username and license key details, but I do not have the password to log into the account. Please let me know how to proceed.

Thanks!

@sanchin,

You can use the ‘I forgot my password’ option on the sign-in page to reset your password: https://passport.asposeptyltd.com/account/passwordreset?signin=de367014902a5e6bacf5e4ae1de37fbe

You need to log in with your Enterprise Support account in order to report a priority issue at https://helpdesk.aspose.com/. Please try this at your end and post in the Aspose.Request category if you still see any issues.

Best Regards,

Hi Ijaz,

I do not have access to the email address under which the license is registered. Our procurement department has those details and will not share them with us.

Is there any other way to open a paid support ticket?

Thanks!

@sanchin,
We have posted your query in the Aspose.Request forum: Cannot login to paid support helpdesk

One of our colleagues will assist you there soon. Once you can log in to the Paid Support Helpdesk, you can create a new post and share the ticket ID PDFNET-42983, which you want to escalate.

Best Regards,
Imran Rafique

Hi Imran,

Please cancel/close the post about the paid support. I confirmed with Aspose, and we do not have paid support. Sorry for the wrong information.

Thanks!

@sanchin,
Thank you for the confirmation. We have asked our colleague to close this request.

Best Regards,
Imran Rafique

@sanchin

We have investigated the issue (PDFNET-42983) and found that it is not a bug. The fonts mentioned above contain incorrect ToUnicode tables in their descriptions, so correct text extraction is impossible for this document. Adobe Acrobat returns a similar result. The ticket is now closed.
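
To illustrate why a PDF can display correctly and still extract as garbage, here is a simplified, library-free sketch. It is not how Aspose.Pdf or Acrobat are implemented; the character codes, the two tables, and the class name `ToUnicodeDemo` are all made up for the example. A viewer draws glyphs by character code using the embedded font, while text extraction translates the same codes through the font's ToUnicode table. If that table is wrong, the page looks fine but the extracted text does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToUnicodeDemo
{
    // Translate character codes to text through a code -> Unicode table,
    // which is the role a ToUnicode table plays during text extraction.
    public static string Extract(IEnumerable<byte> codes, IReadOnlyDictionary<byte, char> toUnicode)
    {
        return string.Concat(codes.Select(c => toUnicode.TryGetValue(c, out char ch) ? ch : '\uFFFD'));
    }

    public static void Main()
    {
        // Hypothetical character codes as they might appear in a page's content stream.
        byte[] codes = { 1, 2, 3, 3, 4 };

        // What the embedded glyphs actually look like (what a viewer shows).
        var glyphShapes = new Dictionary<byte, char> { [1] = 'H', [2] = 'e', [3] = 'l', [4] = 'o' };

        // A broken ToUnicode table that maps the same codes to unrelated characters.
        var brokenToUnicode = new Dictionary<byte, char> { [1] = '$', [2] = '#', [3] = '%', [4] = '&' };

        Console.WriteLine(Extract(codes, glyphShapes));     // prints "Hello"
        Console.WriteLine(Extract(codes, brokenToUnicode)); // prints "$#%%&"
    }
}
```

This is also why copying from a viewer gives the same garbage: copy and paste relies on the same mapping as programmatic extraction.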