PDF font names cannot be recognized

Hi There


I am using Aspose PDF 11.5.0.
I tried to use the following code to fetch the font names of all text fragments:

// Uses Aspose.PDF for Java and java.util.Iterator
Document doc = new Document("custom/input/pdf/1.pdf");
TextFragmentAbsorber absorber = new TextFragmentAbsorber(
        new TextEditOptions(
                TextEditOptions.FontReplace.RemoveUnusedFonts));
doc.getPages().accept(absorber);
TextFragmentCollection textFragmentCollection = absorber
        .getTextFragments();

// Print the font name of every text fragment
for (Iterator iterator = textFragmentCollection
        .iterator(); iterator.hasNext();) {
    TextFragment textFragment = (TextFragment) iterator.next();
    String fontName = textFragment.getTextState().getFont()
            .getFontName();
    System.out.println(fontName);
}


The result is:
TimesNewRoman
TimesNewRoman
TimesNewRoman
¼Ð·¢Åé
TimesNewRoman,Bold
¼Ð·¢Åé
TimesNewRoman,Bold
TimesNewRoman,Bold
TimesNewRoman,Bold
TimesNewRoman,Bold
TimesNewRoman,Bold
TimesNewRoman,Bold
TimesNewRoman,Bold
TimesNewRoman,Bold
¼Ð·¢Åé
TimesNewRoman,Bold
¼Ð·¢Åé
TimesNewRoman,Bold
¼Ð·¢Åé
TimesNewRoman,Bold
TimesNewRoman,Bold
TimesNewRoman,Bold
¼Ð·¢Åé
TimesNewRoman,Bold
TimesNewRoman,Bold


Some of the font names are not returned as readable names.
They come back garbled instead.

Please check and fix this, thanks :)

Hi Craig,

Thanks for your inquiry. I have tested your scenario with the shared document using Aspose.Pdf for .NET 11.6.0 and was able to reproduce the wrong font name detection. For further investigation, I have logged an issue in our issue tracking system as PDFNEWNET-40755 and linked your request to it. We will keep you updated via this thread regarding the issue status.

We are sorry for the inconvenience caused.

@craig.w.su

Thanks for your patience.

The problem was that the font name has a non-standard representation in the input PDF. Please see how differently this name is displayed by Adobe Acrobat and Foxit Reader (Acrobat.png and Foxit.png attached).

Acrobat.png (36.2 KB)
Foxit.png (37.2 KB)

This font name is stored in hexadecimal form, and decoding its bytes directly as ASCII symbols is not correct in this case; the name has to be decoded by the rules specific to this font.
A new property, "DecodedFontName", was added to the Aspose.Pdf.Text.Font class, which returns the font name in a readable form.
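
To illustrate why direct byte-to-character decoding fails here: the garbled name ¼Ð·¢Åé from the output above is what you get when the Big5 bytes of 標楷體 are read one byte per character (ISO-8859-1). Below is a minimal Java sketch, assuming the name bytes are Big5-encoded (an assumption that matches the decoded name reported in this thread, not something confirmed from the PDF itself). The class and method names are hypothetical, and this is not necessarily how DecodedFontName works internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FontNameDecoding {
    // Hypothetical helper: take a string that was decoded one byte per
    // character (ISO-8859-1), recover the raw bytes, and decode them
    // again with the charset assumed to be the real one.
    static String redecode(String garbled, Charset actual) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        // The garbled name from the thread, written with Unicode escapes
        // so the source file encoding does not matter.
        String garbled = "\u00BC\u00D0\u00B7\u00A2\u00C5\u00E9";
        // Assumption: the underlying bytes are Big5.
        String readable = redecode(garbled, Charset.forName("Big5"));
        System.out.println(readable);
    }
}
```

This kind of re-decoding only works when the correct charset is known in advance, which is why a dedicated property on the font object is the more reliable route.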

A code snippet to get readable font names with this property could look like this:

public static void ReadFonts()
{
    Document pdf = new Document(@"C:\Users\Home\Downloads\1 (2).pdf");

    // Used as a set: each distinct font name is stored once.
    Dictionary<string, string> fontNames = new Dictionary<string, string>();

    for (int i = 1; i <= pdf.Pages.Count; i++)
    {
        // Fonts referenced directly by the page resources.
        foreach (Aspose.Pdf.Text.Font font in pdf.Pages[i].Resources.Fonts)
        {
            string fontName = font.DecodedFontName;
            if (!fontNames.ContainsKey(fontName))
                fontNames.Add(fontName, fontName);
        }

        // Fonts referenced by XForms on the page.
        if (pdf.Pages[i].Resources.Forms.Count > 0)
            ProcessXForms(pdf.Pages[i].Resources.Forms, fontNames);
    }

    foreach (string fontName in fontNames.Keys)
    {
        Console.WriteLine("Font {0} on page resource", fontName);
    }

    pdf.Dispose();
}

public static void ProcessXForms(Aspose.Pdf.XFormCollection forms, Dictionary<string, string> fontNames)
{
    foreach (Aspose.Pdf.XForm form in forms)
    {
        if (form.Resources.Fonts != null)
        {
            foreach (Aspose.Pdf.Text.Font font in form.Resources.Fonts)
            {
                string fontName = font.DecodedFontName;
                if (!fontNames.ContainsKey(fontName))
                    fontNames.Add(fontName, fontName);
            }
            // Recursive call for nested XForms.
            if (form.Resources.Forms.Count > 0)
                ProcessXForms(form.Resources.Forms, fontNames);
        }
    }
}

The above snippet produces the following output:

Font TimesNewRoman on page resource
Font NEPBJB+標楷體 on page resource
Font NEPBPB+TimesNewRoman,Bold on page resource
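
The six uppercase letters followed by "+" in front of two of these names (for example NEPBPB+) are the tag that PDF uses to mark an embedded font subset. If only the base font name is needed, the tag can be stripped in your own code. Below is a minimal Java sketch; the class and method names are hypothetical and not part of the Aspose API.

```java
import java.util.regex.Pattern;

public class SubsetTag {
    // Subset tag: exactly six uppercase ASCII letters followed by '+'
    // at the start of the font name.
    private static final Pattern SUBSET_PREFIX = Pattern.compile("^[A-Z]{6}\\+");

    // Hypothetical helper: returns the name without the subset tag,
    // or the name unchanged when no tag is present.
    static String stripSubsetTag(String fontName) {
        return SUBSET_PREFIX.matcher(fontName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("NEPBPB+TimesNewRoman,Bold"));
        System.out.println(stripSubsetTag("TimesNewRoman"));
    }
}
```

With the names from this thread, the tagged name is reduced to TimesNewRoman,Bold and the untagged TimesNewRoman is left as it is.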

Please try the latest release, Aspose.Pdf for .NET 17.12, and if you face any issue, feel free to contact us.