Extract text from PDF using C# and Aspose.PDF - Invalid Font Exception is thrown

dbartle · November 10, 2020, 3:04pm

We have a PDF for which we get an exception: System.ArgumentException: Invalid font name when using TextAbsorber:
thePdfDocument.Pages[i].Accept(textAbsorber);

The release notes below seem to indicate that this was fixed, but I still get that exception even with the latest version (20.11).

asad.ali · November 10, 2020, 10:53pm

@dbartle

Could you please share the issue ID or the forum thread link where the issue was originally reported?

dbartle · November 12, 2020, 3:03pm

I did not previously report it; when we encountered the error I saw that it was listed in the Release Notes as fixed, so I upgraded to the version that reported it fixed and still saw the issue.

dbartle · November 12, 2020, 3:05pm

PDFNET-48584 ArgumentException: Invalid font name on Pages.Accept(); Bug

asad.ali · November 13, 2020, 12:06am

@dbartle

Please note that some issues are specific to particular PDF documents, and a fix may apply only to those documents. The referenced issue was investigated earlier and found to be NOT A BUG in the API: the exception occurred because a font used in the problem PDF was not present in the file.

We have also introduced an option in the API to suppress this error (it is raised when a font used by the text is missing from the PDF resources; such a file also shows an error when opened in Adobe Reader). If you need document processing to continue regardless of the error, you can now do so with the new IgnoreResourceFontErrors option in TextSearchOptions. Its values mean:

true - missing-font errors are ignored. Text segments that refer to incorrect resources are skipped during processing.
false (default) - a missing-font error terminates processing by throwing an exception (as before).

We used the following code for testing:

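// Requires: using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
// GetInputPath/GetOutputPath are not defined in this snippet (presumably test helpers); replace them with your own file paths.
Document doc = new Document(GetInputPath("Aspose.Pdf/48584.pdf"));

// Configure the absorber to skip text segments whose font resource is missing.
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
TextSearchOptions textSearchOptions = new TextSearchOptions(false); // false: no regular expression search
textSearchOptions.IgnoreResourceFontErrors = true;
textFragmentAbsorber.TextSearchOptions = textSearchOptions;

// Page numbering starts at 1.
doc.Pages[1].Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
int count = textFragmentCollection.Count; // not zero
File.WriteAllText(GetOutputPath("48584.txt"), textFragmentAbsorber.Text);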
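
Since the original question used TextAbsorber over every page rather than TextFragmentAbsorber on a single page, the same option can be sketched for that case as follows. This is only a minimal sketch under assumptions: the input path "input.pdf" is a placeholder, the class name is made up for illustration, and the try/catch around each page is an extra fallback that is not part of the fix described above. It requires the Aspose.PDF package.

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical class name, used only for this illustration.
class ExtractTextIgnoringFontErrors
{
    static void Main(string[] args)
    {
        // Placeholder path (assumption); pass a real file as the first argument.
        string inputPath = args.Length > 0 ? args[0] : "input.pdf";
        StringBuilder extracted = new StringBuilder();

        using (Document doc = new Document(inputPath))
        {
            // Page numbering starts at 1.
            for (int i = 1; i <= doc.Pages.Count; i++)
            {
                // Use a fresh absorber per page so each page's text is appended once.
                TextAbsorber textAbsorber = new TextAbsorber();
                TextSearchOptions options = new TextSearchOptions(false);
                options.IgnoreResourceFontErrors = true;
                textAbsorber.TextSearchOptions = options;

                try
                {
                    doc.Pages[i].Accept(textAbsorber);
                    extracted.Append(textAbsorber.Text);
                }
                catch (ArgumentException ex)
                {
                    // Assumed fallback: log the page and continue with the next one.
                    Console.Error.WriteLine("Page " + i + " skipped: " + ex.Message);
                }
            }
        }

        Console.WriteLine(extracted.ToString());
    }
}
```

Catching the exception per page means one problematic page does not discard the text already extracted from the others, and the logged page numbers help narrow down which part of the document to share for investigation.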

cormack.milyli · July 13, 2022, 3:30pm

We’re on version 21.9.0, and we were hitting an “Invalid Font Name” exception when extracting the text from a particular document. We tried the TextSearchOptions.IgnoreResourceFontErrors = true solution outlined above on both TextAbsorber and TextFragmentAbsorber, to no avail. We’ve even tried upgrading to 21.12, 22.1, and 22.6 and downgrading to 20.11 and 20.10, all with the same result.

asad.ali · July 13, 2022, 9:13pm

@cormack.milyli

Could you please share your sample PDF document for our reference? We will test the scenario in our environment and address it accordingly.