How to replace custom embedded fonts with Arial in a PDF file using the Aspose PDF library

SenthilRG27 · June 20, 2023, 5:34pm

Hi,

I am trying to replace the custom embedded fonts with “Arial” throughout a PDF file using the code below (Aspose PDF library). When I copy content from the PDF and paste it into Notepad, the pasted text comes out changed. Could you please help me resolve this issue?



var pdfDocument = new Document(_fileInfo.FullName);
// Set EmbedStandardFonts property of document
pdfDocument.EmbedStandardFonts = true;
foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
{
   if (page.Resources.Fonts != null)
   {
       foreach (Aspose.Pdf.Text.Font pageFont in page.Resources.Fonts)
       {
           // Check if font is already embedded
           if (!pageFont.IsEmbedded)
           {
               pageFont.IsEmbedded = true;
           }
       }
   }
}
TextFragmentAbsorber absorber = new TextFragmentAbsorber(new TextEditOptions(TextEditOptions.FontReplace.Default));

// Accept the absorber for all the pages
pdfDocument.Pages.Accept(absorber);
// Traverse through all the TextFragments
foreach (TextFragment textFragment in absorber.TextFragments)
{
    if (textFragment.TextState.Font.FontName == "Z@RA515.tmp")
    {
        textFragment.TextState.Font = FontRepository.FindFont("Arial");
        textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Blue);
        textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Green);
    }
}
var fileNameWithOutExtension = Path.GetFileNameWithoutExtension(_fileInfo.FullName);
FileUtility.ValidateFileName(_outputFolderPath, fileNameWithOutExtension, ".pdf", out string newFileName); 
pdfDocument.Save(Path.Combine(_outputFolderPath, newFileName));

This Topic is created by amjad.sahi using Email to Topic tool.

asad.ali · June 20, 2023, 9:42pm

@SenthilRG27

Can you please share the sample document for our reference as well? We will test the scenario in our environment and address it accordingly.

SenthilRG27 · June 22, 2023, 6:30pm

Hi Asad,

Please find the sample PDF file attached and try to replace its embedded fonts with the “Arial” font.

SampleFile.pdf (38.9 KB)

If you copy the text highlighted in image1.png from the attached PDF file, the pasted content looks like this:

􀁓􀁉􀁗􀀄􀀴􀁅􀁖􀁘􀀄􀀭􀀄􀁅􀁔􀁔􀁐􀁝􀀣􀀄􀀭􀁊􀀄􀁱􀀽􀁉􀁗􀀐􀁲􀀄􀁇􀁓􀁑􀁔􀁐􀁉􀁘􀁉􀀄􀁅􀁒􀁈􀀄􀁅􀁘􀁘􀁅􀁇􀁌􀀄􀀴􀁅􀁖􀁘􀀄􀀭􀀄􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀒 􀀕

image1.png (77.5 KB)

We have also seen some fonts with names of the form Z@R***.tmp inside the PDF file.

image2.png (10.8 KB)

Please help us replace these fonts with the “Arial” font in the PDF file.

asad.ali · June 22, 2023, 10:43pm

@SenthilRG27

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54903

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

SenthilRG27 · June 26, 2023, 2:56pm

Any updates please?

asad.ali · June 27, 2023, 12:00am

@SenthilRG27

We are afraid that there are no updates yet. Please note that the ticket was logged recently and will be resolved on a first-come, first-served basis. As soon as we have an update on its resolution, we will let you know. Please be patient and give us some time.

We are sorry for the inconvenience.

SenthilRG27 · July 13, 2023, 7:50am

Any updates please?

asad.ali · July 13, 2023, 3:51pm

@SenthilRG27

We are afraid that the ticket logged earlier has not yet been resolved because of other issues ahead of it in the queue. Nevertheless, we will inform you once we have news about its resolution or an ETA for the fix. Please give us some time.

We are sorry for the inconvenience.

SenthilRG27 · July 28, 2023, 9:47am

I raised this issue a month ago, and there is still no fix available. We use a paid Aspose Total license for our development work, and we are unable to give our customers an answer. I have previously raised many issues for the Aspose Cells library, and fixes for those were provided very quickly. Please work on this and update me once it is done.

asad.ali · July 28, 2023, 5:44pm

@SenthilRG27

Please accept our apology for the inconvenience you have faced. Please note that in the free support model, issues are resolved on a first-come, first-served basis. The resolution time of an issue depends on various factors, including the number of issues logged before it. Nevertheless, we have recorded your concerns and will take them into account during the investigation. We will try to schedule the investigation as soon as possible and share the ETA with you. We apologize again for the inconvenience.

SenthilRG27 · August 15, 2023, 10:46am

Is there any update on the issue above? I have been waiting almost two months for the fix.

asad.ali · August 15, 2023, 4:17pm

@SenthilRG27

We have investigated the issue and found the following facts:

Text written with the fonts Z@RA515.tmp and Z@RA675.tmp cannot be extracted from the SampleFile.pdf document because the font declarations contain no information for matching a glyph (glyph ID) to a character (Unicode). This is not a bug. Paragraph 9.10 of the PDF reference says that text extraction may not be available for some PDFs. It looks like the creator of the PDF deliberately took steps to prevent text from being extracted from the document. Adobe Acrobat is also unable to extract text written with these fonts.
Replacing the font cannot fix the text extraction problem, because the glyph IDs of the original font differ from the glyph IDs in the Arial font. We therefore cannot match the glyphs of the two fonts to each other, just as we cannot match them to Unicode. Moreover, performing the replacement swaps each glyph for a different one and corrupts the text. We do not recommend font replacement when the document does not allow correct text extraction. It may lead to unexpected results.
The font-embedding part of the code does nothing because the fonts in the original document are already embedded.
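
To illustrate the first point, here is a minimal sketch (not part of the original reply) of one way to check whether extracted text is usable before attempting font replacement. It assumes that glyphs without a Unicode mapping come out as private-use code points, which appears to be the case for the copied sample text earlier in this thread; other documents may fail in different ways. The names ExtractabilityCheck, IsPrivateUse and PrivateUseRatio are hypothetical, and the code uses only the .NET base library, not Aspose.PDF.

```csharp
using System;

public static class ExtractabilityCheck
{
    // True when the code point lies in one of the three Unicode private-use ranges.
    public static bool IsPrivateUse(int codePoint) =>
        (codePoint >= 0xE000 && codePoint <= 0xF8FF) ||
        (codePoint >= 0xF0000 && codePoint <= 0xFFFFD) ||
        (codePoint >= 0x100000 && codePoint <= 0x10FFFD);

    // Fraction of non-whitespace code points that are private-use.
    // Returns 0.0 for empty or whitespace-only input.
    public static double PrivateUseRatio(string text)
    {
        int total = 0, privateUse = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i])) continue;

            int codePoint;
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // Combine a surrogate pair into a single code point.
                codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
                i++;
            }
            else
            {
                codePoint = text[i];
            }

            total++;
            if (IsPrivateUse(codePoint)) privateUse++;
        }
        return total == 0 ? 0.0 : (double)privateUse / total;
    }

    public static void Main()
    {
        string readable = "See Part I";
        // Two supplementary private-use code points, standing in for unmapped glyphs.
        string garbled = char.ConvertFromUtf32(0x100053) + char.ConvertFromUtf32(0x100049);

        Console.WriteLine(PrivateUseRatio(readable)); // prints 0
        Console.WriteLine(PrivateUseRatio(garbled));  // prints 1
    }
}
```

With Aspose.PDF, the input string could come from the Text property of a TextAbsorber after calling pdfDocument.Pages.Accept on it. What ratio should count as "not extractable" is a judgment call; a high value only suggests that font replacement is likely to corrupt the text, as described above.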

In summary, the fonts Z@RA515.tmp and Z@RA675.tmp from the SampleFile.pdf document cannot be correctly replaced with any other font, and this is not a bug. We will consider requiring explicit confirmation of the operation for documents of this type, to avoid unexpected results.

The only solution we can propose for the SampleFile.pdf document is a workaround:

convert document to picture;
create a new document based on the picture;
perform an OCR to make the document searchable;
replace Helvetica font with Arial if you still need it.

It may be performed using a code snippet like the following:

string dataDir = @"C:\";

// Convert the first page of the input PDF to a PNG image
var inputDoc = new Document(dataDir + "SampleFile.pdf");
Aspose.Pdf.Page srcPage = inputDoc.Pages[1];
var pngCreator = new Aspose.Pdf.Devices.PngDevice(new Aspose.Pdf.Devices.Resolution(300));
// A FileStream is used because the PNG output is binary data, not text
using (var pngStream = new System.IO.FileStream(dataDir + "SampleFile_p1.png", System.IO.FileMode.Create))
{
    pngCreator.Process(srcPage, pngStream);
}

//Create new PDF with the image
var tempDoc = new Document();
var page = tempDoc.Pages.Add();
page.PageInfo.Margin = new MarginInfo(0, 0, 0, 0);
Aspose.Pdf.Image image = new Aspose.Pdf.Image();
image.File = dataDir + "SampleFile_p1.png";
tempDoc.Pages[1].Paragraphs.Add(image);
tempDoc.Save(dataDir + "SampleFile_p1_Img.pdf");

//Make PDF with image searchable
var searchableDoc = new Document(dataDir + "SampleFile_p1_Img.pdf");
searchableDoc.Convert(CallBackGetHocr);
searchableDoc.Save(dataDir + "SampleFile_p1_searchable.pdf");

// Replace the font with Arial
var outDoc = new Document(dataDir + "SampleFile_p1_searchable.pdf");
var font = FontRepository.FindFont("Arial");
TextFragmentAbsorber absorber = new TextFragmentAbsorber(new TextEditOptions(TextEditOptions.FontReplace.Default));
outDoc.Pages.Accept(absorber);
foreach (TextFragment textFragment in absorber.TextFragments)
{
    // Reuse the font looked up once above
    textFragment.TextState.Font = font;
}
outDoc.Save(dataDir + "SampleFile_p1_out.pdf");

string CallBackGetHocr(System.Drawing.Image img)
{
    string dataDir = @"C:\";
    img.Save(dataDir + "ocr.jpg");
    ///V3.02
    System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    info.Arguments = string.Format("{0}ocr.jpg {0}out hocr", dataDir);
    System.Diagnostics.Process p = new System.Diagnostics.Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
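    // Read the hOCR output; the using block closes the file even if reading fails.
    // (The out.html name matches this snippet's Tesseract setup; other versions may use a different extension.)
    using (StreamReader streamReader = new StreamReader(dataDir + "out.html"))
    {
        return streamReader.ReadToEnd();
    }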
}

This produces a correct PDF with extractable text in the Arial font on an invisible text layer. (See: SampleFile_p1_out.pdf, processing_result.png)
processing_result.png (258.1 KB)
SampleFile_p1_out.pdf (16.5 KB)

SenthilRG27 · August 22, 2023, 1:54pm

Thank you for the detailed solution.