PDF to HTML conversion - new font set for every character

Hello,

While converting a PDF to HTML (see the attached file), characters are not recognized as UTF-8 characters, so a font-based representation is generated instead. However, a separate font-family is generated for each “character” in the PDF. Is this expected behavior, and is it possible to use one font family for the whole PDF content when characters are not identified as valid UTF-8 characters?

asposeScreen.jpg (327.2 KB)

Below is the fragment of the CSS file that corresponds to the letter selected in asposeScreen.jpg:

@font-face {
    font-family: "ASOCGB+CalibriRegular-Identity-H";
    src: url("xxx") format("woff");
}
.stl_355 {
    line-height: 1.046555em;
    font-size: 0.86em;
    font-family: "ASOCGB+CalibriRegular-Identity-H", "Times New Roman";
    color: #221E1F;
}

examplePdf.pdf (2.6 MB)

Regards

Tomasz

@top

Thanks for contacting support.

We have logged an investigation ticket, PDFNET-43529, in our issue tracking system for the requirement of specifying a single font when creating HTML from a PDF. Our product team will look into it and share their feedback. As soon as we receive an update from them, we will let you know.

Furthermore, we used Aspose.Pdf for .NET 17.10 to convert your PDF to HTML, and the HTML output looked better than the one shown in your screenshot. Please use the latest version of the API, as it is always recommended and gives better output. For your reference, we have attached the generated output along with the code snippet.

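// Load the source PDF (dataDir is the folder that contains the file)
Document exportDoc = new Document(dataDir + "examplePdf.pdf");
HtmlSaveOptions htmlOptions = new HtmlSaveOptions();
// Save the fonts used by the document as TTF
htmlOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsTTF;
// Lower the priority of the ToUnicode table when encoding fonts
htmlOptions.FontEncodingStrategy = HtmlSaveOptions.FontEncodingRules.DecreaseToUnicodePriorityLevel;
// Write the HTML output next to the source file
exportDoc.Save(dataDir + "examplePdf.html", htmlOptions);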

PdfExample.zip (966.7 KB)

We are sorry for the inconvenience.