Wrong characters in the result of converting a PDF file into HTML, with font substitution

craig.w.su · July 24, 2017, 1:03am

Hi there

I am using Aspose PDF to convert a PDF file into HTML format, with font substitution.

Here is the code I used for testing:

Test Case:

@Test
public void asposeConvert() throws FileNotFoundException, IOException {

  // Create the font substitution rule and register it
  TestFontSubRule subst = new TestFontSubRule();
  FontRepository.getSubstitutions().add(subst);
  
  String fileName = "10mincsiegraduate-160608073616.pdf";
  Document pdf = new Document("custom/input/pdf/" + fileName);
  
  File dir = new File("custom/output/pdf/" + fileName + "/");
  dir.mkdirs();

  HtmlSaveOptions htmlSaveOps = new HtmlSaveOptions();
  htmlSaveOps.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
  htmlSaveOps.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;
  htmlSaveOps.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
  htmlSaveOps.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
  htmlSaveOps.setSplitIntoPages(false);
  htmlSaveOps.setPreventGlyphsGrouping(true);

  for (int p = 1; p <= pdf.getPages().size(); p++) {
  	Document pageDoc = new Document();
  	pageDoc.getPages().add(pdf.getPages().get_Item(p));

  	final StringBuilder htmlBuffer = new StringBuilder();
  	htmlSaveOps.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy() {
  		@Override
  		public void invoke(com.aspose.pdf.HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo) {
  			try {
  				htmlBuffer.append(IOUtils.toString(htmlSavingInfo.ContentStream, "utf8"));
  			} catch (IOException e) {
  				// FileNotFoundException is an IOException; rethrow instead of silently producing an empty page
  				throw new RuntimeException(e);
  			} finally {
  				IOUtils.closeQuietly(htmlSavingInfo.ContentStream);
  			}
  		}
  	};

  	String outHtmlFile = "SomeUnexistingFile.html";
  	pageDoc.save(outHtmlFile, htmlSaveOps);
  	
  	String html = htmlBuffer.toString();
  	
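  	// Close the output stream once the page markup has been written
  	try (FileOutputStream out = new FileOutputStream("custom/output/pdf/" + fileName + "/" + p + ".html")) {
  		IOUtils.write(html.getBytes("utf8"), out);
  	}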
  }

}

TestFontSubRule class

public class TestFontSubRule extends CustomFontSubstitutionBase {
  public boolean trySubstitute(
      CustomFontSubstitutionBase.OriginalFontSpecification originalFontSpecification,
      /* out */ com.aspose.pdf.Font[] substitutionFont) {
    System.out.println(originalFontSpecification.getOriginalFontName());
    if (originalFontSpecification.getOriginalFontName().contains("DFKaiShu")) {
      substitutionFont[0] = FontRepository.findFont("HanWangMingLight");
      return true;
    } else {
      return false;
    }
  }
}
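
A side note on the name check in the rule above: fonts embedded as subsets in a PDF usually carry a tag of six uppercase letters and a plus sign in front of the real font name, which is one reason to match with contains rather than equals. The sketch below shows that matching in isolation, without any Aspose dependency. The class FontNameMatcher, its two methods, and the sample name "ABCDEF+DFKaiShu-SB-Estd-BF" are hypothetical and only for illustration. Whether getOriginalFontName() returns the name with or without the tag is an assumption that the println in the rule can confirm.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF: isolates the name test used in the rule above.
public class FontNameMatcher {
    // A subset tag is six uppercase letters followed by '+', e.g. "ABCDEF+".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Returns the font name without a leading subset tag, if one is present.
    public static String stripSubsetTag(String fontName) {
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    // True when the name, after removing any subset tag, starts with the given family.
    public static boolean matchesFamily(String originalFontName, String family) {
        if (originalFontName == null) {
            return false;
        }
        return stripSubsetTag(originalFontName).startsWith(family);
    }

    public static void main(String[] args) {
        // The sample names are made up for illustration.
        System.out.println(stripSubsetTag("ABCDEF+DFKaiShu-SB-Estd-BF"));
        System.out.println(matchesFamily("ABCDEF+DFKaiShu-SB-Estd-BF", "DFKaiShu"));
        System.out.println(matchesFamily("Arial", "DFKaiShu"));
    }
}
```

Matching on the start of the stripped name is slightly stricter than contains, so it would not fire for an unrelated font that merely has "DFKaiShu" somewhere in the middle of its name.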

I came across a PDF file whose converted HTML contains some wrong characters.
I have uploaded the conversion result, the PDF file, and a comparison image.
Please check the attachments and look into this issue. Thank you.
10mincsiegraduate-160608073616.pdf (207.0 KB)
comparison_page#2.JPG (50.0 KB)

(Result page files. Rename them to the form “*.zip.001” in order to unzip them.)
10mincsiegraduate-160608073616.pdf.001.zip (3 MB)
10mincsiegraduate-160608073616.pdf.002.zip (3 MB)
10mincsiegraduate-160608073616.pdf.003.zip (3 MB)
10mincsiegraduate-160608073616.pdf.004.zip (3 MB)
10mincsiegraduate-160608073616.pdf.005.zip (532.6 KB)

Craig

imran.rafique · July 24, 2017, 11:26am

@craig.w.su,
We managed to replicate the problem of wrong characters in the output HTML pages. It has been logged under the ticket ID PDFJAVA-36936 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates. We are sorry for the inconvenience caused.

Best Regards,
Imran Rafique