Characters are missing in the result of saving a pdf file into HTML format

Hi,
I am using Aspose.PDF 17.9 to save PDF files in HTML format.
Here is the test code:

String fileName = "BOX_v4_20170929_1.pdf";
Document pdf = new Document("custom/input/pdf/" + fileName);
new File("custom/output/pdf/" + fileName + "/").mkdirs();

for (int p = 1; p <= pdf.getPages().size(); p++) {
	System.out.println("Page:" + p);
	Document pageDoc = new Document();
	pageDoc.getPages().add(pdf.getPages().get_Item(p));
	pageDoc.getPageInfo().setMargin(new MarginInfo(0, 0, 0, 0));

	HtmlSaveOptions htmlSaveOps = new HtmlSaveOptions();
	htmlSaveOps.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
	htmlSaveOps.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;
	htmlSaveOps.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
	htmlSaveOps.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
	htmlSaveOps.setSplitIntoPages(false);
	htmlSaveOps.setPreventGlyphsGrouping(true);

	final StringBuilder htmlBuffer = new StringBuilder();
	htmlSaveOps.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy() {
		@Override
		public void invoke(HtmlPageMarkupSavingInfo htmlSavingInfo) {
			try {
				htmlBuffer.append(IOUtils.toString(htmlSavingInfo.ContentStream, "utf8"));
			} catch (Exception e) {
				e.printStackTrace();
			} finally {
				IOUtils.closeQuietly(htmlSavingInfo.ContentStream);
			}
		}
	};

	String outHtmlFile = "SomeUnexistingFile.html";
	pageDoc.save(outHtmlFile, htmlSaveOps);
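	// Write the captured markup to the per-page file; try-with-resources closes the stream.
	try (FileOutputStream out = new FileOutputStream("custom/output/pdf/" + fileName + "/" + p + ".html")) {
		IOUtils.write(htmlBuffer.toString().getBytes("UTF-8"), out);
	}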
}

Issues:
1. Several characters are missing in the result.
   When we checked the resulting HTML file, we found that “visibility:hidden” had been added to them.

2. Even after we remove “visibility:hidden”,
   some of the Chinese characters in the HTML are not the same as in the original PDF file.
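
For anyone reproducing this, the manual removal of “visibility:hidden” described above can be scripted on the generated markup. This is only a rough diagnostic sketch using plain Java regex, not part of the Aspose API: the class name HiddenStyleStripper and its methods are made up for illustration, and it assumes the declaration appears literally in the HTML or CSS text. It does not address the second issue with the wrong Chinese glyphs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Aspose class): finds and removes
// "visibility:hidden" declarations in a string of generated HTML/CSS.
final class HiddenStyleStripper {
    // Tolerates whitespace around the colon and an optional trailing semicolon.
    private static final Pattern HIDDEN =
            Pattern.compile("visibility\\s*:\\s*hidden\\s*;?", Pattern.CASE_INSENSITIVE);

    // Counts how many hidden declarations the markup contains.
    static int count(String html) {
        Matcher m = HIDDEN.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns the markup with every hidden declaration removed.
    static String strip(String html) {
        return HIDDEN.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch; \u4e2d is a CJK character.
        String sample = "<span style=\"visibility:hidden;color:red\">\u4e2d</span>";
        System.out.println("hidden declarations: " + count(sample));
        System.out.println(strip(sample));
    }
}
```

In the test code above, strip could be applied to htmlBuffer.toString() before the file is written, and count gives a quick check of how many glyphs were hidden on each page.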

result and images.zip (1.1 MB)
BOX_v4_20170929_1.pdf (249.3 KB)

I have uploaded some images that describe the issue, along with the PDF file and the result.
Please check the attachments and look into this issue. Thank you.

Craig

@craig.w.su,

We have tested your source PDF with the latest version 17.9 of Aspose.Pdf for Java API and managed to replicate the said issues. It has been logged under the ticket ID PDFJAVA-37194 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.