Hi Aspose team,
We are converting PDF files to HTML for cross-platform reading with Aspose PDF 11.7.0.
In the resulting HTML, some characters stick together (overlap), which differs from the original PDF file and makes the text unreadable.
Here is the code we used for testing:
Document pdf = new Document("custom/input/pdf/p7_1.pdf");
// Convert each page to its own HTML file.
for (int p = 1; p <= pdf.getPages().size(); p++) {
    Document pageDoc = new Document();
    pageDoc.getPages().add(pdf.getPages().get_Item(p));
    HtmlSaveOptions htmlSaveOps = new HtmlSaveOptions();
    htmlSaveOps.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    htmlSaveOps.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;
    htmlSaveOps.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    htmlSaveOps.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    htmlSaveOps.setSplitIntoPages(false);
    final ByteArrayOutputStream stream = new ByteArrayOutputStream();
    // Capture the generated HTML markup in memory.
    htmlSaveOps.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy() {
        @Override
        public void invoke(com.aspose.pdf.HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo) {
            byte[] resultHtmlAsBytes = new byte[(int) htmlSavingInfo.ContentStream.getLength()];
            htmlSavingInfo.ContentStream.read(resultHtmlAsBytes, 0, resultHtmlAsBytes.length);
            try {
                stream.write(resultHtmlAsBytes);
                stream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    };
    // Placeholder file name; the markup is handled by the custom strategy above.
    String outHtmlFile = "SomeUnexistingFile.html";
    pageDoc.save(outHtmlFile, htmlSaveOps);
    // Write the captured markup to the per-page output file and close it.
    try (FileOutputStream out = new FileOutputStream("custom/output/pdf/p7_1." + p + ".html")) {
        IOUtils.write(stream.toByteArray(), out);
    }
}
Is there any option to fix this?
BTW, the Chinese text in this PDF is arranged vertically. I hope this information helps.
I've attached the original PDF file and the resulting HTML file. Please take a look, thank you.
Best,
Craig
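
P.S. A side note on the callback in the code above: it copies the markup with a single read call. With a standard java.io.InputStream, one read is not guaranteed to fill the buffer, so a loop is the safer pattern. The sketch below shows that pattern under the assumption that the content is available as a java.io.InputStream. StreamDrain and readFully are hypothetical names of our own, not part of Aspose.PDF, and htmlSavingInfo.ContentStream is Aspose's own stream type, so this may not apply to it directly. We do not think this is related to the characters sticking together.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not an Aspose.PDF class): drains a standard
// java.io.InputStream completely into a byte array.
final class StreamDrain {
    private StreamDrain() {}

    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        // read() may return fewer bytes than requested, so keep reading
        // until it returns -1 (end of stream).
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

It would be called as byte[] html = StreamDrain.readFully(someInputStream); where someInputStream is any java.io.InputStream.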