Spaces lost when converting PDF to HTML

Hi,
When I convert a PDF to HTML using Aspose.PDF for Java, spaces are lost.

In my case, "Narrow linewidth lasers are necessary as local" is converted to "Narrowlinewidthlasersarenecessaryaslocaloscil".
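For anyone trying to reproduce this, the symptom described above can be checked without opening the HTML in a browser. The following is only a rough sketch using the standard library: the class and method names are hypothetical, the tag stripping is deliberately crude (it ignores entities such as &nbsp; and any CSS positioning), and it is not part of Aspose.PDF.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF: a crude check for lost spaces.
final class SpaceLossCheck {
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes anything that looks like an HTML tag and returns the remaining text.
    static String visibleText(String html) {
        return TAGS.matcher(html).replaceAll("");
    }

    // Returns true when the phrase is missing from the HTML text
    // but the same phrase with all whitespace removed is present.
    static boolean spacesLost(String html, String expectedPhrase) {
        String text = visibleText(html);
        if (text.contains(expectedPhrase)) {
            return false; // phrase survived with its spaces intact
        }
        String squeezed = WHITESPACE.matcher(expectedPhrase).replaceAll("");
        return text.contains(squeezed);
    }
}
```

Calling SpaceLossCheck.spacesLost(html, "Narrow linewidth lasers are necessary as local") on the converted output would then flag the problem shown in the screenshots.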

Original pdf screenshot:
Image20190315113904.png (46.5 KB)
Converted html screenshot:
Image20190315113943.png (2.8 KB)

Here’s my test pdf:
100024.pdf (226.7 KB)

Here’s my testing code:

Document pdf = new Document(pdfFile.getAbsolutePath());

HtmlSaveOptions options = new HtmlSaveOptions();
options.setFixedLayout(false);
options.setSplitIntoPages(false);
options.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsTTF;
options.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsExternalPngFilesReferencedViaSvg;
options.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedCssOnly;

pdf.save(htmlFile.getAbsolutePath(), options);

@titanseason

Thank you for contacting support.

Please install the attached fonts in the default font directory, or set a path to the fonts using FontRepository.addLocalFontPath() or the function below, with Aspose.PDF for Java 19.2.

String path = "path/to/my/folder";
List<String> fontPaths = FontRepository.getLocalFontPaths();
fontPaths.add(path);
FontRepository.setLocalFontPaths(fontPaths);

100024Fonts.zip

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

I didn’t install the fonts, but I set a path to them using FontRepository.addLocalFontPath(). Spaces are still lost.

Here’s my full test code:

public static int pdfToHtml(File pdfFile, File htmlFile) {
    try {
        addDefaultFonts(); // add fonts

        Document pdf = new Document(pdfFile.getAbsolutePath());

        HtmlSaveOptions options = new HtmlSaveOptions();
        options.setFixedLayout(false);
        options.setSplitIntoPages(false);
        options.setExtractOcrSublayerOnly(true);
        options.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsTTF;
        options.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsExternalPngFilesReferencedViaSvg;
        options.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedCssOnly;

        pdf.save(htmlFile.getAbsolutePath(), options);

    } catch (Exception e) {
        e.printStackTrace();
        return -1;
    }

    return 0;
}

private static void addDefaultFonts() {
    URL url = AsposePDF.class.getClassLoader().getResource("fonts/100024Fonts");
    if (url == null) {
        return;
    }
    String path = url.getFile();
    for (int i = 1; i <= 10; i++) {
        File file = new File(path, "100024_font" + i + ".ttf");
        if (file.exists()) { // make sure file exists
            System.out.println(file.getAbsolutePath()); // in console log: file path is correct
            FontRepository.addLocalFontPath(file.getAbsolutePath());
        }
    }
}
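One thing that may be worth ruling out in the code above: the support reply's example passes a folder ("path/to/my/folder"), while addDefaultFonts() passes individual .ttf file paths, and URL.getFile() does not decode characters such as %20. The sketch below resolves the resource URL to a folder path using only the standard library. It assumes that FontRepository.addLocalFontPath() expects a directory, which this thread does not confirm, so the Aspose call appears only as a comment. The class and method names are hypothetical.

```java
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: turns a classpath resource URL into a usable folder path.
final class FontFolderResolver {

    // Returns the absolute path of the directory the URL points to,
    // or null if the URL is missing, is not a file: URL, or is not a directory.
    static String resolveFontFolder(URL url) {
        if (url == null || !"file".equals(url.getProtocol())) {
            return null; // e.g. resources packed inside a jar cannot be used as a folder
        }
        try {
            // Going through a URI decodes escapes such as %20, unlike URL.getFile().
            Path dir = Paths.get(url.toURI());
            return Files.isDirectory(dir) ? dir.toAbsolutePath().toString() : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }
}

// Possible use inside addDefaultFonts(), assuming a folder path is accepted:
//   String folder = FontFolderResolver.resolveFontFolder(url);
//   if (folder != null) {
//       FontRepository.addLocalFontPath(folder);
//   }
```

This would not explain the difference between FixedLayout true and false, but it removes one variable from the test.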

I think the key point is options.setFixedLayout(false). If I set FixedLayout to false, spaces are lost and images are missing. But when I set FixedLayout to true, both spaces and images are present.

In my case, I need to set FixedLayout to false in order to keep the paragraph information. So please check whether a bug is causing the spaces and images to be lost.

@titanseason

We have logged a ticket with ID PDFJAVA-38436 to investigate the problem. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

Is there any update on this issue?

@titanseason

Please note that the ticket is logged under the free support model, where tickets are scheduled on a first-come, first-served basis, so its resolution may take several months. We appreciate your patience and understanding in this regard.

Moreover, we also offer Paid Support, where issues are investigated with higher priority. Customers with a paid support subscription report their issues there so that they can be investigated urgently. If your reported issue is a blocker, please consider subscribing to Paid Support. For further information, please visit Paid Support FAQs.