Convert Word DOCX Document to PDF using Java & Make Text Content Selectable & Retain Fonts Formatting

Hi there,

We have been trying to render a Word document - with embedded PDF charts/visuals in it - to a final exported PDF. When we convert the Word document to PDF via desktop Word, the PDF comes out with the embedded components as vectors and, importantly, with selectable text. This matters to us for PDF accessibility. However, when we render the PDF via Aspose.Words, we see different and incorrect output.

  • Sometimes the embedded PDF text content is converted to curves and is no longer selectable or readable by a screen reader in the PDF.
  • Sometimes the embedded PDF text content remains selectable and readable, but the font is discarded.

We need to match the output from Word in the sense that embedded PDF text content in Word is represented in the final exported PDF with:

  • Selectable/readable text
  • The correct font.

The code used to generate the pdf is as follows:

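import com.aspose.words.Document;
import com.aspose.words.PdfSaveOptions;

// Converts the Word document at `filename` to PDF using default save options.
private static void wordToPDF(String filename) {
    try {
        Document docInput1 = new Document(filename);
        PdfSaveOptions opts = new PdfSaveOptions();
        // The output is written next to the input, with "-output.pdf" appended.
        docInput1.save(filename + "-output.pdf", opts);
    } catch (Exception e) {
        e.printStackTrace();
    }
}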

I’ve attached the files below:

aspose-selectable-text.zip (2.4 MB)

@kurtosys,

I have converted the source Word document “Doc5.docx” to PDF format by using MS Word 2019 (on Windows 10) and the latest (20.10) version of Aspose.Words for Java and attached the PDF files here for your reference:

Do you still see the same problem(s) in the PDF generated by version 20.10? If yes, then please create and attach a comparison screenshot that highlights the problematic area(s) in the Aspose.Words 20.10 generated PDF (with respect to the MS Word 2019 generated PDF). I will then investigate the issue further and provide you with more information.

The text is not selectable in either the Word or the Aspose produced file you provided. You use Word 2019 on Windows, correct? We used Word 16.37 on Mac OS X to convert the DOCX to PDF on our end. The PDF produced by that version has all its text selectable.

@kurtosys,

We have logged this problem in our issue tracking system. Your ticket number is WORDSNET-21310. We will further look into the details of this problem and will keep you updated on the status of the linked issue. We apologize for any inconvenience.

Thank you. We will check in periodically for updates.

Any further feedback on this issue?

@kurtosys,

Unfortunately, WORDSNET-21310 is not resolved yet. The issue is currently pending analysis and is in the queue. We will inform you via this forum thread as soon as it is resolved. We apologize for the inconvenience.

Any feedback on the progress of this issue?

@kurtosys,

I am afraid your issue is still pending analysis and is in the queue. Please allow us some time. We will inform you via this forum thread as soon as it is resolved. We apologize for any inconvenience.

@kurtosys,

This problem (WORDSNET-21310) actually requires us to implement a new feature in the Aspose.Words API, and we regret to share that the implementation of this feature has been postponed to a later date. The fix may come onto the product roadmap in the future, but we cannot currently promise a resolution date (ETA). We apologize for the inconvenience. Please check the following analysis details:

The images in the document are in EMF format with embedded PDF documents (inside an EMR_COMMENT_MULTIFORMATS record). The text is not selectable because the EMF part of the images contains vector graphics instead of text.
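To check whether a given image really carries such a record, a minimal sketch along the following lines may help. It walks the EMF record stream and reports the byte offsets of public comment records tagged as EMR_COMMENT_MULTIFORMATS. The class and method names are hypothetical, and the numeric constants are taken from our reading of the MS-EMF specification, so please verify them against the specification before relying on them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aspose.Words.
final class EmfCommentScan {
    // Constants assumed from the MS-EMF specification.
    static final int EMR_COMMENT = 0x46;               // record type of comment records
    static final int EMR_EOF = 0x0E;                   // record type that ends the metafile
    static final int PUBLIC_COMMENT_ID = 0x43494447;   // "GDIC"
    static final int COMMENT_MULTIFORMATS = 0x40000004;

    // Returns the byte offsets of all EMR_COMMENT_MULTIFORMATS records.
    static List<Integer> findMultiformatsOffsets(byte[] emf) {
        List<Integer> offsets = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        // Every EMF record starts with a 4-byte type and a 4-byte total size.
        while (pos + 8 <= emf.length) {
            int type = buf.getInt(pos);
            int size = buf.getInt(pos + 4);
            if (size < 8 || size > emf.length - pos) {
                break; // malformed or truncated record, stop walking
            }
            // Layout of a public comment record:
            // type, size, dataSize, commentIdentifier, publicCommentIdentifier
            if (type == EMR_COMMENT && size >= 20
                    && buf.getInt(pos + 12) == PUBLIC_COMMENT_ID
                    && buf.getInt(pos + 16) == COMMENT_MULTIFORMATS) {
                offsets.add(pos);
            }
            if (type == EMR_EOF) {
                break;
            }
            pos += size;
        }
        return offsets;
    }
}
```

An empty result for an image suggests that it has no embedded alternate format, so the non-selectable text in it would have a different cause.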

The only way to make the text selectable is to process the embedded PDF document. Unfortunately, we cannot currently start work on parsing and rendering the embedded PDF document. As a workaround, you can try to extract the PDF document from the metafile (it seems relatively easy, and we can help you implement this) and then convert it to vector graphics via third-party libraries (like Aspose.PDF).
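The extraction step of this workaround could be sketched roughly as follows. This is a heuristic, not a proper EMF parser: it assumes the embedded PDF is stored as one contiguous run of bytes inside the metafile and simply copies everything from the first "%PDF-" marker up to and including the last "%%EOF" marker. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Aspose.Words.
final class EmbeddedPdfExtractor {
    private static final byte[] PDF_HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_EOF = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // Returns the bytes of the embedded PDF, or null if no PDF markers are found.
    // Assumption: the metafile holds a single, contiguous PDF.
    static byte[] extractPdf(byte[] metafile) {
        int start = indexOf(metafile, PDF_HEADER);
        if (start < 0) {
            return null;
        }
        int end = lastIndexOf(metafile, PDF_EOF);
        if (end < start) {
            return null;
        }
        return Arrays.copyOfRange(metafile, start, end + PDF_EOF.length);
    }

    // First position of pattern in data, or -1.
    private static int indexOf(byte[] data, byte[] pattern) {
        for (int i = 0; i <= data.length - pattern.length; i++) {
            if (matches(data, pattern, i)) {
                return i;
            }
        }
        return -1;
    }

    // Last position of pattern in data, or -1.
    private static int lastIndexOf(byte[] data, byte[] pattern) {
        for (int i = data.length - pattern.length; i >= 0; i--) {
            if (matches(data, pattern, i)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matches(byte[] data, byte[] pattern, int at) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[at + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }
}
```

The metafile bytes would come from each image in the document; with Aspose.Words, something like shape.getImageData().getImageBytes() on every Shape node should provide them, though that part is not shown here. A more robust version would read the offset and size of the PDF from the EMR_COMMENT_MULTIFORMATS record instead of searching for markers.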