Parse the full document to markdown with style and formatting

We need to convert a DOCX file to Markdown so that we can process it with its formatting intact. However, the conversion does not preserve the original numbering of lists or nested lists: the library writes every number as 1. That is acceptable when the output goes to a Markdown renderer, which renumbers the items itself.

Because we work with the raw Markdown text and need accurate formatting, we require the original numbering in the Markdown text itself in order to track changes. Is there a way to do this?

The current code we use is:

import com.aspose.words.*;

public class ExMarkdownSaveOptions {

    public void convertDocxToMarkdownWithFormatting() throws Exception {
        // Load your DOCX file
        Document doc = new Document("input.docx");

        // Set up Markdown save options to preserve styles, numbering, and formatting
        MarkdownSaveOptions options = new MarkdownSaveOptions();
        options.setListExportMode(MarkdownListExportMode.MARKDOWN_SYNTAX); // Preserves numbering and bullets
        options.setTableContentAlignment(TableContentAlignment.AUTO); // Preserves table alignment
        options.setExportImagesAsBase64(true); // Embeds images as base64
        options.setExportHeadersFootersMode(ExportHeadersFootersMode.PER_SECTION); // Exports headers/footers
        options.setPrettyFormat(true); // Makes output more readable

        // Save as Markdown
        doc.save("output.md", options);
    }

    public static void main(String[] args) {
        try {
            new ExMarkdownSaveOptions().convertDocxToMarkdownWithFormatting();
            System.out.println("Conversion complete.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

How do we preserve all the lettering and numbering of the lists when converting to Markdown, as in this text for example?

1. All information disclosed by one party to the other to evaluate the Project and/or Purpose that is designated in writing as “Confidential” at the time of disclosure or if disclosed orally is designated in writing as “Confidential” within fifteen (15) days of disclosure is “Confidential Information.”  Confidential Information does not include information which:

(a) was known by the receiving party prior to receipt from the disclosing party;
(b) is generally available in the public domain or thereafter becomes available to the public through no act of the receiving party;
(c) is independently discovered by an employee, agent, or representative of the receiving party who had no knowledge of the Confidential Information disclosed; or
(d) is made available to the receiving party as a matter of lawful right by a third party.

2. The receiving party agrees to disclose Confidential Information only to their respective employees, agents, or representatives who have been determined to have a need to know and have been advised of their obligation to comply with the terms of this Agreement.  To the extent allowed by the law applicable to the receiving party, the receiving party will be liable for any breach of this agreement by any of its employees, agents, affiliates or representatives that receive access to the Co

@ianjwilson

To preserve the original numbering and formatting of lists when converting a DOCX file to Markdown using Aspose.Words, you can utilize the MarkdownSaveOptions class effectively. Your current implementation is on the right track, but you need to ensure that the ListExportMode is set correctly to maintain the numbering format.

Here’s how you can modify your existing code to ensure that the list numbering is preserved:

import com.aspose.words.*;

public class ExMarkdownSaveOptions {

    public void convertDocxToMarkdownWithFormatting() throws Exception {
        // Load your DOCX file
        Document doc = new Document("input.docx");

        // Set up Markdown save options to preserve styles, numbering, and formatting
        MarkdownSaveOptions options = new MarkdownSaveOptions();
        options.setListExportMode(MarkdownListExportMode.MARKDOWN_SYNTAX); // Use Markdown syntax for lists
        options.setTableContentAlignment(TableContentAlignment.AUTO); // Preserves table alignment
        options.setExportImagesAsBase64(true); // Embeds images as base64
        options.setExportHeadersFootersMode(ExportHeadersFootersMode.PER_SECTION); // Exports headers/footers
        options.setPrettyFormat(true); // Makes output more readable

        // Save as Markdown
        doc.save("output.md", options);
    }

    public static void main(String[] args) {
        try {
            new ExMarkdownSaveOptions().convertDocxToMarkdownWithFormatting();
            System.out.println("Conversion complete.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Key Points:

  1. List Export Mode: Ensure that you set options.setListExportMode(MarkdownListExportMode.MARKDOWN_SYNTAX);. This setting is crucial as it allows the library to export lists in a way that preserves their numbering and bullet points in Markdown format.

  2. Testing the Output: After running the conversion, check the output.md file to verify that the lists are formatted correctly. The numbering should reflect the original structure from the DOCX file.

  3. Handling Nested Lists: If you have nested lists, the Markdown syntax should automatically handle them correctly as long as the MarkdownListExportMode is set to MARKDOWN_SYNTAX.

By following these steps, you should be able to maintain the original numbering and formatting of lists in your Markdown output. If you encounter any specific issues with certain documents, consider checking the structure of those documents to ensure they are formatted correctly in DOCX before conversion.
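The check described in point 2 can also be done programmatically rather than by eye. The following is only a minimal sketch in plain Java that does not use any Aspose.Words API: the class name ListMarkerCheck and its methods are hypothetical helpers, and the pattern is a simplification that only recognises lines starting with digits followed by "." or ")". It collects the numbers written in front of ordered list items so they can be compared with the numbering in the DOCX.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class ListMarkerCheck {

    // Optional indentation, then 1-9 digits, then "." or ")", then whitespace.
    private static final Pattern ORDERED_MARKER =
            Pattern.compile("^\\s*(\\d{1,9})[.)]\\s");

    // Returns the number written in front of every ordered list item, in order.
    static List<Integer> orderedMarkers(String markdown) {
        List<Integer> numbers = new ArrayList<>();
        for (String line : markdown.split("\\R")) {
            Matcher m = ORDERED_MARKER.matcher(line);
            if (m.find()) {
                numbers.add(Integer.parseInt(m.group(1)));
            }
        }
        return numbers;
    }

    // True when at least one marker exists and every marker is 1,
    // i.e. the original numbering is not present in the text.
    static boolean allMarkersAreOne(String markdown) {
        List<Integer> numbers = orderedMarkers(markdown);
        if (numbers.isEmpty()) {
            return false;
        }
        for (int n : numbers) {
            if (n != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String sample = "1. First clause\n1. Second clause\n\n   1. Nested item\n";
        System.out.println(orderedMarkers(sample));
        System.out.println(allMarkersAreOne(sample));
    }
}
```

To run it against a real file, the contents of output.md would be read into a string first and passed to orderedMarkers.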

If you have further questions or need additional assistance, feel free to ask!

The export mode is already set correctly in our code, along with all the other options we need. However, the lists are still not exported properly.

@ianjwilson This is the expected way to export list labels to Markdown: it is the Markdown interpreter's responsibility to recalculate the actual list labels. To preserve their values in the output Markdown, you can try replacing the list labels with plain text:

Document doc = new Document("C:\\Temp\\in.docx");

// Update list labels.
doc.updateListLabels();

// Convert list items into regular paragraphs with leading text that imitates numbering.
for (Paragraph p : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (p.isListItem())
    {
        String label = p.getListLabel().getLabelString() + "\t";
        Run fakeListLabelRun = new Run(doc, label);
        p.getListFormat().removeNumbers();
        p.prependChild(fakeListLabelRun);
    }
}

doc.save("C:\\Temp\\out.md");
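To illustrate why exporting every label as 1 is harmless for a renderer but not for anyone reading the raw text, here is a rough sketch of the renumbering a CommonMark-style renderer performs: the first item's number sets the start of the list and the numbers on the following items are ignored. This is plain Java and not Aspose.Words code; the class and method names are hypothetical, and the handling is deliberately simplified (top-level items only, and any non-blank line that is not a list item ends the list).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Aspose.Words.
final class OrderedListRenumberSketch {

    // A top-level ordered list marker such as "1." or "7)" followed by spaces.
    private static final Pattern MARKER = Pattern.compile("^(\\d{1,9})[.)] +(.*)$");

    // Returns the lines with the labels a renderer would display.
    static List<String> renderedLabels(List<String> markdownLines) {
        List<String> result = new ArrayList<>();
        int next = -1; // -1 means "not inside a list"
        for (String line : markdownLines) {
            Matcher m = MARKER.matcher(line);
            if (m.matches()) {
                if (next < 0) {
                    // Only the first item's number is honoured.
                    next = Integer.parseInt(m.group(1));
                }
                result.add(next + ". " + m.group(2));
                next++;
            } else {
                result.add(line);
                if (!line.trim().isEmpty()) {
                    next = -1; // simplification: other text ends the list
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> exported = Arrays.asList("1. First clause", "1. Second clause", "1. Third clause");
        for (String line : renderedLabels(exported)) {
            System.out.println(line);
        }
    }
}
```

Because the raw text is never passed through this renumbering step when it is processed as text, the workaround above writes the real labels into the paragraphs before saving.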