Underline support in markdown

lecoye4578 · September 28, 2023, 6:01am

We have HTML code that contains underlined words. We need to convert it to Markdown, but we discovered that underline is not supported. Please tell me how we can do this?

alexey.noskov · September 28, 2023, 7:23am

@lecoye4578 Markdown doesn’t have a defined syntax for underlined text, so underlined text is not preserved in Markdown. Could you please let us know how you would like to export underlined text to Markdown?

lecoye4578 · September 28, 2023, 7:34am

Different flavors of Markdown, for example in Git, have their own treatments for this. For example, we want to write ++underline text++ and have it treated as underlined text. Please tell me how best we can do this using Aspose.Words?

alexey.noskov · September 28, 2023, 8:02am

@lecoye4578 We will consider adding a feature to specify custom tags for basic formatting when exporting a document to Markdown.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-25997

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

At the moment, you can use the following code:

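// Requires: using System.Collections.Generic; using System.Linq; using Aspose.Words;
Document doc = new Document(@"C:\Temp\in.html");
// Collect the underlined runs first, so the node collection is not modified while it is enumerated.
List<Run> underlinedRuns = doc.GetChildNodes(NodeType.Run, true).Cast<Run>()
    .Where(r => r.Font.Underline != Underline.None).ToList();
foreach (Run r in underlinedRuns)
{
    // Surround each underlined run with the custom "++" markers.
    r.ParentNode.InsertBefore(new Run(doc, "++"), r);
    r.ParentNode.InsertAfter(new Run(doc, "++"), r);
}
doc.Save(@"C:\Temp\out.md");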
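
One thing to watch with the code above: every underlined run gets its own pair of markers, so if the underlined text happens to be split across several adjacent runs, the output would contain doubled markers such as ++under++++lined++. The following is only a sketch of the grouping idea on plain tuples, not Aspose.Words code; WrapUnderlined and the (Text, Underlined) tuples are hypothetical stand-ins for the document runs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for document runs: (text, isUnderlined).
// Emits one marker whenever the underline state changes, so each
// maximal stretch of underlined runs gets a single pair of markers.
static string WrapUnderlined(IEnumerable<(string Text, bool Underlined)> runs, string marker = "++")
{
    var sb = new StringBuilder();
    bool open = false;
    foreach (var (text, underlined) in runs)
    {
        if (underlined != open)
        {
            sb.Append(marker);
            open = underlined;
        }
        sb.Append(text);
    }
    // Close the markers if the last run was underlined.
    if (open) sb.Append(marker);
    return sb.ToString();
}

var sampleRuns = new List<(string Text, bool Underlined)>
{
    ("plain ", false), ("under", true), ("lined", true), (" text", false)
};
Console.WriteLine(WrapUnderlined(sampleRuns)); // plain ++underlined++ text
```

In a real document the same idea would mean inserting the opening marker only before the first run of an underlined stretch and the closing marker only after the last one.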

lecoye4578 · September 28, 2023, 11:34am

Thank you, this is a good example. But what should we do if we have Markdown with an underline and need to convert it to HTML?

alexey.noskov · September 28, 2023, 1:34pm

@lecoye4578 You can use find/replace functionality to search for text between custom tags and apply custom formatting. For example, the following MD:

This some **md** with *formatting* and ++underlined++ text

Processed by the following code:

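// Requires: using System.Text.RegularExpressions; using Aspose.Words; using Aspose.Words.Replacing;
Document doc = new Document(@"C:\Temp\in.md");

FindReplaceOptions opt = new FindReplaceOptions();
// Enable $1-style substitutions in the replacement string.
opt.UseSubstitutions = true;
// Apply underline to the replaced text.
opt.ApplyFont.Underline = Underline.Single;
// Replace "++text++" with "text", dropping the markers.
doc.Range.Replace(new Regex(@"\+\+([^\+]+)\+\+"), "$1", opt);

doc.Save(@"C:\Temp\out.html");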

Will produce the following HTML: out.zip (445 Bytes)
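
To see what the regular expression above captures, here is a small sketch that applies the same pattern to a plain string with System.Text.RegularExpressions only. It does not use Aspose.Words, and writing the result as a u element is just an assumption made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above: "++", one or more characters other than "+", "++".
var underlinePattern = new Regex(@"\+\+([^\+]+)\+\+");

string sampleMd = "This some **md** with *formatting* and ++underlined++ text";

// $1 refers to the captured group, that is, the text between the markers.
string replaced = underlinePattern.Replace(sampleMd, "<u>$1</u>");
Console.WriteLine(replaced);
// This some **md** with *formatting* and <u>underlined</u> text

// Because the group excludes "+", text with a plus sign inside the markers is not matched.
Console.WriteLine(underlinePattern.IsMatch("++a+b++")); // False
```

If underlined text may itself contain a plus sign, the pattern would need to be relaxed, for example with a lazy quantifier.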

lecoye4578 · October 5, 2023, 9:41am

Thank you very much for the answers, they helped a lot.
I have another problem. I convert clean HTML into Markdown, and the resulting Markdown is just as clean. But after converting the Markdown back to HTML, I get a very large HTML document with many different styles, which editors later display incorrectly. Can I get the same clean HTML?

Html to md

<ol><li>123</li><li>456</li></ol><ul><li>123</li><li>456</li></ul>

Md

Md to html

<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="Content-Style-Type" content="text/css" /><meta name="generator" content="Aspose.Words for .NET 21.1.0" /><title></title></head><body style="font-family:Calibri; font-size:11pt"><div><ol type="1" style="margin:0pt; padding-left:0pt"><li style="margin-left:31.35pt; margin-bottom:8pt; padding-left:4.65pt"><span>123</span></li><li style="margin-left:31.35pt; margin-bottom:8pt; padding-left:4.65pt"><span>456</span></li></ol><p style="margin-top:0pt; margin-left:36pt; margin-bottom:8pt; text-indent:-18pt; -aw-import:list-item; -aw-list-level-number:0; -aw-list-number-format:'-'; -aw-list-number-styles:'bullet'; -aw-list-padding-sml:11.4pt"><span style="-aw-import:ignore"><span style="font-family:'Courier New'">-</span><span style="font:7pt 'Times New Roman'; -aw-import:spaces">&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0; </span></span><span>123</span></p><p style="margin-top:0pt; margin-left:36pt; margin-bottom:8pt; text-indent:-18pt; -aw-import:list-item; -aw-list-level-number:0; -aw-list-number-format:'-'; -aw-list-number-styles:'bullet'; -aw-list-padding-sml:11.4pt"><span style="-aw-import:ignore"><span style="font-family:'Courier New'">-</span><span style="font:7pt 'Times New Roman'; -aw-import:spaces">&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0;&#xa0; </span></span><span>456</span></p></div></body></html>

alexey.noskov · October 5, 2023, 1:08pm

@lecoye4578 When a Markdown document is read, it is loaded into the Aspose.Words DOM, which is designed to work with MS Word documents. Upon saving to HTML, Aspose.Words exports the document as if the source were an MS Word document. That is why the output HTML has a lot of styles. I am afraid there is no direct way to export the document as “clean” HTML. However, you can implement your own exporter using DocumentVisitor. For example, here is a simplified converter:

Document doc = new Document(@"C:\Temp\in.md");
doc.UpdateListLabels();
DocumentToSimpleHtmlConverter converter = new DocumentToSimpleHtmlConverter();
doc.Accept(converter);
Console.WriteLine(converter.GetHtml());

internal class DocumentToSimpleHtmlConverter : DocumentVisitor
{
    public DocumentToSimpleHtmlConverter()
    {
        Reset();
    }

    public void Reset()
    {
        mBuilder = new StringBuilder();
    }

    public string GetHtml()
    {
        return mBuilder.ToString();
    }

    public override VisitorAction VisitDocumentStart(Document doc)
    {
        mBuilder.Append("<html>");
        mBuilder.Append("<body>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitDocumentEnd(Document doc)
    {
        mBuilder.Append("</body>");
        mBuilder.Append("</html>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitBodyStart(Body body)
    {
        mBuilder.Append("<div>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitBodyEnd(Body body)
    {
        mBuilder.Append("</div>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitParagraphStart(Paragraph paragraph)
    {
        Paragraph prevParagraph = paragraph.PreviousSibling as Paragraph;
        if (paragraph.IsListItem)
        {
            if (prevParagraph == null ||
                !prevParagraph.IsListItem ||
                prevParagraph.ListFormat.List != paragraph.ListFormat.List ||
                prevParagraph.ListFormat.ListLevelNumber != paragraph.ListFormat.ListLevelNumber)
            {
                // There might be other bullet types; for demonstration purposes, only simple bullets are checked here.
                // Note: Document.UpdateListLabels() must be called before paragraph.ListLabel is used.
                if (paragraph.ListLabel.LabelString == "\x2022" || paragraph.ListLabel.LabelString == "-")
                    mBuilder.Append("<ul>");
                else
                    mBuilder.Append("<ol>");
            }
        }

        mBuilder.Append(paragraph.IsListItem ? "<li>" : "<p>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitParagraphEnd(Paragraph paragraph)
    {
        mBuilder.Append(paragraph.IsListItem ? "</li>" : "</p>");

        Paragraph nextParagraph = paragraph.NextSibling as Paragraph;
        if (paragraph.IsListItem)
        {
            if (nextParagraph == null ||
                !nextParagraph.IsListItem ||
                nextParagraph.ListFormat.List != paragraph.ListFormat.List ||
                nextParagraph.ListFormat.ListLevelNumber != paragraph.ListFormat.ListLevelNumber)
            {
                // There might be other bullet types; for demonstration purposes, only simple bullets are checked here.
                // Note: Document.UpdateListLabels() must be called before paragraph.ListLabel is used.
                if (paragraph.ListLabel.LabelString == "\x2022" || paragraph.ListLabel.LabelString == "-")
                    mBuilder.Append("</ul>");
                else
                    mBuilder.Append("</ol>");
            }
        }

        return VisitorAction.Continue;
    }

    public override VisitorAction VisitRun(Run run)
    {
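        mBuilder.Append("<span>");
        // Escape characters such as <, > and & so that run text cannot break the markup.
        mBuilder.Append(System.Net.WebUtility.HtmlEncode(run.Text));
        mBuilder.Append("</span>");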
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitTableStart(Table table)
    {
        mBuilder.Append("<table>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitTableEnd(Table table)
    {
        mBuilder.Append("</table>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitRowStart(Row row)
    {
        mBuilder.Append("<tr>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitRowEnd(Row row)
    {
        mBuilder.Append("</tr>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitCellStart(Cell cell)
    {
        mBuilder.Append("<td>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitCellEnd(Cell cell)
    {
        mBuilder.Append("</td>");
        return VisitorAction.Continue;
    }

    // Override other VisitXXX methods to export the whole document structure.

    private StringBuilder mBuilder;
}

Here is the output produced by this code:

<html>
<body>
    <div>
        <ol>
            <li><span>123</span></li>
            <li><span>456</span></li>
        </ol><ul>
            <li><span>123</span></li>
            <li><span>456</span></li>
        </ul>
    </div>
</body>
</html>

lecoye4578 · October 11, 2023, 1:28pm

Good afternoon. We have a problem when converting from Markdown to Word and RTF. We have various texts that contain empty lines; an example is given below. After conversion, no empty lines are left and indents have been generated instead, which is critical for us. If there were more than 2 indents, they disappear and only one indent remains. Please tell me how to solve this problem?

Markdown

Some text start

1 empty line



3 empty lines end

alexey.noskov · October 11, 2023, 2:50pm

@lecoye4578
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-26064

lecoye4578 · October 12, 2023, 11:04am

Can I do this without implementing my own converter? For example, by using character replacement?

My research showed that a new paragraph is created only once, regardless of how many times a new line appears in the text.

alexey.noskov · October 12, 2023, 12:23pm

@lecoye4578 As a temporary workaround, you can put some sequence into the empty lines to force Aspose.Words to preserve them, and then remove it from the resulting document.

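// Requires: using System.IO; using System.Text; using Aspose.Words;
const string dummySequence = "THISISNOTEMPTYLINE";
string md = File.ReadAllText(@"C:\Temp\in.md");
// Put the placeholder before every CRLF line ending so that no line is empty (assumes CRLF line endings).
md = md.Replace("\r\n", dummySequence + "\r\n");

using (MemoryStream mdStream = new MemoryStream(Encoding.UTF8.GetBytes(md)))
{
    // Tell Aspose.Words explicitly that the stream contains Markdown.
    LoadOptions opt = new LoadOptions();
    opt.LoadFormat = LoadFormat.Markdown;
    Document doc = new Document(mdStream, opt);
    // Remove the placeholder from the loaded document.
    doc.Range.Replace(dummySequence, "");
    doc.Save(@"C:\Temp\out.docx");
}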
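
The snippet above puts the placeholder before every CRLF line ending, including the lines that already contain text. A variant that touches only the lines that are actually empty, and that also accepts LF-only files, might look like the sketch below. MarkEmptyLines is a hypothetical helper built on System.Text.RegularExpressions, not part of Aspose.Words, and this sketch only prepares the text; loading it and removing the placeholder would stay as in the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

const string placeholder = "THISISNOTEMPTYLINE";

// Inserts the marker at the start of every line that is immediately
// followed by a line ending ("\n" or "\r\n"), i.e. every empty line.
// The line endings themselves are left untouched.
string MarkEmptyLines(string text, string marker) =>
    Regex.Replace(text, @"^(?=\r?\n)", _ => marker, RegexOptions.Multiline);

string sampleText = "Some text start\r\n\r\n1 empty line\r\n\r\n\r\n\r\n3 empty lines end";
string marked = MarkEmptyLines(sampleText, placeholder);

// Count how many placeholders were inserted: one per empty line.
Console.WriteLine(Regex.Matches(marked, placeholder).Count); // 4
```

As with the original workaround, the marked text no longer contains blank lines, so it is worth checking how your Aspose.Words version groups the lines into paragraphs after loading.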

lecoye4578 · October 17, 2023, 8:04am

Everything above worked for me, thank you, although I also had to write custom casts for new lines. I still have one last problem: I have text that is bold, italic, and underlined at the same time. When the underline tags are processed, the text that was bold italic disappears. I noticed that our text is split into several nodes, and I think this is why everything that was in them is erased. How can I solve this problem?

Example MD:
123 ^^***123 123***^^ 123

alexey.noskov · October 17, 2023, 8:52am

@lecoye4578 If you have control over MD creation process, you can resolve the problem by changing the order of tags like this:

123 ***^^123 123^^*** 123

lecoye4578 · October 17, 2023, 9:29am

Works fine, thanks a lot.

lecoye4578 · October 27, 2023, 11:38am

Good afternoon. We noticed a lot of problems when converting RTF to MD format and in the other direction. One of the problems is that spacing is not preserved after conversion. I made an example below. In the first case, I entered 123 and pressed Enter 3 times; in the second case, I also entered 123 and pressed Shift+Enter 3 times. How can we fix this problem?

We also noticed that you do not have support for \line when converting to MD and because of this the translation is not correct. Since in RTF there are \par and \line, but this support is in MD. How can we solve this problem?

Enter:
"123\r\n\r\n\r\n\r\n"

Shift+Enter
"123\r\n"

alexey.noskov · October 27, 2023, 12:03pm

@lecoye4578 This is specific to the Markdown format. A soft line break (Shift+Enter) is exported as \r\n, and a paragraph break (Enter) is exported as \r\n\r\n. Markdown consumers ignore redundant empty lines, so, for example, \r\n\r\n\r\n\r\n is still interpreted as a single paragraph break. That is why Aspose.Words does not export empty paragraphs to Markdown. Also, note that the Markdown format is quite limited compared to RTF or other MS Word formats.
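
The collapsing of empty lines described above can be illustrated with a simplified sketch. SplitParagraphs is a hypothetical helper that treats any run of two or more line endings as one paragraph break; it is not how Aspose.Words or any particular Markdown parser is implemented, and real parsers also treat whitespace-only lines as blank.

```csharp
using System;
using System.Text.RegularExpressions;

// Simplified paragraph splitting: leading and trailing line endings are
// trimmed, and any run of two or more line endings is one paragraph break.
static string[] SplitParagraphs(string md) =>
    Regex.Split(md.Trim('\r', '\n'), @"(?:\r?\n){2,}");

Console.WriteLine(SplitParagraphs("123\r\n\r\nabc").Length);         // 2
Console.WriteLine(SplitParagraphs("123\r\n\r\n\r\n\r\nabc").Length); // 2
Console.WriteLine(SplitParagraphs("123\r\nabc").Length);             // 1
```

The first two inputs give the same two paragraphs, which is why the extra empty lines carry no information for a Markdown consumer; the single \r\n in the last input stays inside one paragraph.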

lecoye4578 · October 28, 2023, 12:34am

Yes, but the question is, our empty lines that we need are missing, and they are critical for our documents. How can I make sure they don’t disappear?

alexey.noskov · October 28, 2023, 5:18am

@lecoye4578
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-26148

aspose.notifier · December 29, 2023, 12:48pm

The issues you found earlier (filed as WORDSNET-25997) have been fixed in the Aspose.Words for .NET 24.1 update, which is also available on NuGet.