Converting from HTML to Word and then to PDF loses formatting

I am trying to convert the online form submission (HTML) from our application to Word and then to PDF.

  1. HTML to Word: the input elements (checkboxes) are all dropped, among other things.

  2. Word to PDF: the formatting is totally messed up. The table layout does not render colspans; each spanned cell is treated as separate cells and the content is duplicated.

Attached are two files: OriginalResult.html is the file that I would like to convert to Word and PDF, and Result.html is another attempt to overcome the problem.

I would like to work with the original file.

Please assist ASAP, as this is a critical delivery for our application.

Hi Sunil,
Sorry for the delay. The check boxes are missing from the document and the PDF because INPUT tags are not supported right now. We will keep you informed of any developments. In the meantime there is a workaround: you can use the code below. For now it inserts every check box as unchecked; a little more code is needed to retain whether each one is checked.

string html = File.ReadAllText(dataDir + "OriginalResult.html");
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
Regex regex = new Regex("<INPUT.*?( type=[\"']?(?<type>\\w*)[\"']?)?.*?>", RegexOptions.IgnoreCase);
// Replace input tags with placeholders
MatchCollection matches = regex.Matches(html);
foreach (Match m in matches)
{
    string replacement = string.Empty;
    string text = m.ToString();
    // This lookup expects a double-quoted type attribute (type="...")
    string typeText = "type=\"";
    int startPos = text.IndexOf("type=\"") + typeText.Length;
    int endPos = text.IndexOf("\"", startPos);
    string fieldType = text.Substring(startPos, endPos - startPos);
    // By default an INPUT tag is a textbox, so if its type is not specified we should insert a textbox
    if (string.IsNullOrEmpty(fieldType))
        fieldType = "text";
    // Placeholders will look like the following: <span><%text%></span>
    // The <span> tags are needed so that placeholders end up in separate runs
    if (fieldType == "text" || fieldType == "checkbox")
        replacement = string.Format("<span>&lt;%{0}%&gt;</span>", fieldType);
    html = html.Replace(m.Value, replacement);
}
builder.InsertHtml(html);
Regex placeholderRegex = new Regex("<%(?<type>\\w*)%>", RegexOptions.IgnoreCase);
doc.Range.Replace(placeholderRegex, new ReplaceEvaluator(ReplaceEvaluatorinsertFormFields), false);
doc.Save(dataDir + "OriginalResult Out.pdf");

private static ReplaceAction ReplaceEvaluatorinsertFormFields(object sender, ReplaceEvaluatorArgs e)
{
    // This is the Run node that contains the placeholder string
    Node currentNode = e.MatchNode;
    // Create a DocumentBuilder
    DocumentBuilder builder = new DocumentBuilder((Document) currentNode.Document);
    // Move the DocumentBuilder cursor to the matched node and insert the form field
    builder.MoveTo(currentNode);
    // Every form field in a Word document has a bookmark associated with it
    // There cannot be two bookmarks with the same name in the document
    // So we should generate a unique name for each form field
    string name = Guid.NewGuid().ToString("N");
    // Insert the form field
    if (e.Match.Groups["type"].Value == "text")
        builder.InsertTextInput(name, TextFormFieldType.RegularText, string.Empty, " ", 0);
    else if (e.Match.Groups["type"].Value == "checkbox")
        builder.InsertCheckBox(name, false, 10);
    // Signal to the replace engine to replace the matched text
    return ReplaceAction.Replace;
}

This code is a modified version of Alexey’s code found here.
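The workaround above always inserts unchecked boxes. To keep the checked state, the tag text could be inspected before it is replaced. What follows is only a sketch under assumptions: `InputTagInspector`, `IsChecked`, `PlaceholderName` and the `checkbox_checked` placeholder name are hypothetical and not part of Aspose.Words, and the pattern is a heuristic rather than a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class InputTagInspector
{
    // A bare or valued "checked" attribute: preceded by whitespace and
    // followed by whitespace, "/" or ">". Heuristic only: attribute text
    // such as value="is checked now" would be misread as the attribute.
    private static readonly Regex CheckedAttribute = new Regex(
        @"\schecked(\s*=\s*[""']?\w*[""']?)?(?=[\s/>])",
        RegexOptions.IgnoreCase);

    public static bool IsChecked(string inputTag)
    {
        return CheckedAttribute.IsMatch(inputTag);
    }

    // Picks the placeholder name for a matched INPUT tag.
    public static string PlaceholderName(string inputTag, string fieldType)
    {
        if (fieldType == "checkbox" && IsChecked(inputTag))
            return "checkbox_checked"; // hypothetical name; still matched by \w*
        return fieldType;
    }
}
```

In the loop, the placeholder would then be built from `InputTagInspector.PlaceholderName(m.Value, fieldType)`, and the replace callback would call `builder.InsertCheckBox(name, true, 10)` when the captured group equals `checkbox_checked`.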
As for the PDF output appearing messed up, I can’t reproduce that on my side. It looked almost the same as the HTML, even when the check boxes weren’t included. First, make sure you are using the latest version of Aspose.Words (9.1), available from here. Next, see if it looks better after including the check boxes. If it still looks wrong, can you please attach a screenshot?
Also, it may look wrong because the content runs off the page, which is understandable as the content is quite wide. To avoid this, set the pages to landscape. You can use the code below.

// Use the same doc that receives the HTML, and set the orientation before saving
Document doc = new Document();
foreach (Section section in doc.Sections)
{
    section.PageSetup.Orientation = Orientation.Landscape;
}

Thanks,

I am unable to get this to work in VB.NET. I tried to copy the code exactly as above, but quotes are handled differently in VB. Here is my converted code. It doesn’t find any matches, even though there are INPUT tags in the HTML text:

Dim regex As New Regex("<INPUT.*?( type=[""'?(?<type>\w*)[""']?)?.*?>", RegexOptions.IgnoreCase)
Dim matches As MatchCollection = regex.Matches(html)

For Each m As Match In matches
    Dim replacement As String = String.Empty
    Dim text As String = m.ToString
    Dim typeText As String = "type="""
    Dim startPos As Integer = text.IndexOf("type=""") + typeText.Length
    Dim endPos As Integer = text.IndexOf("""", startPos)
    Dim fieldType As String = text.Substring(startPos, endPos - startPos)
    If String.IsNullOrEmpty(fieldType) Then
        fieldType = "text"
    End If
    If (fieldType = "text" Or fieldType = "checkbox") Then
        replacement = String.Format("&al;%{0}%>", fieldType)
    End If
    html = html.Replace(m.Value, replacement)
Next

Hi Dan,
Thanks for your inquiry.
Please use the code below:

Dim regex As New Regex("<INPUT.*?( type=[""']?(?<type>\w*)[""']?)?.*?>", RegexOptions.IgnoreCase)

' Replace input tags with placeholders
Dim matches As MatchCollection = regex.Matches(html)

For Each m As Match In matches
    Dim replacement As String = String.Empty
    ' By default an INPUT tag is a textbox, so if its type is not specified we should insert a textbox
    Dim fieldType As String = m.Groups("type").Value
    If String.IsNullOrEmpty(fieldType) Then
        fieldType = "text"
    End If
    ' Placeholders will look like the following: <span><%text%></span>
    ' The <span> tags are needed so that placeholders end up in separate runs
    If fieldType = "text" OrElse fieldType = "checkbox" Then
        replacement = String.Format("<span>&lt;%{0}%&gt;</span>", fieldType)
    End If
    html = html.Replace(m.Value, replacement)
Next m
builder.InsertHtml(html)
Dim placeholderRegex As New Regex("<%(?<type>\w*)%>", RegexOptions.IgnoreCase)
doc.Range.Replace(placeholderRegex, New HandleInputTags(), False)

Public Class HandleInputTags
    Implements IReplacingCallback

    Private Function IReplacingCallback_Replacing(ByVal args As ReplacingArgs) As ReplaceAction Implements IReplacingCallback.Replacing
        ' This is the Run node that contains the placeholder string
        Dim currentNode As Node = args.MatchNode

        ' Create a DocumentBuilder
        Dim builder As New DocumentBuilder(CType(currentNode.Document, Document))
        ' Move the DocumentBuilder cursor to the matched node and insert the form field
        builder.MoveTo(currentNode)
        ' Every form field in a Word document has a bookmark associated with it
        ' There cannot be two bookmarks with the same name in the document
        ' So we should generate a unique name for each form field
        Dim name As String = Guid.NewGuid().ToString("N")

        ' Insert the form field
        If args.Match.Groups("type").Value = "text" Then
            builder.InsertTextInput(name, TextFormFieldType.RegularText, String.Empty, " ", 0)
        ElseIf args.Match.Groups("type").Value = "checkbox" Then
            builder.InsertCheckBox(name, False, 10)
        End If
        ' Signal to the replace engine to replace the matched text
        Return ReplaceAction.Replace
    End Function
End Class

Thanks,

Still not working. It correctly finds 5 matches; however, the type captured for each match is an empty string. It’s as if it finds the INPUT tags but doesn’t recognize the “type=checkbox”. I have 5 checkboxes in the HTML. Does the regular expression need to be adjusted further? Here is what the checkbox input looks like in the HTML:

Hi Dan,
Thank you for the additional information. I modified the regular expression a bit. Please try using this one:

Dim regex As New Regex("<INPUT.*?( type=[""']?(?<type>\w*)[""']?).*?>", RegexOptions.IgnoreCase)

It now requires the ‘type’ attribute to be present. It seems to work properly with your document.
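For anyone wondering why the optional group captured nothing: when another attribute comes before `type`, the lazy `.*?` matches zero characters, the optional group fails at that position and is simply skipped, and the rest of the tag is consumed by the trailing `.*?`. Making the group required forces the engine to advance until it reaches ` type=`. A small sketch with plain .NET regexes, assuming a hypothetical sample tag (the tag from the actual document is not shown in this thread) and a hypothetical `TypeGroupDemo` helper:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TypeGroupDemo
{
    // Returns the value captured by the named group "type" in the first match,
    // or null when the pattern does not match at all.
    public static string CaptureType(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["type"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical tag: another attribute comes before "type".
        string html = "<INPUT name=\"agree\" type=checkbox>";
        string optional = "<INPUT.*?( type=[\"']?(?<type>\\w*)[\"']?)?.*?>";
        string required = "<INPUT.*?( type=[\"']?(?<type>\\w*)[\"']?).*?>";

        Console.WriteLine("optional: '" + CaptureType(optional, html) + "'");
        Console.WriteLine("required: '" + CaptureType(required, html) + "'");
    }
}
```

With this sample tag, the optional pattern prints an empty value and the required pattern prints `checkbox`. The trade-off is that INPUT tags without any `type` attribute no longer match, so they are left in the HTML untouched.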
Please let me know if you need more assistance, I will be glad to help you.
Best regards,

Works great now, thanks!

The issues you have found earlier (filed as 981) have been fixed in this update.