Convert DOC/DOCX to HTML - problem with WORDART & TEXTBOXES

Hi,

We are using Aspose.Words for .NET version 9.

We need to convert DOC/DOCX to HTML. Our documents contain WordArt objects and text boxes,
which are lost on conversion.

I am using ExportImageSavingEventHandler to handle the images, but it does not help with WordArt objects and text boxes.

Please assist.

Thanks.

Hi

Thanks for your request. Aspose.Words does not support WordArt objects when exporting to HTML. Your request has been linked to the appropriate issue. You will be notified as soon as this feature is supported.
Regarding text boxes, please attach your document here. I will check the issue and provide you with more information.
Also, I think information provided here could be useful for you:
https://docs.aspose.com/words/net/save-in-html-xhtml-mhtml-formats/
Best regards.

I have attached a file with a text box example.
Please check this issue.

Thanks.

Hi

Thanks for your request. Unfortunately, Aspose.Words does not support exporting TextBox shapes into HTML. Currently, text from TextBox shapes is exported as ordinary text. Your request has been linked to the appropriate issue. You will be notified as soon as it is resolved.
Best regards.

Thanks for the reply.

Yes, I see that the text box text is exported as expected.
The problem is that the text box border is also exported and placed next to the text.
So if I had text inside a text box in Word, I get an empty frame with the text next to it.

How can I avoid getting the empty frame?

Thanks.

Hi

Thanks for your inquiry. You can try using the following code to remove textboxes:

// Open document.
Document doc = new Document(@"Test001\test.doc");
// Get all shapes.
Node[] shapes = doc.GetChildNodes(NodeType.Shape, true).ToArray();
// Loop over all shapes and remove textboxes.
foreach(Shape shape in shapes)
{
    if (shape.ShapeType == ShapeType.TextBox)
    {
        // Insert all child nodes of the textbox after the textbox.
        CompositeNode parentNode = shape.ParentNode;
        while (shape.HasChildNodes)
            parentNode.ParentNode.InsertAfter(shape.LastChild, parentNode);
        // Remove shape.
        shape.Remove();
    }
}
// Save output document.
doc.Save(@"Test001\out.html");

Hope this helps.
Best regards.

Thanks for the solution you provided.

It really helped with the text boxes.

Now the question is whether a similar workaround is possible for WordArt.

Is it possible to do one of the following?

  1. Get the text from the WordArt and display only that text (even without style or color, just the text).
  2. If that is not possible at all, at least remove the WordArt objects.

Thanks

Hi Luiza,
If applicable, you can convert the WordArt objects and text boxes into images and replace the originals with them in your document before converting to HTML. This allows you to keep them in your output HTML. The method used is one I wrote a while ago and adapted for use here. Please see the code below for the implementation.

Document doc = new Document(dataDir + "Test In.doc");
DocumentBuilder builder = new DocumentBuilder(doc);
Node[] nodes = doc.GetChildNodes(NodeType.Shape, true).ToArray();
int shapeID = 0;
foreach(Shape shape in nodes)
{
    if (shape.IsWordArt || shape.ShapeType == ShapeType.TextBox)
    {
        Image image = RenderShapeToImage(doc, shape, shapeID);
        builder.MoveTo(shape);
        builder.InsertImage(image);
        shape.Remove();
    }
    shapeID++;
}
doc.Save("Test Out.mhtml");

private static Image RenderShapeToImage(Document doc, Shape shape, int shapeID)
{
    // Create a clone of the document in which to render just the shape.
    Document tempDoc = doc.Clone();
    // Using ImportNode seems to not work for Shape so instead we clone the shape in the
    // document then remove all nodes and reinsert it at the very beginning of the document.
    NodeCollection nodes = tempDoc.GetChildNodes(NodeType.Shape, true);
    Shape docShape = (Shape)nodes[shapeID];
    Shape shapeCopy = (Shape)docShape.Clone(true);
    shapeCopy.WrapType = WrapType.None;
    tempDoc.Sections.Clear();
    tempDoc.EnsureMinimum();
    // Set the shape to be at the very top of the document and the page size to the size of
    // the shape so we are only rendering this shape, making it much faster.
    Section firstSection = tempDoc.Sections[0];
    firstSection.PageSetup.LeftMargin = 0;
    firstSection.PageSetup.TopMargin = 0;
    firstSection.PageSetup.PageWidth = shapeCopy.Width + ConvertUtil.PixelToPoint(10); // Add 10 extra pixels to avoid the edge being cut off
    firstSection.PageSetup.PageHeight = shapeCopy.Height + ConvertUtil.PixelToPoint(10);
    // Ensure shape is against the margins
    shapeCopy.Left = 0;
    shapeCopy.Top = 0;
    // Insert into document
    tempDoc.FirstSection.Body.FirstParagraph.AppendChild(shapeCopy);
    // Render the document which is displaying just the shape to an image stream
    MemoryStream stream = new MemoryStream();
    tempDoc.SaveToImage(0, 1, stream, ImageFormat.Bmp, null);
    // Rewind the stream so the image is read from the start.
    stream.Position = 0;
    System.Drawing.Image image = Image.FromStream(stream);
    // Return the rendered image
    return image;
}

Attached are the document file that is loaded in and the html file with the replaced images generated by running the code.
Please note that RTL text is not supported while rendering at this time. This is why the text in the textbox is moved to the left. This will be supported sometime in the future.
If this is not suitable for you, you can retrieve the text from a WordArt shape with this code:

string wordArtText = shape.TextPath.Text;

To delete all WordArt shapes in a document:

Node[] nodes = doc.GetChildNodes(NodeType.Shape, true).ToArray();
foreach(Shape shape in nodes)
{
    if (shape.IsWordArt)
    {
        shape.Remove();
    }
}

Please feel free to ask if you have any further queries.
Thanks,

Hi

Thanks for the great reply.

Instead of converting the WordArt to an image, is it possible to just take its text and insert that in its place?

I mean changing the line with builder.InsertImage(image);

to something like builder.InsertText(textFromWordArt).

How could this be done?

Hi

Thanks for your inquiry. Sure, you can. You can use almost the same approach as I suggested for TextBoxes. Please try using the following code:

// Open document.
Document doc = new Document(@"Test001\in.doc");
// Get all shapes.
Node[] shapes = doc.GetChildNodes(NodeType.Shape, true).ToArray();
// Loop over all shapes and replace WordArt shapes with their text.
foreach(Shape shape in shapes)
{
    // If the text of TextPath is not empty, we can assume this is a WordArt shape.
    if (!string.IsNullOrEmpty(shape.TextPath.Text))
    {
        // Create a Run, which will represent the text of the WordArt shape.
        Run run = new Run(doc, shape.TextPath.Text);
        run.Font.Name = shape.TextPath.FontFamily;
        run.Font.Size = shape.TextPath.Size;
        // Insert Run with text at the shape position.
        shape.ParentNode.InsertAfter(run, shape);
        // Remove shape.
        shape.Remove();
    }
}
// Save output document.
doc.Save(@"Test001\out.html");

Hope this helps. Please let me know if you need more assistance, I will be glad to help you.
Best regards.

The issues you have found earlier (filed as 4839) have been fixed in this update.

Hello!
Thank you for your patience.
Contents inside text boxes are now output in raster form. You can try the latest version. However, floating objects are still positioned inline. We are still considering how to resolve this issue. Sorry for the inconvenience.
Regards,

The issues you have found earlier (filed as 1144) have been fixed in this update.
