Free Support Forum - aspose.com

Can I extract lists from Word documents and convert them to HTML strings?

Hi Aspose team,

I would like to get individual lists from Word documents and convert them to HTML.

If a Word document had the following form:

"Text in Word document. Here’s a list:

  1. One
  2. Two
  3. Three
Word document end."

I would like to get:

"
<ol>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ol>
"

I had the same problem with tables (http://www.aspose.com/community/forums/280791/can-i-extract-tables-from-word-documents-and-convert-them-to-html-strings/showthread.aspx#280791). The problem is that while there is a Table node, there is no List node, so I can’t use the code provided in the earlier post.

Thanks in advance,

Alan





Hi

Thanks for your request. Aspose.Words exports lists to HTML as simple paragraphs. This was done so that list bullets and numbering are reproduced more accurately. As you may know, HTML has no native way to output multilevel lists. Incidentally, MS Word exports lists in a similar way.

However, we will consider adding an option to control how lists are exported to HTML. Your request has been linked to the appropriate issue, and you will be notified as soon as it is resolved.

Best regards,

Hi,

I am facing this issue too. Please let me know whether it has been resolved. If so, it would help if you could provide sample code that achieves this.

Thanks and regards,
Damodar

Hello

Thanks for your inquiry. Unfortunately, this issue is still unresolved. You will be notified as soon as it is fixed. Sorry for the inconvenience.

Best regards,

Hi there,


Thanks for your inquiry.

You can still achieve this fairly easily by implementing your own visitor, which visits the paragraphs in the document and builds HTML lists from them. Please see the code below, which demonstrates this. You just need to pass in one of the paragraphs belonging to the list you want to extract, and the entire list will be extracted to HTML.

// Usage: the paragraph passed in must itself be a list item,
// otherwise ExtractAsHtml throws an ArgumentException.
string listHtml = ListVisitor.ExtractAsHtml(doc.FirstSection.Body.FirstParagraph);

using System;
using System.Text;
using Aspose.Words;
using Aspose.Words.Lists;

public class ListVisitor : DocumentVisitor
{
    private int mListId = -1;
    private int mCurrentLevel = -1;          // -1 means no list has been opened yet
    private Paragraph mPreviousListItem;
    private StringBuilder mHtmlBuilder = new StringBuilder();

    private ListVisitor(int listId)
    {
        mListId = listId;
    }

    // Visits the whole document and collects every item of the list that para belongs to.
    public static string ExtractAsHtml(Paragraph para)
    {
        if (!para.IsListItem)
            throw new ArgumentException("Paragraph must be a list item");

        ListVisitor visitor = new ListVisitor(para.ListFormat.List.ListId);
        para.Document.Accept(visitor);
        return visitor.mHtmlBuilder.ToString();
    }

    public override VisitorAction VisitParagraphStart(Paragraph paragraph)
    {
        if (IsListItem(paragraph))
        {
            // Open or close a list tag if the level changed, then write the item.
            CheckAndAddListTags(paragraph.ListFormat);
            mHtmlBuilder.Append("<li>");
            mHtmlBuilder.Append(paragraph.ToTxt().Trim());
            mHtmlBuilder.AppendLine("</li>");
            mPreviousListItem = paragraph;
        }

        return VisitorAction.Continue;
    }

    public override VisitorAction VisitDocumentEnd(Document doc)
    {
        // Pretend we are one level deeper than the last item so that
        // CheckAndAddListTags emits the closing tag for its list.
        mCurrentLevel++;
        CheckAndAddListTags(mPreviousListItem.ListFormat);
        return VisitorAction.Continue;
    }

    // True only for list items that belong to the list being extracted.
    private bool IsListItem(Paragraph para)
    {
        return para.IsListItem && para.ListFormat.List.ListId == mListId;
    }

    private bool IsOrderedList(ListLevel listLevel)
    {
        return listLevel.NumberStyle != NumberStyle.Bullet;
    }

    // Note: at most one tag is written per call, so a change of more than
    // one level at a time is not fully opened or closed.
    private void CheckAndAddListTags(ListFormat format)
    {
        if (format.ListLevelNumber > mCurrentLevel)
            mHtmlBuilder.AppendLine(IsOrderedList(format.ListLevel) ? "<ol>" : "<ul>");
        else if (format.ListLevelNumber < mCurrentLevel)
            mHtmlBuilder.AppendLine(IsOrderedList(mPreviousListItem.ListFormat.ListLevel) ? "</ol>" : "</ul>");

        mCurrentLevel = format.ListLevelNumber;
    }
}
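If your lists are nested, note that the visitor writes at most one opening or closing tag per paragraph, so an item that drops back by two or more levels can leave inner lists unclosed. One possible way around this is to keep a stack of the closing tags that are still pending. The sketch below shows only that nesting logic and does not use Aspose.Words at all: ListItem and ListHtmlSketch are hypothetical names made up for illustration, and levels are assumed to be zero-based, as ListLevelNumber is. It also HTML-encodes the item text, which the visitor does not do. Like the visitor, it writes a nested list between items rather than inside the parent <li>, which browsers render but a strict validator may reject.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical stand-in for a list paragraph: its level, numbering kind, and text.
public class ListItem
{
    public int Level { get; }
    public bool Ordered { get; }
    public string Text { get; }

    public ListItem(int level, bool ordered, string text)
    {
        Level = level;
        Ordered = ordered;
        Text = text;
    }
}

public static class ListHtmlSketch
{
    public static string Build(IEnumerable<ListItem> items)
    {
        var html = new StringBuilder();
        // Closing tags of the lists that are currently open, innermost on top.
        var openTags = new Stack<string>();

        foreach (ListItem item in items)
        {
            // Close lists until we are no deeper than this item.
            while (openTags.Count > item.Level + 1)
                html.AppendLine(openTags.Pop());

            // Open lists until we reach this item's depth.
            while (openTags.Count < item.Level + 1)
            {
                html.AppendLine(item.Ordered ? "<ol>" : "<ul>");
                openTags.Push(item.Ordered ? "</ol>" : "</ul>");
            }

            html.Append("<li>")
                .Append(WebUtility.HtmlEncode(item.Text))
                .AppendLine("</li>");
        }

        // Close whatever is still open at the end of the input.
        while (openTags.Count > 0)
            html.AppendLine(openTags.Pop());

        return html.ToString();
    }
}
```

To apply the same idea inside the visitor, you would push the closing tag whenever CheckAndAddListTags opens a list and pop from the stack in a loop, both when the level drops and in VisitDocumentEnd.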


Thanks,

The issues you have found earlier (filed as WORDSNET-1170) have been fixed in this .NET update and this Java update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.