I’m having problems getting an MS Word numbered list into HTML content.
I have the attached sample program and a sample “Original.doc”.
I’ve observed that if you run the program as-is, the generated HTML <u><b>does have</b></u> <ol> and <li> elements.
However, if you un-comment the commented lines and replace the call convertDocumentToHTML(w_Doc) at line 15 with convertDocumentToHTML(w_TempDoc), the generated HTML <u><b>does not</b></u> have <ol> and <li> elements.
The interesting part, though, is that if you open the generated JustASection.doc in MS Word, MS Word shows the numbered list. It may well be that MS Word “manages” to show it as a list, but it isn’t really a list in the output JustASection.doc.
My requirement is that the HTML content I get after saving JustASection.doc as HTML should have <ol> and <li> elements.
Is it a bug in the importNode API, or is something wrong with my code?
TIA.
Bandu.
P.S.:
(1) I’ve also tried doing a w_TempDoc.updateFields(), but in vain.
(2) I don’t remember which version of Aspose I downloaded back in Feb 2010, but the MANIFEST.MF file has the following content:
Thank you for reporting this problem to us. I managed to reproduce it on my side. Your request has been linked to the appropriate issue. You will be notified as soon as it is resolved.
Thanks for your request. There is no public access to our defect tracking system. So you cannot check the issue status there. We will inform you in this forum thread once there is some progress with this issue.
Another related problem that I am facing is that when using the DocumentVisitor model, there is no corresponding API for Lists - the way you have for Tables. Ideally, there should’ve been one for an entire list, since if we use (say) visitParagraphxxx API, every list item would be fetched as a single line of text.
Does this make sense, or am I missing something in the DocumentVisitor model?
Thanks for your inquiry. In MS Word documents, list items are just paragraphs with special attributes. So the first item of a list can be at the beginning of the document, the last item can be at the end of the document, and there can be a lot of content (which does not belong to the list) between these items. That is why there cannot be ListStart/ListEnd.
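Since list items are ordinary paragraphs carrying a list attribute, one way to get list markup from a visitor-based collection is to group consecutive list-item paragraphs yourself. The following is only a minimal sketch of that grouping in plain Java: `Para` and `toHtml` are hypothetical names, not part of Aspose.Words, and the `listItem` flag stands in for what `Paragraph.isListItem()` would report. It assumes one flat numbered list per run of items, ignores list identity and nesting levels, and does not escape the text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: groups consecutive list-item paragraphs into one <ol>.
class ListGroupingSketch {

    // Stand-in for a collected paragraph; in Aspose.Words the flag would
    // come from Paragraph.isListItem().
    static final class Para {
        final String text;
        final boolean listItem;

        Para(String text, boolean listItem) {
            this.text = text;
            this.listItem = listItem;
        }
    }

    // Emits <p> for normal paragraphs and wraps each run of list items in <ol>.
    static String toHtml(List<Para> paras) {
        StringBuilder html = new StringBuilder();
        boolean inList = false;
        for (Para p : paras) {
            if (p.listItem && !inList) {
                html.append("<ol>");   // a run of list items starts here
                inList = true;
            }
            if (!p.listItem && inList) {
                html.append("</ol>");  // the run ended with the previous paragraph
                inList = false;
            }
            html.append(p.listItem ? "<li>" : "<p>")
                .append(p.text)
                .append(p.listItem ? "</li>" : "</p>");
        }
        if (inList) {
            html.append("</ol>");      // close a list that runs to the end
        }
        return html.toString();
    }

    public static void main(String[] args) {
        List<Para> customer1 = Arrays.asList(
            new Para("Customer with following licenses:", false),
            new Para("Developer.", true),
            new Para("Site.", true),
            new Para("OEM.", true),
            new Para("Some more text.", false));
        System.out.println(toHtml(customer1));
    }
}
```

Because the grouping relies only on adjacency, two different lists that directly follow each other would be merged into one; comparing list identity as well would be needed to tell them apart.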
Consider a paragraph which is something like this:
```html
<b>Customer1:</b>
Customer with following licenses:
<ol><li>Developer.</li><li>Site.</li><li>OEM.</li></ol>Some more text.
<b>Customer2:</b>
Licenses for this customer have expired.
```
I am currently using the DocumentVisitor model to read the document. I keep collecting paragraphs, shapes, and tables as and when they occur, and as soon as I am done with a customer, I generate an HTML out of the Nodes that I’ve collected for that customer. Normal paras, tables, and shapes work well so far, but as you can see, with my Customer1, using the current approach, I get 3 different paragraphs for Developer, Site, and OEM; and they end up as “1. Developer.”, “1. Site.”, and “1. OEM.” entries in my generated HTML. What should I do to get them as <ol> and <li> items in the generated HTML?
Would the issue that you have taken up for resolving, help me in any way to achieve <ol> and <li> in my generated HTML?
I understand that it would be difficult. It was easier for you to have visitTablexxx APIs since the paragraph breaks for table cells are of a different type than a normal paragraph break. But I see that the paragraph break for each numbered list item in MS Word is the same as a normal paragraph break. So, I guess, it would be difficult for you guys to distinguish between a normal paragraph break and a break appearing for a list item. However, I also notice a symbol between the number and the text which looks like → (an arrow). What symbol is this? And would it be possible to use it to distinguish a list item paragraph break from a normal paragraph break?
Thanks for your request. Have you tried using NodeImporter to import nodes from one document to another? When NodeImporter is used, lists should be preserved. At least the numbering should be preserved upon exporting to HTML:
Also, please note, you should use the same instance of NodeImporter to import all nodes from one document to another. Do not create a separate instance of NodeImporter for each node.
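To illustrate why a single shared importer matters, here is a toy model in plain Java. It is not Aspose's implementation: `ToyImporter`, `ToyDestination`, and `listsCreated` are hypothetical names, and the assumption being modelled is that an importer remembers which destination list each source list was already mapped to. Under that assumption, a fresh importer per node forgets the mapping and creates a new destination list for every item, which would be one plausible reason for every item being numbered “1.”.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; the real NodeImporter is not implemented this way.
class ImporterSketch {

    // Hypothetical destination document that just counts its lists.
    static final class ToyDestination {
        int listCount = 0;
    }

    // Hypothetical importer that caches the source-list to destination-list mapping.
    static final class ToyImporter {
        private final ToyDestination dst;
        private final Map<Integer, Integer> mapped = new HashMap<>();

        ToyImporter(ToyDestination dst) {
            this.dst = dst;
        }

        // Returns the destination list id, creating a list only on first sight.
        int importListItem(int srcListId) {
            Integer id = mapped.get(srcListId);
            if (id == null) {
                id = dst.listCount++;
                mapped.put(srcListId, id);
            }
            return id;
        }
    }

    // Imports `items` paragraphs that all belong to the same source list.
    static int listsCreated(boolean shareImporter, int items) {
        ToyDestination dst = new ToyDestination();
        ToyImporter shared = new ToyImporter(dst);
        for (int i = 0; i < items; i++) {
            ToyImporter importer = shareImporter ? shared : new ToyImporter(dst);
            importer.importListItem(7); // 7 is an arbitrary source list id
        }
        return dst.listCount;
    }

    public static void main(String[] args) {
        System.out.println("shared importer:   " + listsCreated(true, 3) + " list(s)");
        System.out.println("importer per node: " + listsCreated(false, 3) + " list(s)");
    }
}
```

In this model the shared importer produces one destination list holding all three items, while the per-node variant produces three single-item lists.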
So, the only change so far has been that I get proper numbers in the generated HTML. Is there a possibility that I would get <ol> and <li> elements, now or maybe in some future release?
Thanks,
Bandu.
Edit: Somewhere earlier I forgot to mention that if I do a convertDocumentToHTML on a document that wasn’t created using Aspose (but in MS Word itself), then its numbered lists <u><b>do</b></u> get generated as HTML <ol> and <li> elements. So, to sum it up, a doc.save(OutputStream, SaveFormat.HTML) <b><u>works</u></b> as expected<sup>†</sup> if the doc is generated in MS Word, but the same call <u><b>does not</b></u> work as expected<sup>†</sup> if the doc is generated using Aspose.
Thank you for the additional information. It is nice that you at least have proper numbering in the output HTML. It seems the problem occurs because non-standard numbering is used. At least, Aspose.Words considers imported lists as non-standard. So, as a possible solution, you can try resetting the numbering after importing list items using the ListFormat.ApplyNumberDefault or ListFormat.ApplyBulletDefault methods.
Thanks for the info, but unfortunately it still gives the same behavior. I tried it in two ways:
(1) After my paragraphs have been added, I did the following:
// after having imported the paragraphs using either importNode or NodeImporter, do:
if (tempDoc.getLists() != null
        && tempDoc.getLists().getCount() > 0
        && tempDoc.getLists().get(0).getListLevels() != null
        && tempDoc.getLists().get(0).getListLevels().getCount() > 0)
{
    System.out.println("Doing something with NumberFormat");
    tempDoc.getLists().get(0).getListLevels().get(0).setAlignment(ListLevelAlignment.LEFT);
    tempDoc.getLists().get(0).getListLevels().get(0).setStartAt(1);
    tempDoc.getLists().get(0).getListLevels().get(0).setNumberStyle(NumberStyle.ARABIC);
    tempDoc.getLists().get(0).getListLevels().get(0).setNumberFormat("\u0000");
}
(and some combination with the set APIs in the ListLevel object)
(2) Generated the tempDoc itself by adding list items to it:
Paragraph w_P = null;
com.aspose.words.List list = tempDoc.getLists().add(ListTemplate.NUMBER_DEFAULT);
ListLevel level1 = list.getListLevels().get(0);
level1.setNumberStyle(NumberStyle.ARABIC);
level1.setStartAt(1);
level1.setNumberFormat("\u0000");
DocumentBuilder builder = new DocumentBuilder(tempDoc);
builder.getListFormat().setList(list);
for (int i = 0; i < a_Paras.size(); i++)
{
    w_P = a_Paras.get(i);
    builder.writeln("Item:" + i);
}
tempDoc.save("temp.doc");
return convertDocumentToHTML(tempDoc);
Both (1) and (2) above give me the same behavior - i.e. MS Word manages to show proper list items in the generated temp.doc, but the convertDocumentToHTML API does not have any <ol> and <li> elements. Just the numbered text in span elements, something like this:
Thank you for additional information. But this is not exactly what I meant. Here is simple code, which demonstrates the technique I suggested.
// Open destination and source documents.
// In our case the source document contains two lists (numbered and bulleted).
Document dst = new Document("C:\\Temp\\dst.doc");
Document src = new Document("C:\\Temp\\src.doc");
// Create the NodeImporter, which will be used to import nodes from the source document.
NodeImporter importer = new NodeImporter(src, dst, ImportFormatMode.USE_DESTINATION_STYLES);
// Fully qualified to avoid a clash with java.util.List.
com.aspose.words.List bulletedList = null;
com.aspose.words.List numberedList = null;
// Just to demonstrate the technique, we will import only paragraphs from the source document.
for (Paragraph par : src.getFirstSection().getBody().getParagraphs())
{
    // Import the paragraph into the destination document.
    Paragraph dstParagraph = (Paragraph)importer.importNode(par, true);
    if (par.isListItem())
    {
        boolean isBulletedList = dstParagraph.getListFormat().getListLevel().getNumberStyle() == NumberStyle.BULLET;
        // Create a new paragraph and move all content of the imported paragraph into it.
        // Draining from the front avoids modifying the child collection while iterating over it.
        Paragraph tmpParagraph = new Paragraph(dst);
        while (dstParagraph.hasChildNodes())
            tmpParagraph.appendChild(dstParagraph.getFirstChild());
        dstParagraph = tmpParagraph;
        if (isBulletedList)
        {
            if (bulletedList == null)
            {
                // First bulleted item: create a default bulleted list and remember it.
                dstParagraph.getListFormat().applyBulletDefault();
                bulletedList = dstParagraph.getListFormat().getList();
            }
            else
            {
                // Later bulleted items reuse the same list.
                dstParagraph.getListFormat().setList(bulletedList);
            }
        }
        else
        {
            if (numberedList == null)
            {
                // First numbered item: create a default numbered list and remember it.
                dstParagraph.getListFormat().applyNumberDefault();
                numberedList = dstParagraph.getListFormat().getList();
            }
            else
            {
                // Later numbered items reuse the same list.
                dstParagraph.getListFormat().setList(numberedList);
            }
        }
    }
    // Insert the paragraph into the destination document.
    dst.getFirstSection().getBody().appendChild(dstParagraph);
}
// Save the output document.
dst.save("C:\\Temp\\out.html");
Thanks for the code, but it still gives the same HTML output. If you save the dst document as HTML, then it does not have any <ol> and <li> elements. Just <span> elements.
Thank you for additional information. The code works fine on my side. The output document contains properly formatted HTML list. Please see the attached source documents and output HTML document.