Importing a node with numbered list; not really a numbered list?

Bandu · June 29, 2010, 12:30am

Hello Support,

I’m having problems getting MS Word numbered list into HTML content.

I have the attached sample program and a sample “Original.doc”.
I’ve observed that if you run the program as-is, the generated HTML does have <ol> and <li> elements.
However, if you un-comment the commented lines and replace the call convertDocumentToHTML(w_Doc) at line 15 with convertDocumentToHTML(w_TempDoc), the generated HTML does not have <ol> and <li> elements.

The interesting part though is that if you open the generated JustASection.doc in MS Word, MS Word shows the numbered list. It might as well be that MS Word “manages” to show it as a list, but it isn’t really a list in the output JustASection.doc.

My requirement is that the HTML content I get after saving JustASection.doc as HTML should have <ol> and <li> elements.

Is it a bug in the importNode API, or is something wrong with my code?

TIA.
Bandu.
P.S.:
(1) I’ve also tried doing a w_TempDoc.updateFields(), but in vain.
(2) I don’t remember which version of Aspose I downloaded back in Feb 2010, but the MANIFEST.MF file has the following content:

Manifest-Version: 1.0
Specification-Title: Aspose.Words for Java
Implementation-Title: Aspose.Words for Java
Specification-Version: 4.0.0.0
Implementation-Version: 4.0.0.0
Specification-Vendor: Aspose Pty Ltd
Implementation-Vendor: Aspose Pty Ltd
Copyright: Copyright 2003-2009 Aspose Pty Ltd

alexey.noskov · June 29, 2010, 5:06am

Hi

Thank you for reporting this problem to us. I managed to reproduce it on my side. Your request has been linked to the appropriate issue. You will be notified as soon as it is resolved.

Best regards.

Bandu · June 29, 2010, 6:36am

Hi,

Thanks for the info. Is there some issue/ defect id related to this issue and a place to track its status?

Thanks,
Bandu.

alexey.noskov · June 29, 2010, 8:42am

Hi

Thanks for your request. There is no public access to our defect tracking system. So you cannot check the issue status there. We will inform you in this forum thread once there is some progress with this issue.

Best regards.

Bandu · June 29, 2010, 9:51am

OK. Thanks.

Another related problem that I am facing is that when using the DocumentVisitor model, there is no corresponding API for Lists - the way you have for Tables. Ideally, there should’ve been one for an entire list, since if we use (say) visitParagraphxxx API, every list item would be fetched as a single line of text.

Does this make sense, or am I missing something in the DocumentVisitor model?

Thanks,
Bandu.

alexey.noskov · June 29, 2010, 11:09am

Hi

Thanks for your inquiry. In MS Word documents, list items are just paragraphs with special attributes. So the first item of a list can be at the beginning of the document, the last item can be at the end of the document, and there can be a lot of content (which does not belong to the list) between these items. So there cannot be ListStart/ListEnd.

See the attached document for example.

Best regards.

Bandu · June 29, 2010, 12:26pm

I don’t see any attachment, but I got the idea.

So, my (revised) problem is as follows:

Consider a paragraph which is something like this:

```html
<b>Customer1:</b>
Customer with following licenses:
<ol><li>Developer.</li><li>Site.</li><li>OEM.</li></ol>Some more text.

<b>Customer2:</b>
Licenses for this customer have expired.
```

I am currently using the DocumentVisitor model to read the document. I keep collecting paragraphs/ shapes/ tables as and when they occur and as soon as I am done with a customer, I generate an HTML out of my collection of Nodes that I’ve collected for that customer. Normal paras, tables, and shapes work well so far, but as you can see, with my Customer1, using the current approach, I get 3 different paragraphs for Developer, Site, and OEM; and they end up as 1. Developer., 1. Site., and 1. OEM. entries in my generated HTML. What should I do to get them as <ol> and <li> items in the generated HTML?

Would the issue that you have taken up for resolving, help me in any way to achieve <ol> and <li> in my generated HTML?

I understand that it would be difficult. It was easier for you to have visitTablexxx APIs since the paragraph breaks for table cells are of different type than for a normal paragraph break. But, I see that a paragraph break for each numbered list item in MS Word is the same as a normal paragraph break. So, I guess, it would be difficult for you guys to distinguish between a normal paragraph break and a break appearing for a list item. However, I also notice some symbol between the number and the text which looks like → (an arrow). What symbol is this? and would it be possible to distinguish a list item paragraph break from a normal paragraph break?

TIA.
Bandu.

alexey.noskov · June 29, 2010, 12:39pm

Hi

Thanks for your request. Have you tried using NodeImporter to import nodes from one document to another? When NodeImporter is used, lists should be preserved. At least numbering should be preserved upon exporting to HTML:

https://reference.aspose.com/words/java/com.aspose.words/NodeImporter

This “arrow” is just a simple tab character. It is not difficult to distinguish between a simple paragraph and a list item. See the IsListItem property:
https://reference.aspose.com/words/java/com.aspose.words/ParagraphFormat

Best regards.
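
For readers following along, the idea of using a per-paragraph list-item flag to rebuild `<ol>`/`<li>` markup can be sketched roughly as below. This is only an illustration under stated assumptions: `Para` is a hypothetical stand-in for a visited paragraph (it is not an Aspose.Words type), its `listItem` field plays the role of the IsListItem check mentioned above, and text is emitted without HTML escaping.

```java
import java.util.List;

final class ListItemGrouping {
    /** Hypothetical stand-in for a visited paragraph; NOT an Aspose.Words class. */
    static final class Para {
        final String text;
        final boolean listItem; // plays the role of the IsListItem check
        Para(String text, boolean listItem) {
            this.text = text;
            this.listItem = listItem;
        }
    }

    /** Wraps each run of consecutive list-item paragraphs in a single <ol>. */
    static String toHtml(List<Para> paras) {
        StringBuilder html = new StringBuilder();
        boolean inList = false;
        for (Para p : paras) {
            if (p.listItem && !inList) {
                html.append("<ol>");   // a run of list items starts here
                inList = true;
            } else if (!p.listItem && inList) {
                html.append("</ol>");  // the run ended with the previous paragraph
                inList = false;
            }
            // Note: text is not HTML-escaped in this sketch.
            html.append(p.listItem ? "<li>" : "<p>")
                .append(p.text)
                .append(p.listItem ? "</li>" : "</p>");
        }
        if (inList) {
            html.append("</ol>");      // close a list that runs to the end
        }
        return html.toString();
    }

    public static void main(String[] args) {
        List<Para> paras = List.of(
            new Para("Customer with following licenses:", false),
            new Para("Developer.", true),
            new Para("Site.", true),
            new Para("OEM.", true),
            new Para("Some more text.", false));
        System.out.println(toHtml(paras));
    }
}
```

Because a list's items need not be adjacent in a Word document, this sketch treats every contiguous run as its own list; grouping by the actual list identity would need extra bookkeeping.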

Bandu · June 29, 2010, 12:49pm

No, I haven’t explored these options. I’ll give these a try first thing tomorrow morning and will update you on the same.

Many thanks for a quick reply and all the suggestions.

Regards,
Bandu.

alexey.noskov · June 29, 2010, 12:56pm

Also, please note, you should use the same instance of NodeImporter to import all nodes from one document to another. Do not create a separate instance of NodeImporter for each node.

Best regards.
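
One plausible way to picture why a single importer instance matters is the toy model below. It is an assumption-laden sketch, not Aspose.Words code: `Dest` and `Importer` are hypothetical classes, and the idea that the importer remembers a source-list to destination-list mapping is a guess at the mechanism, used here only to show how per-node importers could make every item restart at 1.

```java
import java.util.HashMap;
import java.util.Map;

final class ImporterSketch {
    /** Hypothetical destination document that hands out new list ids. */
    static final class Dest {
        private int nextListId = 1;
        int createList() { return nextListId++; }
        int listCount() { return nextListId - 1; }
    }

    /** Hypothetical importer that remembers which destination list each source list maps to. */
    static final class Importer {
        private final Dest dest;
        private final Map<Integer, Integer> srcToDstList = new HashMap<>();

        Importer(Dest dest) { this.dest = dest; }

        /** Returns the destination list id for an item belonging to srcListId. */
        int importListItem(int srcListId) {
            // A new destination list is created only the first time this importer sees srcListId.
            return srcToDstList.computeIfAbsent(srcListId, id -> dest.createList());
        }
    }

    public static void main(String[] args) {
        // One importer for all three items: they share one destination list.
        Dest shared = new Dest();
        Importer importer = new Importer(shared);
        for (int i = 0; i < 3; i++) {
            importer.importListItem(7);
        }
        System.out.println("lists with shared importer: " + shared.listCount());

        // A fresh importer per item: each item lands in its own list.
        Dest perNode = new Dest();
        for (int i = 0; i < 3; i++) {
            new Importer(perNode).importListItem(7);
        }
        System.out.println("lists with per-node importers: " + perNode.listCount());
    }
}
```

In this model the shared importer yields one list for all three items, while per-node importers yield three separate one-item lists, which would match the "1. Developer. 1. Site. 1. OEM." symptom described earlier in the thread.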

Bandu · June 30, 2010, 12:02am

Hi,

Thanks for your suggestions.

I used the isListItem method to keep a separate collection of list item paragraphs and it worked well.

Then, after having collected these paras, I passed them all to a single function and used NodeImporter to import these paras.

However, I still do not get <ol> and <li> in my HTML. But, as you said earlier, at least numbering is preserved in the generated HTML.
Following is my API:
```
private String getParagraphArrayAsHTML(ArrayList<Paragraph> parasOfListItems) throws Exception
{
    if(parasOfListItems == null || parasOfListItems.size() == 0)
        return "";
    Document tempDoc = new Document();
    NodeImporter nodeImportr = new NodeImporter(m_Doc, tempDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
    Paragraph singlePara = null;
    for(int i = 0; i < parasOfListItems.size(); i++)
    {
        singlePara = parasOfListItems.get(i);
        //tempDoc.getFirstSection().getBody().appendChild(tempDoc.importNode(singlePara, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));
        tempDoc.getFirstSection().getBody().appendChild(nodeImportr.importNode(singlePara, true));
    }
    tempDoc.save("temp.doc");
    return convertDocumentToHTML(tempDoc);
}
```
So, the only change so far has been that I get proper numbers in the generated HTML. Is there a possibility that I would get <ol> and <li> now, or maybe in some future release?

Thanks,
Bandu.

Edit: Somewhere earlier I forgot to mention that if I do a convertDocumentToHTML on a document that wasn’t created using Aspose (but in MS Word itself), then its numbered lists do get generated as HTML <ol> and <li> elements. So, to sum it up, a doc.save(OutputStream, SaveFormat.HTML) works as expected (that is, it generates <ol> and <li> elements) if the doc is generated in MS Word, but the same call does not work as expected if the doc is generated using Aspose.

alexey.noskov · June 30, 2010, 6:04am

Hi

Thank you for the additional information. It is nice that you at least have proper numbering in the output HTML. It seems the problem occurs because non-standard numbering is used. At least Aspose.Words considers imported lists as non-standard. So as a possible solution, you can try resetting numbering after importing list items using the ListFormat.ApplyNumberDefault or ListFormat.ApplyBulletDefault methods.

Hope this helps.

Best regards.

Bandu · June 30, 2010, 11:21am

Thanks for the info, but unfortunately it still gives the same behavior. I tried it in two ways:

(1) After my paragraphs have been added, I did the following:

// after having imported the paragraphs using either importNode or NodeImporter, do:
if(tempDoc.getLists() != null && tempDoc.getLists().getCount() > 0 && tempDoc.getLists().get(0).getListLevels() != null && tempDoc.getLists().get(0).getListLevels().getCount() > 0)
{
    System.out.println("Doing something with NumberFormat");
    tempDoc.getLists().get(0).getListLevels().get(0).setAlignment(ListLevelAlignment.LEFT);
    tempDoc.getLists().get(0).getListLevels().get(0).setStartAt(1);
    tempDoc.getLists().get(0).getListLevels().get(0).setNumberStyle(NumberStyle.ARABIC);
    tempDoc.getLists().get(0).getListLevels().get(0).setNumberFormat("\u0000");
}

(and some combination with the set APIs in the ListLevel object)

(2) Generated the tempDoc itself by adding list items to it:

Paragraph w_P = null;
com.aspose.words.List list = tempDoc.getLists().add(ListTemplate.NUMBER_DEFAULT);
ListLevel level1 = list.getListLevels().get(0);
level1.setNumberStyle(NumberStyle.ARABIC);
level1.setStartAt(1);
level1.setNumberFormat("\u0000");
DocumentBuilder builder = new DocumentBuilder(tempDoc);
builder.getListFormat().setList(list);
for(int i = 0; i < a_Paras.size(); i++)
{
    w_P = a_Paras.get(i);
    builder.writeln("Item:" + i);
}
tempDoc.save("temp.doc");
return convertDocumentToHTML(tempDoc);

Both (1) and (2) above give me the same behavior - i.e. MS Word manages to show proper list items in the generated temp.doc, but the convertDocumentToHTML API does not have any <ol> and <li> elements. Just the numbered text in span elements, something like this:

<span>1</span><span>Aspose</span> <span>2</span><span>Australia</span> <span>3</span><span>India</span> <span>4</span><span>New Zealand</span>

Regards,
Bandu.

alexey.noskov · July 1, 2010, 2:26am

Hi

Thank you for additional information. But this is not exactly what I meant. Here is simple code, which demonstrates the technique I suggested.

// Open destination and source documents.
// In our case the source document contains two lists (numbered and bulleted).
Document dst = new Document("C:\\Temp\\dst.doc");
Document src = new Document("C:\\Temp\\src.doc");
// Create NodeImporter, which will be used to import nodes from the source document.
NodeImporter importer = new NodeImporter(src, dst, ImportFormatMode.USE_DESTINATION_STYLES);
List bulletedList = null;
List numberedList = null;
// Just to demonstrate the technique, we will import only paragraphs from source documents.
for (Paragraph par : src.getFirstSection().getBody().getParagraphs())
{
    // Import paragraph into the destination document.
    Paragraph dstParagraph = (Paragraph)importer.importNode(par, true);
    if (par.isListItem())
    {
        boolean isBulletedList = dstParagraph.getListFormat().getListLevel().getNumberStyle() == NumberStyle.BULLET;
        // Create a new paragraph and copy all content of the source paragraph into the newly created one.
        Paragraph tmpParagraph = new Paragraph(dst);
        for (Node child : dstParagraph.getChildNodes())
            tmpParagraph.appendChild(child);
        dstParagraph = tmpParagraph;
        if (isBulletedList)
        {
            if (bulletedList == null)
            {
                dstParagraph.getListFormat().applyBulletDefault();
                bulletedList = dstParagraph.getListFormat().getList();
            }
            else
            {
                dstParagraph.getListFormat().setList(bulletedList);
            }
        }
        else
        {
            if (numberedList == null)
            {
                dstParagraph.getListFormat().applyNumberDefault();
                numberedList = dstParagraph.getListFormat().getList();
            }
            else
            {
                dstParagraph.getListFormat().setList(numberedList);
            }
        }
    }
    // Insert the paragraph into the destination document.
    dst.getFirstSection().getBody().appendChild(dstParagraph);
}
// Save output document
dst.save("C:\\Temp\\out.html");

Hope this helps.

Best regards.

Bandu · July 1, 2010, 8:56am

Thanks for the code, but it still gives the same HTML output. If you save the dst document as HTML, then it does not have any <ol> and <li> elements. Just <span> elements.

Regards,

alexey.noskov · July 1, 2010, 10:45am

Hi

Thank you for additional information. The code works fine on my side. The output document contains properly formatted HTML list. Please see the attached source documents and output HTML document.

Best regards.

aspose.notifier · December 31, 2012, 1:25am

The issues you have found earlier (filed as WORDSNET-3616) have been fixed in this .NET update and this Java update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.