Free Support Forum - aspose.com

Importing a node with numbered list; not really a numbered list?

Hello Support,

I’m having problems getting MS Word numbered list into HTML content.

I have the attached sample program and a sample “Original.doc”.
I’ve observed that if you run the program as-is, the generated HTML does have <ol> and <li> elements. However, if you un-comment the commented lines and replace the call convertDocumentToHTML(w_Doc) at line 15 with convertDocumentToHTML(w_TempDoc), the generated HTML does not have <ol> and <li> elements.

The interesting part, though, is that if you open the generated JustASection.doc in MS Word, MS Word shows the numbered list. It may well be that MS Word “manages” to show it as a list even though it isn’t really a list in the output JustASection.doc.

My requirement is that the HTML content I get after saving JustASection.doc as HTML should have <ol> and <li> elements.

Is it a bug in the importNode API, or is something wrong with my code?

TIA.
Bandu.
P.S.:
(1) I’ve also tried calling w_TempDoc.updateFields(), but in vain.
(2) I don’t remember which version of Aspose I downloaded back in Feb 2010, but the MANIFEST.MF file has the following content:

        Manifest-Version: 1.0
        Specification-Title: Aspose.Words for Java
        Implementation-Title: Aspose.Words for Java
        Specification-Version: 4.0.0.0
        Implementation-Version: 4.0.0.0
        Specification-Vendor: Aspose Pty Ltd
        Implementation-Vendor: Aspose Pty Ltd
        Copyright: Copyright 2003-2009 Aspose Pty Ltd

Hi

Thank you for reporting this problem to us. I managed to reproduce it on my side. Your request has been linked to the appropriate issue. You will be notified as soon as it is resolved.

Best regards.

Hi,

Thanks for the info. Is there some issue/ defect id related to this issue and a place to track its status?

Thanks,
Bandu.

Hi

Thanks for your request. There is no public access to our defect tracking system. So you cannot check the issue status there. We will inform you in this forum thread once there is some progress with this issue.

Best regards.

OK. Thanks.

Another related problem that I am facing is that when using the DocumentVisitor model, there is no corresponding API for Lists - the way you have for Tables. Ideally, there should’ve been one for an entire list, since if we use (say) visitParagraphxxx API, every list item would be fetched as a single line of text.

Does this make sense, or am I missing something in the DocumentVisitor model?

Thanks,
Bandu.

Hi

Thanks for your inquiry. In MS Word documents, list items are just paragraphs with special attributes. The first item of a list can be at the beginning of the document and the last item at the end of the document, with a lot of content that does not belong to the list in between. So there cannot be ListStart/ListEnd.

See the attached document for example.
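
To make the point above concrete, here is a minimal sketch in plain Java. It does not use Aspose.Words at all: Para and ListMembership are hypothetical stand-ins, and the convention that listId 0 means "not a list item" is an assumption of this sketch. It only illustrates that list membership is an attribute of each paragraph, so items of one list can be separated by ordinary paragraphs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Word paragraph: text plus an optional list id.
// listId == 0 means "not a list item" (an assumption of this sketch).
final class Para {
    final String text;
    final int listId;

    Para(String text, int listId) {
        this.text = text;
        this.listId = listId;
    }

    boolean isListItem() {
        return listId != 0;
    }
}

final class ListMembership {
    // Collects the texts of all paragraphs that belong to the given list,
    // wherever they sit in the document body.
    static List<String> itemsOf(List<Para> body, int listId) {
        List<String> items = new ArrayList<>();
        for (Para p : body) {
            if (p.listId == listId) {
                items.add(p.text);
            }
        }
        return items;
    }

    // Returns true when other paragraphs sit between two items of the same
    // list, which is why a single start/end pair cannot bracket the list.
    static boolean isInterrupted(List<Para> body, int listId) {
        boolean seenItem = false;
        boolean gapAfterItem = false;
        for (Para p : body) {
            if (p.listId == listId) {
                if (gapAfterItem) {
                    return true;
                }
                seenItem = true;
            } else if (seenItem) {
                gapAfterItem = true;
            }
        }
        return false;
    }
}
```

With a body of "Developer." (list 1), "Some more text." (no list) and "Site." (list 1), itemsOf returns both list items and isInterrupted reports the gap between them.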

Best regards.

I don’t see any attachment, but I got the idea.

So, my (revised) problem is as follows:

Consider a paragraph which is something like this:

Customer1:
Customer with following licenses:

  1. Developer.
  2. Site.
  3. OEM.
Some more text.

Customer2:
Licenses for this customer have expired.

I am currently using the DocumentVisitor model to read the document. I keep collecting paragraphs/shapes/tables as and when they occur, and as soon as I am done with a customer, I generate HTML out of the collection of nodes that I’ve gathered for that customer. Normal paras, tables, and shapes work well so far, but as you can see, with my Customer1, using the current approach, I get 3 different paragraphs for Developer, Site, and OEM; and they end up as 1. Developer., 1. Site., and 1. OEM. entries in my generated HTML. What should I do to get them as <ol> and <li> items in the generated HTML?

Would the issue that you have taken up for resolving help me in any way to achieve <ol> and <li> in my generated HTML?

I understand that it would be difficult. It was easier for you to have visitTablexxx APIs since the paragraph breaks for table cells are of a different type than a normal paragraph break. But I see that the paragraph break for each numbered list item in MS Word is the same as a normal paragraph break. So, I guess, it would be difficult for you guys to distinguish between a normal paragraph break and a break appearing for a list item. However, I also notice some symbol between the number and the text which looks like → (an arrow). What symbol is this? And would it be possible to distinguish a list item paragraph break from a normal paragraph break?

TIA.
Bandu.

Hi

Thanks for your request. Have you tried using NodeImporter to import nodes from one document to another? When NodeImporter is used, lists should be preserved. At least the numbering should be preserved upon exporting to HTML:

http://www.aspose.com/documentation/java-components/aspose.words-for-java/com/aspose/words/nodeimporter.html

This “arrow” is just a simple tab character. It is not difficult to distinguish between a simple paragraph and a list item. See the IsListItem property:

http://www.aspose.com/documentation/java-components/aspose.words-for-java/com/aspose/words/paragraphformat.html#IsListItem
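
As an illustration of how such a flag could be used when producing HTML by hand, here is a minimal sketch in plain Java. Block and ListHtmlSketch are hypothetical names, and the listItem field merely stands in for whatever the IsListItem property reports. It is not how Aspose.Words exports HTML: it simply wraps each run of consecutive list-item paragraphs in one <ol> element.

```java
import java.util.List;

// Hypothetical stand-in for a collected paragraph: its text and whether
// the source paragraph was flagged as a list item.
final class Block {
    final String text;
    final boolean listItem;

    Block(String text, boolean listItem) {
        this.text = text;
        this.listItem = listItem;
    }
}

final class ListHtmlSketch {
    // Emits <p> for ordinary paragraphs and wraps each run of consecutive
    // list items in a single <ol> with one <li> per item.
    static String toHtml(List<Block> blocks) {
        StringBuilder html = new StringBuilder();
        boolean inList = false;
        for (Block b : blocks) {
            if (b.listItem && !inList) {
                html.append("<ol>");
                inList = true;
            } else if (!b.listItem && inList) {
                html.append("</ol>");
                inList = false;
            }
            String tag = b.listItem ? "li" : "p";
            html.append('<').append(tag).append('>')
                .append(escape(b.text))
                .append("</").append(tag).append('>');
        }
        if (inList) {
            html.append("</ol>"); // close a list that runs to the end
        }
        return html.toString();
    }

    // Minimal escaping of the characters that would break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

For the Customer1 example from earlier in the thread, this would put Developer., Site. and OEM. inside one <ol>, with the surrounding text as ordinary <p> elements.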

Best regards.

No, I haven’t explored these options. I’ll give these a try first thing tomorrow morning and will update you on the same.

Many thanks for a quick reply and all the suggestions.

Regards,
Bandu.

Also, please note, you should use the same instance of NodeImporter to import all nodes from one document to another. Do not create a separate instance of NodeImporter for each node.
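
A toy model in plain Java may help show why sharing one importer matters. ToyDestination and ToyImporter are hypothetical classes, and whether Aspose.Words works exactly this way internally is an assumption. The idea sketched here is that an importer remembers which source list it has already copied, so repeated imports reuse one destination list instead of creating a new list (and restarting numbering at 1) for every paragraph.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a destination document that hands out list ids.
final class ToyDestination {
    private int listCount = 0;

    int createList() {
        return ++listCount;
    }

    int getListCount() {
        return listCount;
    }
}

// Toy model only: an importer that caches the mapping from a source list id
// to the destination list it created for it. Not Aspose's implementation.
final class ToyImporter {
    private final ToyDestination dst;
    private final Map<Integer, Integer> listMap = new HashMap<>();

    ToyImporter(ToyDestination dst) {
        this.dst = dst;
    }

    // Returns the destination list id for an item of the given source list,
    // creating the destination list only the first time it is seen.
    int importListItem(int srcListId) {
        Integer mapped = listMap.get(srcListId);
        if (mapped == null) {
            mapped = dst.createList();
            listMap.put(srcListId, mapped);
        }
        return mapped;
    }
}
```

In this model, importing three items of the same source list through one ToyImporter leaves the destination with a single list, while using a fresh ToyImporter per item leaves it with three separate one-item lists.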

Best regards.

Hi,

Thanks for your suggestions.

I used the isListItem method to keep a separate collection of list item paragraphs and it worked well.

Then, after having collected these paras, I passed them all to a single function and used NodeImporter to import these paras.

However, I still do not get <ol> and <li> in my HTML. But, as you said earlier, at least numbering is preserved in the generated HTML.

Following is my API:

    private String getParagraphArrayAsHTML(ArrayList<Paragraph> parasOfListItems) throws Exception
    {
        if(parasOfListItems == null || parasOfListItems.size() == 0)
            return "";
        Document tempDoc = new Document();
        // A single NodeImporter instance is shared by all imported paragraphs.
        NodeImporter nodeImportr = new NodeImporter(m_Doc, tempDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        Paragraph singlePara = null;
        for(int i = 0; i < parasOfListItems.size(); i++)
        {
            singlePara = parasOfListItems.get(i);
            //tempDoc.getFirstSection().getBody().appendChild(tempDoc.importNode(singlePara, true, ImportFormatMode.KEEP_SOURCE_FORMATTING));
            tempDoc.getFirstSection().getBody().appendChild(nodeImportr.importNode(singlePara, true));
        }
        tempDoc.save("temp.doc");
        return convertDocumentToHTML(tempDoc);
    }

So, the only change so far has been that I get proper numbers in the generated HTML. Is there a possibility that I would get <ol> and <li> now, or maybe in some future release?

Thanks,
Bandu.

Edit: Somewhere earlier I forgot to mention that if I do a convertDocumentToHTML on a document that wasn’t created using Aspose (but in MS Word itself), then its numbered lists do get generated as HTML <ol> and <li> elements. So, to sum it up, a doc.save(OutputStream, SaveFormat.HTML) works as expected† if the doc is generated in MS Word, but the same call does not work as expected† if the doc is generated using Aspose.

† generating <ol>, <li> elements.

Hi

Thank you for the additional information. It is nice that you at least have proper numbering in the output HTML. It seems the problem occurs because non-standard numbering is used. At least Aspose.Words considers imported lists as non-standard. So as a possible solution, you can try resetting numbering after importing list items using the NumberFormat.ApplyNumberDefault or NumberFormat.ApplyBulletDefault methods.

Hope this helps.

Best regards.

Thanks for the info, but unfortunately it still gives the same behavior. I tried it in two ways:

(1) After my paragraphs have been added, I did the following:


// after having imported the paragraphs using either importNode or NodeImporter, do:
if(tempDoc.getLists() != null && tempDoc.getLists().getCount() > 0 && tempDoc.getLists().get(0).getListLevels() != null && tempDoc.getLists().get(0).getListLevels().getCount() > 0)
{
    System.out.println("Doing something with NumberFormat");
    tempDoc.getLists().get(0).getListLevels().get(0).setAlignment(ListLevelAlignment.LEFT);
    tempDoc.getLists().get(0).getListLevels().get(0).setStartAt(1);
    tempDoc.getLists().get(0).getListLevels().get(0).setNumberStyle(NumberStyle.ARABIC);
    tempDoc.getLists().get(0).getListLevels().get(0).setNumberFormat("\u0000");
}

(and some combination with the set APIs in the ListLevel object)

(2) Generated the tempDoc itself by adding list items to it:

Paragraph w_P = null;
com.aspose.words.List list = tempDoc.getLists().add(ListTemplate.NUMBER_DEFAULT);
ListLevel level1 = list.getListLevels().get(0);
level1.setNumberStyle(NumberStyle.ARABIC);
level1.setStartAt(1);
level1.setNumberFormat("\u0000");
DocumentBuilder builder = new DocumentBuilder(tempDoc);
builder.getListFormat().setList(list);
for(int i = 0; i < a_Paras.size(); i++)
{
    w_P = a_Paras.get(i);
    builder.writeln("Item:" + i);
}
tempDoc.save("temp.doc");
return convertDocumentToHTML(tempDoc);


Both (1) and (2) above give me the same behavior, i.e. MS Word manages to show proper list items in the generated temp.doc, but the output of the convertDocumentToHTML API does not have any <ol> and <li> elements. Just the numbered text in span elements, something like this:

1Aspose 2Australia 3India 4New Zealand

Regards,
Bandu.

Hi

Thank you for the additional information, but this is not exactly what I meant. Here is some simple code that demonstrates the technique I suggested.

// Open destination and source documents.
// In our case the source document contains two lists (numbered and bulleted).
Document dst = new Document("C:\\Temp\\dst.doc");
Document src = new Document("C:\\Temp\\src.doc");

// Create NodeImporter, which will be used to import nodes from the source document.
NodeImporter importer = new NodeImporter(src, dst, ImportFormatMode.USE_DESTINATION_STYLES);

List bulletedList = null;
List numberedList = null;

// Just to demonstrate the technique, we will import only paragraphs from the source document.
for(Paragraph par : src.getFirstSection().getBody().getParagraphs())
{
    // Import paragraph into the destination document.
    Paragraph dstParagraph = (Paragraph)importer.importNode(par, true);

    if(par.isListItem())
    {
        boolean isBulletedList = dstParagraph.getListFormat().getListLevel().getNumberStyle() == NumberStyle.BULLET;

        // Create a new paragraph and move all content of the imported paragraph into it.
        Paragraph tmpParagraph = new Paragraph(dst);
        for(Node child : dstParagraph.getChildNodes())
            tmpParagraph.appendChild(child);
        dstParagraph = tmpParagraph;

        if(isBulletedList)
        {
            if(bulletedList == null)
            {
                // First bulleted item: apply the default bullet list and remember it.
                dstParagraph.getListFormat().applyBulletDefault();
                bulletedList = dstParagraph.getListFormat().getList();
            }
            else
            {
                dstParagraph.getListFormat().setList(bulletedList);
            }
        }
        else
        {
            if(numberedList == null)
            {
                // First numbered item: apply the default numbered list and remember it.
                dstParagraph.getListFormat().applyNumberDefault();
                numberedList = dstParagraph.getListFormat().getList();
            }
            else
            {
                dstParagraph.getListFormat().setList(numberedList);
            }
        }
    }

    // Insert the paragraph into the destination document.
    dst.getFirstSection().getBody().appendChild(dstParagraph);
}

// Save output document
dst.save("C:\\Temp\\out.html");

Hope this helps.

Best regards.

Thanks for the code, but it still gives the same HTML output. If you save the dst document as HTML, then it does not have any <ol> and <li> elements. Just elements.

Regards,
Bandu.

Hi

Thank you for the additional information. The code works fine on my side. The output document contains a properly formatted HTML list. Please see the attached source documents and output HTML document.

Best regards.

The issues you have found earlier (filed as WORDSNET-3616) have been fixed in this .NET update and this Java update.

