Aspose.Words.Tables.Table ToString method with SaveFormat.Html leads to null reference exception

pavel.pavlov · February 13, 2014, 7:11am

Hi,

I'm parsing a Word document based on its TOC and retrieving all data from certain chapters. As a result I get a list of all nodes inside a given chapter. I then loop through this list and call ToString(SaveFormat.Html) for each node. This works fine for simple paragraphs (which are almost 95% of the document), but when the node is a table I get a null reference exception. Calling ToString(SaveFormat.Text) works fine, but I would like to retrieve HTML markup with all styles, not plain text. When I call ToString(SaveFormat.Html) on the whole document it works fine and returns HTML markup, including markup for tables.

Will appreciate any help. Thanks.

awais.hafeez · February 14, 2014, 3:42am

Hi Pavel,

Thanks for your inquiry.

After an initial test with Aspose.Words 14.1.0, I was unable to reproduce this exception on my side. I suggest you upgrade to the latest version of Aspose.Words, which you can download from the following link. I hope this helps:
https://releases.aspose.com/words/net

Moreover, I used the following simple code on my side for testing:

Document doc = new Aspose.Words.Document(@"C:\Temp\test.docx");
Table tab = doc.FirstSection.Body.Tables[0];
string html = tab.ToString(SaveFormat.Html);

If the problem persists, please create a standalone runnable console application that reproduces it and attach it here for testing. We will investigate the issue further and provide you with more information.

Best regards,

pavel.pavlov · February 14, 2014, 5:20am

Hi Awais,

Thanks for your reply. I'm using the latest version of Aspose.Words.
I believe the cause of this exception is in the code that extracts data based on the document's TOC. It is a slightly modified version of the code taken from here.
In the attachment you can find a small app that reproduces this issue.

Thanks,
Pavel Pavlov

tahir.manzoor · February 17, 2014, 4:16am

Hi Pavel,

Thanks for sharing the details. We are working on your query and will get back to you soon.

tahir.manzoor · February 17, 2014, 5:46am

Hi Pavel,

Thanks for your patience.

I have tested the scenario and managed to reproduce the same issue on my side using the following code example.

var doc = new Document(MyDir + "test.docx");
var docBuilder = new DocumentBuilder(doc);
docBuilder.MoveToDocumentEnd();
var dummyEndDocNode = docBuilder.InsertParagraph();
docBuilder.Write("ENDOFDOC");
ArrayList extractedNodes = ExtractContent(doc.Range.Bookmarks["_Toc380074442"].BookmarkStart, doc.FirstSection.Body.LastParagraph, false);
Table table = (Table) extractedNodes[1];
string html = table.ToString(SaveFormat.Html);

For the sake of correction, I have logged this problem in our issue tracking system as WORDSNET-9691. I have linked this forum thread to the same issue and you will be notified via this forum thread once this issue is resolved.

We apologize for the inconvenience.

pavel.pavlov · February 19, 2014, 6:30am

Thanks. I'll wait for this bug fix.
I also have two more questions.

1. Using the ExtractContent method we get all content between two specified nodes. It uses the paragraph.Clone() method. The problem is that after we clone a paragraph, the ListLabel values are reset in the cloned object.

For example, if a paragraph is in 4th position in an ordered list, it has LabelString = ‘4.’ and LabelValue = 4. After we clone the paragraph, the clone has “” and 0 respectively. There is no way to set those values to match the original because those properties do not have setters. Is there any way to keep this info in the cloned object? Later, when I call ToString(SaveFormat.Html) for each list member, I get a list with zeros, but I want 1, 2, 3, 4, etc.

2. If a paragraph contains an image, how can I get an HTML string with a base64 image from it? I tried something like

var saveOptions = new HtmlSaveOptions(SaveFormat.Html)
{
    ExportImagesAsBase64 = true
};
paragraph.ToString(saveOptions);

but got only an empty tag.

pavel.pavlov · February 20, 2014, 2:07am

Well, it seems that I managed to solve the second issue. Before calling the ExtractContent method, I get all nodes of type DrawingML and call ToString(SaveFormat.Html) on each of them. The node type changes to Shape, and the nodes then pass through the Clone method successfully, keeping all original image data.

But the first issue is still open. Is there any workaround to keep the original ListLabel values in a cloned paragraph?

Thanks

tahir.manzoor · February 20, 2014, 4:07am

Hi Pavel,

Thanks for your inquiry.

pavel.pavlov:
1. Using the ExtractContent method we get all content between two specified nodes. It uses the paragraph.Clone() method. The problem is that after we clone a paragraph, the ListLabel values are reset in the cloned object.
For example, if a paragraph is in 4th position in an ordered list, it has LabelString = ‘4.’ and LabelValue = 4. After we clone the paragraph, the clone has “” and 0 respectively. There is no way to set those values to match the original because those properties do not have setters. Is there any way to keep this info in the cloned object? Later, when I call ToString(SaveFormat.Html) for each list member, I get a list with zeros, but I want 1, 2, 3, 4, etc.

It would be great if you could share the following details for investigation purposes.

Please attach your input Word document.
Please create a simple standalone, runnable application (for example, a Console Application project) that demonstrates the Aspose.Words code you used to generate your output document.
Please attach the output Word file that shows the undesired behavior.
Please attach your target Word document showing the desired behavior. You can use Microsoft Word to create your target Word document. I will then investigate how you expect your final document to be generated.

Unfortunately, it is difficult to say what the problem is without the Document(s) and simplified application. We need your Document(s) and simple project to reproduce the problem. As soon as you get these pieces of information to us we’ll start our investigation into your issue.

pavel.pavlov:
2. In case paragraph contains an image inside how can I get from it an html string with base64 image? I tried something like

It is nice to hear that you have solved this issue. Yes, you can export the content of a Paragraph to HTML by using the Node.ToString(SaveOptions) method.

Document doc = new Document(MyDir + "in.docx");
// Extract the last paragraph in the document to convert to HTML.
Node node = doc.LastSection.Body.LastParagraph;
var saveOptions = new HtmlSaveOptions(SaveFormat.Html)
{
    ExportImagesAsBase64 = true
};
string nodeAsHtml = node.ToString(saveOptions);

pavel.pavlov · February 20, 2014, 5:29am

I've attached the sample console app and Word file.
In the undesired behaviour you can see that all nodes have ‘0’ as the list number.
The expected behaviour is to get the same 1, 2, 3 list as in the doc.

tahir.manzoor · February 20, 2014, 11:45am

Hi Pavel,

Thanks for your inquiry. In your case, I suggest you use the following code snippet. I hope this helps. Please let us know if you have any more queries.

var doc = new Document(@"C:\temp\Test.docx");
// Update the list labels before reading or extracting list paragraphs.
doc.UpdateListLabels();
var paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
Console.WriteLine("Desired output:");
Console.WriteLine("");
// Print the HTML of every list item in the original document.
foreach (Paragraph paragraph in paragraphs)
{
    if (paragraph.IsListItem)
    {
        string html = paragraph.ToString(SaveFormat.Html);
        Console.WriteLine(html);
    }
}
Console.WriteLine("");
Console.WriteLine("Undesired behaviour:");
Console.WriteLine("");
// Extract the content, build a new document from it, and print its list items.
ArrayList extractedNodes = ExtractContent(doc.FirstSection.Body.FirstParagraph, doc.FirstSection.Body.LastParagraph, true);
Document doc2 = GenerateDocument(doc, extractedNodes);
var paragraphs2 = doc2.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph paragraph in paragraphs2)
{
    if (paragraph.IsListItem)
    {
        string html = paragraph.ToString(SaveFormat.Html);
        Console.WriteLine(html);
    }
}

pavel.pavlov · February 21, 2014, 1:17am

I think the implementation of GenerateDocument(doc, extractedNodes) is missing.

awais.hafeez · February 21, 2014, 3:54am

Hi Pavel,

Thanks for your inquiry. You can find the implementation of GenerateDocument function in the following article:
https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

Best regards,

tahir.manzoor · February 21, 2014, 5:50am

Hi Pavel,

Please accept my apologies for the inconvenience. Please check the following GenerateDocument method and let us know if you have any more queries.

public static Document GenerateDocument(Document srcDoc, ArrayList nodes)
{
    // Create a blank document.
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.FirstSection.Body.RemoveAllChildren();
    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);
    foreach(Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }
    // Return the generated document.
    return dstDoc;
}

pavel.pavlov · February 21, 2014, 6:13am

Thanks, exactly what I need.

I have one more question, though:
Is it possible to save a Paragraph as HTML without any font settings?

For example, when I call ToString(SaveFormat.Html) on a single paragraph from the doc I get

<span style="font-family:Calibri; font-size:11pt">Test text</span>

and desired is

<span>Test text</span>

I only need to exclude the font-related styles (font name, font size, font color).

Or is the only way to post-process the resulting HTML with a regexp and strip these styles?

tahir.manzoor · February 24, 2014, 12:36am

Hi Pavel,

Thanks for your inquiry. The Node.ToString(SaveFormat.Html) method exports the content of the node to HTML format. Alternatively, you can use SaveFormat.Text to export the content of the node as plain text. Aspose.Words does not offer any API to exclude font-related attributes.
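
For reference, here is a minimal sketch of the regexp post-processing route raised in the question above. It is not part of Aspose.Words: the class name FontStyleStripper and the method StripFontStyles are hypothetical, and it assumes the exported HTML uses double-quoted style attributes with simple "name:value" declarations, as in the sample span shown earlier. Declarations whose values contain a semicolon (for example a data: URI) would need a real CSS parser instead.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not an Aspose.Words API): removes font-related
// declarations from inline style attributes in an HTML string.
public static class FontStyleStripper
{
    // Matches a double-quoted style="..." attribute plus its leading whitespace.
    private static readonly Regex StyleAttribute =
        new Regex("\\s+style=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Assumption: these three CSS properties are the "font-related" ones to drop.
    private static readonly string[] FontProperties = { "font-family", "font-size", "color" };

    public static string StripFontStyles(string html)
    {
        return StyleAttribute.Replace(html, match =>
        {
            // Split the attribute value into declarations and keep the non-font ones.
            string[] kept = match.Groups[1].Value
                .Split(';')
                .Select(declaration => declaration.Trim())
                .Where(declaration => declaration.Length > 0 && !IsFontProperty(declaration))
                .ToArray();

            // Drop the whole attribute when nothing is left in it.
            return kept.Length == 0 ? "" : " style=\"" + string.Join("; ", kept) + "\"";
        });
    }

    private static bool IsFontProperty(string declaration)
    {
        int colon = declaration.IndexOf(':');
        string name = (colon < 0 ? declaration : declaration.Substring(0, colon)).Trim();
        return FontProperties.Contains(name, StringComparer.OrdinalIgnoreCase);
    }
}
```

With this sketch, passing the sample span from the question to FontStyleStripper.StripFontStyles yields <span>Test text</span>, while unrelated declarations such as margin are left in place.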

aspose.notifier · April 7, 2014, 10:15pm

The issues you have found earlier (filed as WORDSNET-9691) have been fixed in this .NET update and this Java update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.