Extract Text with Formatting

SQEAS · March 30, 2010, 7:16am

Latest shipments:

    Item 8,9,10 complete – 23-Mar-2010

Next forecast shipments:

    Items 1 & 2 complete – 12-Apr-2010

    Items 5 Thermal couple - 15-Apr-2010

I’m trying to extract the text – and formatting – shown above.

Using the code below, I’m able to get the carriage return / line feed at the end of each line, but the indentations (tabs) are lost. In other words, I’m getting: “Latest shipments:\r\nItem 8,9,10 complete – 23-Mar-2010\r\nNext forecast shipments:\r\nItems 1 & 2 complete – 12-Apr-2010\r\nItems 5 Thermal couple - 15-Apr-2010\r\n\r\n”.

But I want something like: “Latest shipments:\r\n\tItem 8,9,10 complete – 23-Mar-2010\r\nNext forecast shipments:\r\n\tItems 1 & 2 complete – 12-Apr-2010\r\n\tItems 5 Thermal couple - 15-Apr-2010\r\n\r\n”.

// Move DocumentBuilder cursor to the bookmark
DocumentBuilder builder = new DocumentBuilder(doc);
bool foundBookmark = builder.MoveToBookmark(bookmark, true, false);
if (foundBookmark)
{
    Paragraph para = builder.CurrentParagraph;
    // Find the table cell that contains the bookmarked paragraph
    Cell cell = para.GetAncestor(NodeType.Cell) as Cell;
    if (cell != null)
    {
        // Plain-text export: line breaks survive, paragraph indents do not
        text = cell.ToTxt();
    }
}

Thanks.

This message was posted using Page2Forum from Aspose.Words for .NET and Java

alexey.noskov · March 30, 2010, 8:04am

Hi

Thanks for your request. Could you please attach your document here for testing? It may be that your document contains no tabs at all and the indentation comes from a left indent set on the paragraphs.

Best regards.

SQEAS · March 31, 2010, 9:26am

You’re right; it looks like it’s an indented paragraph (no tabs).

Ultimately, I’m trying to maintain the formatting and “look and feel” in the Shipping Summary. It might have indent/tabs, cr/lf, bulleted lists, etc.

Other than cell.ToTxt(), is there some other method that will maintain the formatting? Maybe something like cell.AsHtml()?

Thanks.
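
If plain text with tabs is all that is needed, one option is to turn each paragraph's left indent into leading tab characters. The sketch below is only an illustration and not part of Aspose.Words: the class `IndentText`, its methods, and the ratio of one tab per 36 points (Word's default 0.5 inch tab stop) are assumptions made for the example. It works on plain (text, left indent in points) pairs so that it runs without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not an Aspose.Words API.
public static class IndentText
{
    // Assumption: one tab stands for 36 points (0.5 inch) of left indent.
    public const double PointsPerTab = 36.0;

    // Converts a left indent in points to a number of leading tabs.
    public static int TabsForIndent(double leftIndentPoints)
    {
        if (leftIndentPoints <= 0)
            return 0;
        return (int)Math.Round(leftIndentPoints / PointsPerTab, MidpointRounding.AwayFromZero);
    }

    // Emits each paragraph as leading tabs + text + "\r\n".
    // Key = paragraph text, Value = left indent in points.
    public static string Build(IEnumerable<KeyValuePair<string, double>> paragraphs)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<string, double> p in paragraphs)
        {
            sb.Append('\t', TabsForIndent(p.Value));
            sb.Append(p.Key);
            sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

With Aspose.Words, the pairs could be collected from `cell.Paragraphs`, taking each paragraph's trimmed text and its `ParagraphFormat.LeftIndent`. Bulleted lists and other formatting are not covered by this approach.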

alexey.noskov · March 31, 2010, 10:43am

Hi

Thank you for the additional information. Unfortunately, there is no direct way to get the HTML of a particular node. However, you can work around this: copy the content of the table cell into a separate, empty document and then convert that document to HTML. For example, see the following code:

// Open the document.
Document doc = new Document(@"Test001\in.doc");
// Get the table cell from which we should extract HTML-formatted text.
Cell cell = doc.FirstSection.Body.Tables[0].Rows[1].FirstCell;
// Create an empty intermediate document.
Document temp = new Document();
// Copy the content of the table cell into the intermediate document.
foreach (Node node in cell.ChildNodes)
    temp.FirstSection.Body.AppendChild(temp.ImportNode(node, true, ImportFormatMode.KeepSourceFormatting));
// Get the HTML string that represents the temporary document.
string html = ConvertDocumentToHtml(temp);

================================================================

// Requires: using System.IO; using System.Text; using Aspose.Words;
public string ConvertDocumentToHtml(Document doc)
{
    string html = string.Empty;
    // Save the document to a MemoryStream in HTML format.
    using (MemoryStream htmlStream = new MemoryStream())
    {
        doc.Save(htmlStream, SaveFormat.Html);
        // Get the HTML string.
        html = Encoding.UTF8.GetString(htmlStream.GetBuffer(), 0, (int)htmlStream.Length);
    }
    // There could be a BOM at the beginning of the string.
    // Remove everything before the first '<' (the length check guards against an empty string).
    while (html.Length > 0 && html[0] != '<')
        html = html.Substring(1);
    return html;
}

Hope this helps.

Best regards.
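
As an alternative to trimming characters until the first `<`, the BOM can be left to `StreamReader`, which skips a leading UTF-8 byte order mark when decoding. The sketch below uses only the .NET base class library; the class name `HtmlDecoding` and the method `ReadAllText` are hypothetical names chosen for this example.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not an Aspose.Words API.
public static class HtmlDecoding
{
    // Decodes the whole stream as UTF-8 text.
    // StreamReader consumes a leading BOM, so it does not appear in the result.
    public static string ReadAllText(MemoryStream stream)
    {
        // After a Save call the position is at the end, so rewind first.
        stream.Position = 0;
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Inside `ConvertDocumentToHtml`, this would be called as `html = HtmlDecoding.ReadAllText(htmlStream);` right after `doc.Save(...)`. Note that disposing the reader also closes the stream.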