Read Formatted Text

I am attempting to retrieve formatted text from a Word document. I have navigated to the cells in my table and retrieved text via the cell.getText() method; however, I do not see any formatting characters. I would like to retrieve the text with all available formatting, as it will be saved as Rich Text in my database.
How do I retrieve Rich Text from my cells?
Vincent Serpico
Senior Software Engineer
Interactive Alchemy

The ability to get document node content in different formats, including RTF, is not available yet. We have logged it in our defect database as a feature request (issue #1165). We will try to add this functionality in one of the next versions. I will let you know when this happens by posting here in this thread.
Best regards,

This is extremely important to us and to a multi-license deal. How can I get formatted Word text? I NEED the original formatting data, including bold, underlines, etc.

In fact, we are working on supporting this functionality right now, so it’s great news for us that this feature is actually required. If all goes well, we have a good chance of implementing it by the end of September or even earlier.
Best regards,

“Formatted text” is too broad a term. If you want it in RTF format, that’s fine; Vladimir is correct, it will soon be available. But maybe you can use the HTML export that we already have? You cannot directly convert a document node into HTML, but you can either delete everything from the document except that node, or copy that node into an empty document, then save to a stream as HTML, and you will get “formatted text”.
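The copy-into-an-empty-document route can be sketched roughly as follows. This is only an illustration under assumptions: NodeToHtml is a hypothetical helper name, SaveFormat.FormatHtml is assumed to follow the same naming as the other save formats in this release (newer releases may name it differently), and the HTML is assumed to be written as UTF-8.

```csharp
using System.IO;
using System.Text;
using Aspose.Words;

public static class FragmentExport
{
    // Hypothetical helper: returns HTML for one block-level node
    // (for example a Table or a Paragraph) taken from another document.
    public static string NodeToHtml(Node srcNode)
    {
        // A new Document already contains one section with a body.
        Document dstDoc = new Document();

        // ImportNode creates a copy owned by dstDoc; the source document is unchanged.
        Node imported = dstDoc.ImportNode(srcNode, true);
        dstDoc.Sections[0].Body.AppendChild(imported);

        using (MemoryStream stream = new MemoryStream())
        {
            // Assumption: the HTML member of SaveFormat is called FormatHtml here.
            dstDoc.Save(stream, SaveFormat.FormatHtml);

            // Assumption: the exported HTML is UTF-8 encoded.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

Images in the fragment, if any, are not handled by this sketch.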

Thanks, Roman. Two points. First, the end of September may work, if that is your deadline. Second, my goal is to copy the text from the cells of a Word document’s table and paste it into a Rich Text Box in my application with all the formatting intact: bold, bullets, new lines, page breaks, underlines… all formatting exactly as it appeared in the original Word document.

For the next release, individual nodes can be saved to plain text format only. There are several design questions we are still pondering before we can allow export of individual nodes to other formats.
But you will be able to do what you want this way:

  1. Create a new empty document.
  2. Import the node (or nodes) that you are interested in from your original document into the new document. This is as easy as calling Document.ImportNode and appending the result to the new document.
  3. Save the new document in whatever format you want. You have just exported a fragment of your original document.

Gentlemen…

Has this issue been resolved?

Can we now read formatted text from a Word document and populate the formatted text into a Windows Forms Rich Text Box?

Vincent Serpico

You need to download Aspose.Words 4.0 and use the export to RTF. Then you can open that RTF in your rich text box. You can save into a file or into a stream. At the moment you have to save the whole document; later we will add the ability to save only fragments to RTF.

Thanks, Guys. Can I get a code snippet demo to show me how to read and export RTF from a Word Doc?

To save the complete document to RTF:

Document doc = new Document("MyFile.doc");
doc.Save("MyFile.rtf", SaveFormat.FormatRtf);
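Since saving into a stream was mentioned as an option, here is a minimal sketch of that variant, which avoids a temporary file when the target is a Windows Forms RichTextBox. The class and method names are hypothetical; it assumes the Save overload that takes a stream accepts the same SaveFormat.FormatRtf value as the file overload.

```csharp
using System.IO;
using System.Windows.Forms;
using Aspose.Words;

public static class RtfLoader
{
    // Hypothetical helper: converts a Word file to RTF in memory
    // and loads the result into the given RichTextBox.
    public static void LoadIntoRichTextBox(string docPath, RichTextBox box)
    {
        Document doc = new Document(docPath);

        using (MemoryStream stream = new MemoryStream())
        {
            // Assumption: the stream overload takes the same format value as the file overload.
            doc.Save(stream, SaveFormat.FormatRtf);

            // Rewind so the control reads from the start of the RTF data.
            stream.Position = 0;
            box.LoadFile(stream, RichTextBoxStreamType.RichText);
        }
    }
}
```

A call such as RtfLoader.LoadIntoRichTextBox("MyFile.doc", richTextBox) would then show the whole document in the control.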

Saving only a fragment of a document to RTF is a bit tricky at the moment. You need to either delete all nodes from the document except the ones you want to save and then save it, or copy the nodes you want into a blank document and save that. Either way, it requires dealing directly with the nodes of the document. For example, if you want to save just a single table to RTF:

Document srcDoc = new Document("MyFile.doc");
// Let's say the table we want to save is the first table in the document, get it.
Table srcTable = (Table)srcDoc.GetChild(NodeType.Table, 0, true);
Document dstDoc = new Document();
// This creates a copy of the table ready for adding to the new document.
Table dstTable = (Table)dstDoc.ImportNode(srcTable, true);
// Add the table to the main text of the first section.
dstDoc.Sections[0].Body.AppendChild(dstTable);
// Now dstDoc contains only the table, export to RTF.
dstDoc.Save("MyTable.rtf", SaveFormat.FormatRtf);

This feature seems to work pretty well, except for hard breaks and returns, which are not read. When do you think you’ll have this important issue addressed?

Hi,
Do you mean this problem occurs when you export a fragment of the document using the code snippet provided above, or when you just save the entire document? Any type of break should be supported. Also, please attach the document you are working with so that I can reproduce the issue.

It happens using the code snippet above. We modified the code snippet a bit to:

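private string getRTF(Cell srcCell)
{
    Document dstDoc = new Document();

    // Copy the source cell so that it belongs to the new document.
    Cell dstCell = (Cell)dstDoc.ImportNode(srcCell, true);

    // A cell cannot be added to the body directly, so wrap it in a one-row table.
    Table table = new Table(dstDoc);
    table.Rows.Add(new Row(dstDoc));
    table.Rows[0].Cells.Add(dstCell);

    dstDoc.Sections[0].Body.AppendChild(table);

    // Export the one-cell document to RTF.
    dstDoc.Save("report.rtf", SaveFormat.FormatRtf);

    // Load the RTF file into the control and return the control's RTF.
    this.richTextBox.LoadFile("report.rtf", RichTextBoxStreamType.RichText);
    return this.richTextBox.Rtf;
}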

When I save your document to RTF using Aspose.Words, it looks like a twin of the original. Everything is preserved. Did you mean that the breaks are missing when viewed in RichTextBox? If so, this is not our issue but a limitation of the RichTextBox control. View the RTF in Microsoft Word and you will see that everything is fine.
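If the RTF is ultimately destined for a database, one possible way to sidestep the control is to keep the bytes that Aspose.Words writes rather than reading them back from the RichTextBox, since the control's Rtf property likely reflects only what the control itself supports. A rough sketch, with hypothetical class and method names and the same SaveFormat.FormatRtf assumption as in the snippets above:

```csharp
using System.IO;
using Aspose.Words;

public static class RtfBytes
{
    // Hypothetical helper: returns the RTF exactly as written by Aspose.Words,
    // without passing it through a RichTextBox.
    public static byte[] DocumentToRtfBytes(Document doc)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            doc.Save(stream, SaveFormat.FormatRtf);

            // ToArray copies only the bytes written so far.
            return stream.ToArray();
        }
    }
}
```

The bytes can be stored as they are; how to decode them into a string for a text column is left open here, because the encoding of the output is not stated in this thread.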