Iterate over the document

Hi Team,

I want to traverse a Word document as if it were a Word template and read all nodes, formatting, and everything else from it using the Aspose.Words APIs.

I could find related documentation here. I hope the documentation is up to date.
Does it support reading images?

Thanks,
Kumar

Hi Kumar,

Thanks for your inquiry. Please use the following code example to iterate through all nodes of the document.

Document doc = new Document(MyDir + "in.docx");
// nextPreOrder returns null once every node under doc has been visited,
// so the loop also ends cleanly when the last paragraph has no children.
for (Node node = doc; node != null; node = node.nextPreOrder(doc))
{
    System.out.println(Node.nodeTypeToString(node.getNodeType()));
}
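
To illustrate the order in which a pre-order walk visits nodes, here is a minimal plain-Java sketch. SimpleNode and PreOrderSketch are hypothetical stand-ins written for this illustration, not Aspose.Words classes; the sketch only assumes that a pre-order step returns null after the last node, which is why the loop tests for null instead of comparing against a particular last node.

```java
import java.util.ArrayList;
import java.util.List;

class PreOrderSketch {
    // Builds Document > Section > Body > (Paragraph with two Runs, empty Paragraph).
    static SimpleNode sampleTree() {
        SimpleNode firstParagraph = new SimpleNode("Paragraph")
                .add(new SimpleNode("Run"))
                .add(new SimpleNode("Run"));
        SimpleNode emptyParagraph = new SimpleNode("Paragraph");
        SimpleNode body = new SimpleNode("Body").add(firstParagraph).add(emptyParagraph);
        return new SimpleNode("Document").add(new SimpleNode("Section").add(body));
    }

    // Collects node types in pre-order, starting with the root itself.
    static List<String> visit(SimpleNode root) {
        List<String> order = new ArrayList<>();
        for (SimpleNode node = root; node != null; node = node.nextPreOrder(root)) {
            order.add(node.type);
        }
        return order;
    }

    public static void main(String[] args) {
        visit(sampleTree()).forEach(System.out::println);
    }
}

// Hypothetical stand-in for a document node; this is not an Aspose.Words class.
class SimpleNode {
    final String type;
    final List<SimpleNode> children = new ArrayList<>();
    SimpleNode parent;

    SimpleNode(String type) {
        this.type = type;
    }

    // Appends a child and returns this node so calls can be chained.
    SimpleNode add(SimpleNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // Returns the next node in pre-order inside the subtree of root, or null when the walk is over.
    SimpleNode nextPreOrder(SimpleNode root) {
        if (!children.isEmpty()) {
            return children.get(0);
        }
        // No children: climb until some node below root has a next sibling.
        SimpleNode current = this;
        while (current != root && current.parent != null) {
            List<SimpleNode> siblings = current.parent.children;
            int next = siblings.indexOf(current) + 1;
            if (next < siblings.size()) {
                return siblings.get(next);
            }
            current = current.parent;
        }
        return null;
    }
}
```

The sample tree ends with an empty paragraph on purpose: a loop that waits for "the last child of the last paragraph" has nothing to wait for in that case, while the null test still terminates.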

You can read images using Aspose.Words. The Shape class represents an object in the drawing layer, such as an AutoShape, textbox, freeform, OLE object, ActiveX control, or picture. The Shape.HasImage property returns true if the shape has image bytes or links to an image.

Please note that formatting is applied on a few different levels. For example, let’s consider the formatting of simple text. Text in documents is represented by the Run element, and a Run can only be a child of a Paragraph. You can apply formatting:

  1. to Run nodes by using Character Styles, e.g. a Glyph Style,
  2. to the parent of those Run nodes, i.e. a Paragraph node (possibly via paragraph Styles),
  3. directly to Run nodes by using Run attributes (Font). In this case the Run inherits the formatting of the Paragraph Style, then the Glyph Style, and the direct formatting is applied last.
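
The three levels above can be pictured as a simple lookup order. The following is only a plain-Java sketch of that idea, not the Aspose.Words implementation: FormattingSketch and resolve are hypothetical names, and null is used here to mean "not set at this level".

```java
class FormattingSketch {
    // Returns the first value that is set (non-null), checking the most specific level first:
    // direct Run formatting, then the character style, then the paragraph style.
    static String resolve(String direct, String characterStyle, String paragraphStyle, String fallback) {
        if (direct != null) {
            return direct;
        }
        if (characterStyle != null) {
            return characterStyle;
        }
        if (paragraphStyle != null) {
            return paragraphStyle;
        }
        return fallback;
    }

    public static void main(String[] args) {
        // The font names are made-up sample values.
        System.out.println(resolve(null, "Courier New", "Arial", "Times New Roman"));
        System.out.println(resolve(null, null, "Arial", "Times New Roman"));
    }
}
```

In other words, when reading a Run you may need to look beyond its direct attributes to find where a given value actually comes from.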

Shape.Font provides access to the font formatting of this object.

Hope this answers your query. Please let us know if you have any more queries.

Thanks Tahir. I’ll get back if there are any queries around this.

Hi Tahir,

What was the reason for your suggestion to iterate over the document through doc.sections instead of DocumentBuilder?

How do we read/extract list with your above example?

Thanks,
Kumar

Hi Kumar,

Thanks for your inquiry.

kumaraswamy.m:
What was the reason for your suggestion to iterate over the document through doc.sections instead of DocumentBuilder?

Unfortunately, I have not understood your query. Could you please share some more detail about your query?

kumaraswamy.m:
How do we read/extract list with your above example?

You can check the node type of a Node using the Node.NodeType property. If the NodeType is Paragraph, please use Paragraph.ListFormat.IsListItem to check whether the paragraph is a list item.

Hi Tahir,

I meant, why can’t we use the docBuilder object to iterate over the document instead of doc.sections, as in the code below? Are there any advantages to the method you suggested?

String docPath = "c:\\template.doc";
Document doc = new Document(docPath);
DocumentBuilder docBuilder = new DocumentBuilder(doc);

Thanks,
Kumar

Hi Kumar,

Thanks for your inquiry. DocumentBuilder is a powerful class that is associated with a Document and allows dynamic document building from scratch or the addition of new elements to an existing document. It provides methods to insert text, paragraphs, lists, tables, images and other contents, specification of font, paragraph, and section formatting, and other things. Using DocumentBuilder is somewhat similar in concept to using the StringBuilder class of the .NET Framework.

However, the Document is a root node of a tree that contains all other nodes of the document. The tree is a Composite design pattern and in many ways similar to XmlDocument.

Please check the following class hierarchies for Document and DocumentBuilder. The Node class is an ancestor of Document, so you can use the Node class members on a Document object.

System.Object
Aspose.Words.Node
Aspose.Words.CompositeNode
Aspose.Words.DocumentBase
Aspose.Words.Document

---------------------------------------------------------------

System.Object
Aspose.Words.DocumentBuilder

Hope this answers your query. Please let us know if you have any more queries.

Hi Tahir,

>>> You can check the node type of a Node using Node.NodeType property. If NodeType is Paragraph, please use Paragraph.ListFormat.IsListItem to check either paragraph is list item or not.
Thanks. It helped. However, how can I find out
- whether a list style is a bullet or a number
- the bullet style

Thanks,
Kumar

I can use listLevel.getNumberStyle() (NumberStyle.BULLET) to determine if it is a bullet. However, how can I get the bullet style? Using listLevel.getNumberFormat() is not helping.

Hi Kumar,

Thanks for your inquiry. Please use the List.Style property to get the list style that this list references or defines.

If you want to get the value of ListTemplate, e.g. NumberUppercaseLetterDot, NumberLowercaseLetterDot etc., unfortunately this feature is not available at the moment.

Could you please share some detail about your requirements along with example Word document? We will then provide you more information about your query.
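
Regarding the earlier remark that listLevel.getNumberFormat() "is not helping": one possible reason, stated here as an assumption rather than a confirmed diagnosis, is that the bullet is stored as a single character from a symbol font, which a console may print as "?". A small plain-Java sketch for inspecting the code points of such a string follows. BulletCodePointSketch and describe are hypothetical names, and "\uF0B7" is only a sample value.

```java
import java.util.stream.Collectors;

class BulletCodePointSketch {
    // Formats every code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Sample value only; in practice the string would come from the list level being inspected.
        System.out.println(describe("\uF0B7"));
    }
}
```

The code point only identifies a glyph together with the font it is drawn in, so the font of the list level would need to be recorded alongside it.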

Hi Tahir,

>>> If you want to get the value of ListTemplate e.g NumberUppercaseLetterDot, NumberLowercaseLetterDot etc, unfortunately this feature is not available at the moment.
Yes. Additionally, I need to know whether the bullet is a tick symbol, circle, or square. If these are not all available, could you create an enhancement request for Aspose.Words?

>>> Could you please share some detail about your requirements along with example Word document? We will then provide you more information about your query.
In general, I want to read all the elements, formatting from a Word document template and convert it to our internal language so that one can build on top of it instead of creating something from scratch.
The main technical need is to be able to read everything within a document using Aspose.Words API.

Thanks,
Kumar

Hi Kumar,

Thanks for sharing the detail.

We have already logged this feature request as WORDSNET-7562 in our issue tracking system. You will be notified via this forum thread once this feature is available.

We apologize for the inconvenience.

Hi Tahir,

Couple of queries.

Extract merge details.
I created a Word document manually and merged two cells horizontally into a single cell. However, I cannot extract the merge information using the Aspose API. See the attached table.dot document.

Soft enter issue
If I use a soft enter (Shift+Enter), how do I read it? When I read this node and print it, the text is printed as Test?best. See the attached document.

Run run = (Run) node;
run.getText()

Thanks,
Kumar

Hi Kumar,

Thanks for your inquiry.
kumaraswamy.m:
Extract merge details.
I created a Word document manually and merged two cells horizontally into a single cell. However, I cannot extract the merge information using the Aspose API. See the attached table.dot document.

The fact is that, by Microsoft Word design, rows in a table in a Microsoft Word document are completely independent. It means each row can have any number of cells of any width. So if you imagine a first row with one wide cell and a second row with two narrow cells, then looking at this document the cell in the first row will appear horizontally merged. But it is not a merged cell; it is just a single wide cell.

Please see the following article for reference:
Working with Merged Cells

kumaraswamy.m:
Soft enter issue
If I use a soft enter (Shift+Enter), how do I read it? When I read this node and print it, the text is printed as Test?best. See the attached document.

Run run = (Run) node;
run.getText()

You can check whether a Run node contains a line break using the following code snippet. Hope this helps you.

NodeCollection runs = doc.getChildNodes(NodeType.RUN, true);
for (Run run : (Iterable<Run>) runs)
{
    // Prints true when the run text contains a soft enter (Shift+Enter).
    System.out.println(run.getText().contains(ControlChar.LINE_BREAK));
    System.out.println(run.getText());
}
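
As a follow-up sketch in plain Java: ControlChar.LINE_BREAK is, to my knowledge, the vertical tab character ("\u000b"), which many consoles cannot display; that would explain the "Test?best" output. The class below mirrors that value in its own constant (an assumption of this sketch, so it runs without the library) and shows how the text could be split or made printable. LineBreakSketch and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

class LineBreakSketch {
    // Assumed to match ControlChar.LINE_BREAK: the vertical tab character.
    static final String LINE_BREAK = "\u000b";

    // True when the run text contains at least one soft enter.
    static boolean hasSoftBreak(String runText) {
        return runText.contains(LINE_BREAK);
    }

    // Splits the run text into the pieces separated by soft enters, keeping empty pieces.
    static String[] splitAtSoftBreaks(String runText) {
        return runText.split(Pattern.quote(LINE_BREAK), -1);
    }

    // Replaces each soft enter with a newline so the text prints readably.
    static String toPrintable(String runText) {
        return runText.replace(LINE_BREAK, "\n");
    }

    public static void main(String[] args) {
        String runText = "Test" + LINE_BREAK + "best";
        System.out.println(hasSoftBreak(runText));
        System.out.println(toPrintable(runText));
    }
}
```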

Hi Tahir,

Regarding merge, does it mean that the merge information cannot be extracted from a manually created MS Word document with a horizontal merge? The documentation is not very clear…

I’ll try regarding shift enter.

Thanks,
Kumar

Hi Kumar,

Thanks for your inquiry. We would like to clarify a bit regarding horizontally merged cells. If you imagine a first row with one wide cell and a second row with two narrow cells, then looking at this document the cell in the first row will appear horizontally merged. But it is not a merged cell; it is just a single wide cell.

Another perfectly valid scenario is when the first row has two cells, where the first cell has CellMerge.First and the second cell has CellMerge.Previous. In this case it is a merged cell. In both cases, the visual appearance in MS Word is exactly the same. Both cases are valid.

Here is simple code that demonstrates both cases.

// Create empty document and DocumentBuilder object.
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
// Configure DocumentBuilder
builder.getCellFormat().getBorders().setLineStyle(LineStyle.SINGLE);
builder.getCellFormat().getBorders().setColor(Color.BLACK);
// Build a table with a simple wide cell.
// The first row contains a single wide cell, which looks merged in MS Word.
builder.insertCell();
builder.getCellFormat().setWidth(200);
builder.write("This is simply wide cell");
builder.endRow();
// Insert the second row
builder.insertCell();
builder.getCellFormat().setWidth(100);
builder.insertCell();
builder.getCellFormat().setWidth(100);
builder.endRow();
builder.endTable();
// Insert a few paragraphs between the tables.
builder.writeln();
builder.writeln();
// Build a table with merged cells.
// The first row contains two cells merged horizontally.
builder.insertCell();
builder.getCellFormat().setWidth(100);
builder.getCellFormat().setHorizontalMerge(CellMerge.FIRST);
builder.write("This is merged cells");
builder.insertCell();
builder.getCellFormat().setWidth(100);
builder.getCellFormat().setHorizontalMerge(CellMerge.PREVIOUS);
builder.endRow();
// Insert the second row
builder.insertCell();
builder.getCellFormat().setWidth(100);
builder.getCellFormat().setHorizontalMerge(CellMerge.NONE);
builder.insertCell();
builder.getCellFormat().setWidth(100);
builder.getCellFormat().setHorizontalMerge(CellMerge.NONE);
builder.endRow();
builder.endTable();
// Save output document
doc.save("C:\\Temp\\out.doc");

Hi Tahir,

Could you please answer the question I asked?
I created a table in MS Word manually. I merged two cells into a single cell. Can I read such merge information from Aspose Java API on the same document? Yes / No?

What I understand is that, if I use Aspose API to build such a table with merged cell, I’ll be able to read back the merge information. Is that correct?

Thanks,
Kumar

Hi Kumar,

Thanks for your inquiry. The answer to both of your questions is “no”. The fact is that, by Microsoft Word design, rows in a table in a Microsoft Word document are completely independent. It means each row can have any number of cells of any width. Please let us know if you have any more queries.
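
Because rows are independent, one way to approximate which cells look merged is to compare cell edges across rows. The following is a plain-Java sketch of that idea only; it is not an Aspose.Words API. CellSpanSketch and inferColumnSpans are hypothetical names, and widths are assumed to be positive integers in some common unit (real widths are fractional and would likely need a tolerance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class CellSpanSketch {
    // rows[r][c] is the width of cell c in row r. Returns, per row, how many grid columns each cell covers.
    static List<int[]> inferColumnSpans(int[][] rows) {
        // Every right edge that occurs in any row becomes a boundary of the shared column grid.
        TreeSet<Integer> edges = new TreeSet<>();
        for (int[] row : rows) {
            int x = 0;
            for (int width : row) {
                x += width;
                edges.add(x);
            }
        }
        List<int[]> spans = new ArrayList<>();
        for (int[] row : rows) {
            int[] rowSpans = new int[row.length];
            int left = 0;
            for (int c = 0; c < row.length; c++) {
                int right = left + row[c];
                // Count the grid boundaries in (left, right]: that is the number of columns covered.
                rowSpans[c] = edges.subSet(left, false, right, true).size();
                left = right;
            }
            spans.add(rowSpans);
        }
        return spans;
    }

    public static void main(String[] args) {
        // One wide cell above two narrow cells, as in the scenario described in this thread.
        int[][] rows = { {200}, {100, 100} };
        for (int[] rowSpans : inferColumnSpans(rows)) {
            System.out.println(java.util.Arrays.toString(rowSpans));
        }
    }
}
```

A span greater than 1 marks a cell that visually covers several columns, whether or not the file stores it as a merged cell.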

Hi Tahir,

If I’ve a ToC field and then a text (say “Introduction” with Heading 1), I noticed that a hidden bookmark _ToCxxxxxx is created. Is there a way to identify that such a bookmark is created for ToC purpose?

Currently, I check whether the bookmark name starts with _ToC. But is there a better way to identify it, such as an API to check whether the bookmark is hidden, whether it was auto-created for ToC purposes, etc.?

Thanks,
Kumar

Hi Kumar,

Thanks for your inquiry. The name of a hidden bookmark begins with an underscore character (_). There is no API to check whether a bookmark is hidden. However, you can check it by iterating through the bookmarks and testing whether the bookmark name starts with _Toc. Hope this helps you. Please let us know if you have any more queries.
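
The name check described above could be sketched in plain Java as follows. BookmarkNameSketch and its methods are hypothetical names, the bookmark names are passed in as plain strings, and the pattern "_Toc" followed by digits is an assumption about how the auto-generated names look, so please verify it against your own documents.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class BookmarkNameSketch {
    // Assumption: auto-generated ToC bookmarks are named "_Toc" followed by digits.
    private static final Pattern TOC_NAME = Pattern.compile("_Toc\\d+");

    // Hidden bookmarks have names that begin with an underscore.
    static boolean isHidden(String name) {
        return name.startsWith("_");
    }

    static boolean looksLikeTocBookmark(String name) {
        return TOC_NAME.matcher(name).matches();
    }

    // Keeps only the names that look like ToC bookmarks.
    static List<String> tocBookmarks(List<String> names) {
        return names.stream()
                .filter(BookmarkNameSketch::looksLikeTocBookmark)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample names.
        List<String> names = java.util.Arrays.asList("_Toc256000001", "Introduction", "_GoBack");
        System.out.println(tocBookmarks(names));
    }
}
```

Matching the whole name is slightly stricter than a plain startsWith check, since it skips user bookmarks that merely begin with the same prefix.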