Hi, I’m using code based on your EnumerateLayoutElements sample to extract text from a Word document.
When using the LayoutEnumerator class, I’m having trouble extracting the text from a text box on the page. When the iterator reaches the text box object (which does contain text), its LayoutEntityType is “Span” and the value of Kind is “SHAPE”, but the value of Text is null.
Is there any way to extract the text from a text box when using the LayoutEnumerator? Or is there a way to get from the current value of the iterator to the corresponding paragraph node, so that I can extract the text from that node?
You should be able to recreate this problem by simply trying to process a document with a text box using your EnumerateLayoutElements sample.
Thanks
Tim
Hi Tim,
Thanks for your inquiry. You can extract the text from a TextBox using the following code example. I am looking into the issue with EnumerateLayoutElements and will update you as soon as possible.
Document doc = new Aspose.Words.Document(MyDir + "Test01.docx");
foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
{
    // Skip shapes that have no TextBox.
    if (shape.TextBox != null)
    {
        // Print the shape's content as plain text.
        Console.WriteLine(shape.ToString(SaveFormat.Text));
        // The paragraph that contains the shape.
        Paragraph para = shape.ParentParagraph;
    }
}
Hi Tim,
TCowell:
When using the LayoutEnumerator class, I’m having trouble extracting the text from a text box on the page. When the iterator reaches the text box object (which does contain text), its LayoutEntityType is “Span” and the value of Kind is “SHAPE”, but the value of Text is null.
I tested this scenario with the latest version of Aspose.Words for .NET (14.1.0) and could not reproduce the issue. Please try again with version 14.1.0.
To extract the text from a text box, please use the Node.ToString method with SaveFormat.Text, which returns the text of a node.
Thanks Tahir,
That code would be useful, but currently we’re using the LayoutEnumerator to get all the text for a particular page, in the order it is displayed on the page. I’d be interested to know whether there is a way to map from the current value of the LayoutEnumerator to a node.
I’ve tried using the ‘Current’ value of the enumerator but this does not match to any shapes. The following code is an example of what I’m trying…
if ((layoutEnumerator.Type == LayoutEntityType.Span) && (layoutEnumerator.Kind == "SHAPE"))
{
    foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
    {
        Object nodeObj = layoutCollector.GetEntity(shape);
        if (nodeObj != null)
        {
            // Compare the shape's layout entity with the enumerator position.
            if (nodeObj == layoutEnumerator.Current)
            {
                // Get node text…
            }
        }
    }
}
But this code never finds a matching node. Is my assumption about ‘Current’ wrong? Is there some other way I can use it to map to a node?
Thanks
Tim
I am using 14.1.0 already. Do you mean you were able to see the extracted text from a text box using the sample code? Could you share the document you used, so I can try it myself?
Thanks
Tim
Hi Tim,
Thanks for your inquiry. The LayoutCollector.GetEntity method returns an opaque position of the LayoutEnumerator that corresponds to the specified node. You can assign the returned value to LayoutEnumerator.Current, provided the document being enumerated and the document of the node are the same.
In your case, please use the following code example to get the text along with its page number. I hope this helps. Please let us know if you have any more queries.
Document doc = new Aspose.Words.Document(MyDir + "in.docx");
LayoutCollector layoutCollector = new LayoutCollector(doc);
LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
var collection = doc.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph para in collection)
{
    // Skip paragraphs that contain nested paragraphs.
    if (para.GetChild(NodeType.Paragraph, 0, true) == null)
    {
        // Move the enumerator to the layout entity of this paragraph.
        var renderObject = layoutCollector.GetEntity(para);
        layoutEnumerator.Current = renderObject;
        int page = layoutEnumerator.PageIndex;
        Console.WriteLine(page + " - " + para.ToString(SaveFormat.Text));
    }
}
Hi Tim,
Thanks for your inquiry.
You may also want to look at the DocumentLayoutHelper example which provides a wrapper API for the LayoutEnumerator and provides a property to return the node for a given layout element. Along with Tahir’s first code snippet this should help you achieve what you are looking for.
Cheers,
Hi Adam,
Great tip, I hadn’t spotted that aspect of the sample code.
The sample gave me the idea of first iterating over all shape nodes in the document, using “layoutCollector.GetEntity()” to set the value of “layoutEnumerator.Current”, and building a dictionary that maps those object references to nodes. I can then look up the shape node in that dictionary as I iterate over the document with the LayoutEnumerator, which gives me access to the text in a TextBox.
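For anyone reading later, here is a rough, untested sketch of that dictionary idea. It assumes the object read back from “layoutEnumerator.Current” can be used as a dictionary key and will match the value seen during a later traversal, which this thread does not confirm. The class name, the Walk helper and the "in.docx" path are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Drawing;
using Aspose.Words.Layout;

class TextBoxLayoutSketch
{
    static void Main()
    {
        Document doc = new Document("in.docx"); // hypothetical input path
        LayoutCollector layoutCollector = new LayoutCollector(doc);
        LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);

        // Pass 1: record the enumerator position of every shape node.
        Dictionary<object, Shape> shapesByEntity = new Dictionary<object, Shape>();
        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            object entity = layoutCollector.GetEntity(shape);
            if (entity == null)
                continue; // no layout entity was returned for this shape

            layoutEnumerator.Current = entity;
            // Assumption: Current is a stable key for this layout position.
            shapesByEntity[layoutEnumerator.Current] = shape;
        }

        // Pass 2: walk the layout tree, starting again from the first page.
        layoutEnumerator.Reset();
        Walk(layoutEnumerator, shapesByEntity);
    }

    // Depth-first walk over the current entity, its siblings and their children.
    static void Walk(LayoutEnumerator e, Dictionary<object, Shape> shapesByEntity)
    {
        do
        {
            if (e.Type == LayoutEntityType.Span)
            {
                Shape shape;
                if (e.Kind == "SHAPE" && shapesByEntity.TryGetValue(e.Current, out shape))
                {
                    // Shape span: take the text from the mapped node instead.
                    Console.WriteLine(e.PageIndex + " - " + shape.ToString(SaveFormat.Text));
                }
                else if (e.Text != null)
                {
                    Console.WriteLine(e.PageIndex + " - " + e.Text);
                }
            }

            if (e.MoveFirstChild())
            {
                Walk(e, shapesByEntity);
                e.MoveParent();
            }
        } while (e.MoveNext());
    }
}
```

If the lookup never matches, the key assumption above is probably wrong for that version, and the DocumentLayoutHelper wrapper mentioned earlier would be the safer route.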
Thanks
Tim
Hi Tim,
Great to hear you found what you were looking for. Happy coding!
Cheers,