XPath to make complex node searches faster?

prometil · May 23, 2013, 12:48pm

Hi,
I have read the few posts in the public forums that discuss XPath use in the Aspose product family.
In most cases, the answers provide equivalent solutions using the language APIs (Java, .NET, …), e.g. in Java by using document.getChildNodes(int nodeType, boolean isDeep).
In my case, I have to locate specific paragraphs as fast as possible in large Word documents (100 to 1000 pages), e.g. paragraphs whose outlineLevel equals 1.
I can of course search all nodes of type “Paragraph” and test, for each one, whether getParagraphFormat().outlineLevel + 1 == 1, but performance is very poor for large Word documents.
I do not know the exact XML schema of the Paragraph attributes, but is an XPath query such as //Paragraph/ParagraphFormat[OutlineLevel='1'] possible now or not?
Thanks.

Sebastien

tahir.manzoor · May 24, 2013, 10:17am

Hi Sebastien,

Thanks for your inquiry. Only expressions with element names are supported at the moment; expressions that use attribute names are not supported. In XPath expressions, you should use node names from the Aspose.Words DOM. Please see the following documentation links for reference.

https://reference.aspose.com/words/net/aspose.words/compositenode/selectsinglenode/
https://docs.aspose.com/words/net/aspose-words-document-object-model/

Document doc = new Document(MyDir + "Table.Document.doc");
// This expression will extract all paragraph nodes which are descendants of any table node in the document.
// This will return any paragraphs which are in a table.
NodeList nodeList = doc.SelectNodes("//Table//Paragraph");
// This expression will select any paragraphs that are direct children of any body node in the document.
nodeList = doc.SelectNodes("//Body/Paragraph");
// Use SelectSingleNode to select the first result of the same expression as above.
Node node = doc.SelectSingleNode("//Body/Paragraph");

prometil · May 28, 2013, 4:11am

Hi Tahir,

"Expressions that use attribute names are not supported"

Is full support of XPath features planned for a future version of Aspose.Words?
In my opinion, this is a very important missing feature, particularly for large Word documents, where iterating over large result sets such as //Paragraph or //Body/Paragraph is not the most efficient approach (as in the example in my first post).
Thanks.

Sebastien

tahir.manzoor · May 29, 2013, 1:30am

Hi Sebastien,

Thanks for your inquiry.

prometil:
In my case, I have to locate specific paragraphs as fast as possible in large Word documents (100 to 1000 pages), e.g. paragraphs whose outlineLevel equals 1.

Could you please attach the input Word document in which you are trying to search for paragraphs with the XPath expression //Paragraph/ParagraphFormat[OutlineLevel='1']? I will investigate the issue on my side and provide you with more information.

prometil:
Is full support of XPath features planned for a future version of Aspose.Words?

Please share your input document. I will investigate the issue according to your requirement and log this missing feature.

prometil · May 29, 2013, 8:35am

Hi Tahir,

I cannot post or send you my input Word documents because they are covered by non-disclosure agreements. However, I have made a very simple example Word document whose structure is similar to that of my input documents, but without the data volume.
When I run the XPath query //Paragraph, I get 56 nodes in the result set, but //Paragraph/ParagraphFormat[OutlineLevel='1'] returns no result at all (I think I should get 2 nodes: the paragraphs containing runs whose text is Title 2 or Title 3).
So maybe my XPath query is not consistent with your Word XML schema; I based it only on the Java API relations between Aspose.Words domain objects.
Thanks.

Sebastien

tahir.manzoor · May 30, 2013, 6:10am

Hi Sebastien,

Thanks for sharing the details. Firstly, I suggest you read the following articles, which outline document tree navigation and the DOM:
https://docs.aspose.com/words/java/navigation-with-cursor/
https://docs.aspose.com/words/java/aspose-words-document-object-model/

Secondly, I suggest you use the DOM instead of XPath to get nodes. Please run the following code snippet on your side and compare the time taken to get the Paragraph nodes with Document.selectNodes and with Document.getChildNodes.

Document.getChildNodes is faster than Document.selectNodes (the XPath search). I hope this helps. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "test_aspose.doc");
// Time the XPath search for all Paragraph nodes.
long t = System.currentTimeMillis();
NodeList nodeList = doc.selectNodes("//Paragraph");
System.out.println(System.currentTimeMillis() - t);
// Time the equivalent DOM search (isDeep = true also searches all descendants).
long tDom = System.currentTimeMillis();
NodeCollection paras = doc.getChildNodes(NodeType.PARAGRAPH, true);
System.out.println(System.currentTimeMillis() - tDom);
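
For illustration only, the filtering step discussed in this thread (collect the paragraphs, then keep those at a given outline level) can be sketched as a single pass over the node tree. The sketch below deliberately does not use Aspose.Words: SketchNode, the BODY_TEXT value, and the sample document are hypothetical stand-ins so that the code runs on a plain JDK, and the outline level is zero-based as in the test from the first post. With the real library, the same loop would presumably iterate over the result of getChildNodes(NodeType.PARAGRAPH, true) and test the paragraph format of each node.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class OutlineFilterSketch {

    // Hypothetical stand-in for a document node; this is NOT the Aspose.Words Node class.
    static final class SketchNode {
        final String type;       // "Document", "Body", "Table", "Paragraph", ...
        final int outlineLevel;  // zero-based; only meaningful for paragraphs
        final String text;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String type, int outlineLevel, String text) {
            this.type = type;
            this.outlineLevel = outlineLevel;
            this.text = text;
        }

        SketchNode add(SketchNode child) {
            children.add(child);
            return this;
        }
    }

    // Assumed marker value for ordinary paragraphs that have no outline level.
    static final int BODY_TEXT = 9;

    static SketchNode container(String type) {
        return new SketchNode(type, BODY_TEXT, "");
    }

    static SketchNode paragraph(int outlineLevel, String text) {
        return new SketchNode("Paragraph", outlineLevel, text);
    }

    // Visits every node exactly once, in document order, and keeps the
    // paragraphs whose outline level matches.
    static List<SketchNode> paragraphsAtLevel(SketchNode root, int level) {
        List<SketchNode> matches = new ArrayList<>();
        Deque<SketchNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SketchNode node = stack.pop();
            if (node.type.equals("Paragraph") && node.outlineLevel == level) {
                matches.add(node);
            }
            // Push children in reverse so they are popped in document order.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return matches;
    }

    // Tiny made-up document: two top-level titles, body text, and a table.
    static SketchNode sampleDocument() {
        SketchNode body = container("Body")
            .add(paragraph(0, "Title 2"))
            .add(paragraph(BODY_TEXT, "Some body text"))
            .add(container("Table").add(paragraph(BODY_TEXT, "Cell text")))
            .add(paragraph(0, "Title 3"))
            .add(paragraph(1, "Subtitle"));
        return container("Document").add(body);
    }

    public static void main(String[] args) {
        // Zero-based level 0 corresponds to "outlineLevel + 1 == 1" in the first post.
        for (SketchNode p : paragraphsAtLevel(sampleDocument(), 0)) {
            System.out.println(p.text);
        }
    }
}
```

Run against the made-up sample document, this prints Title 2 and Title 3. The point of the sketch is only that the filter costs one visit per node, so any difference in speed between approaches comes from how the candidate nodes are enumerated, not from the outline-level test itself.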