Thanks for your patience. You are creating around 7,680 documents. Performance and memory usage depend on the complexity and size of the documents you are generating. In your case, we suggest using small input documents instead of one large document.
Hi Tahir,
As per your comment, performance is an issue with large documents. I found that when we parse a long document and extract its sections (paragraphs with style Heading 1) one by one, the initial sections (e.g. sections 1, 2, 3, 4 in the example below) take less than a minute each, but for deeper sections such as 15, 26, or 88, the time increases to anywhere from 5 to 30 minutes.
Example: demo sections and their extraction times:
Heading : Extraction Time
InformationModel : 20 seconds
Motivation for learning English : 30 seconds
What is necessary to learn English well? : 55 seconds
Notes Part1 : 60 seconds
What is Notes Part1 : 2 minutes
Results : 4 minutes
Conclusion : 9 minutes
Final conclusion : 20 minutes
88 Read Unlimited : 26 minutes
Last Page : 30 minutes
So I just want to know: is there any mechanism through which we can extract any section by its heading text within 1 minute?
Does Aspose.Words support an XPath-like mechanism to extract a particular section by heading text?
For more detail, please find the attached input documents; I have also attached the source code.
Thanks for your inquiry. In your case, we suggest you use multi-threading. The only thing you need to ensure is that you always use a separate Document instance per thread: one thread should use one Document object. Moreover, you will need to increase the available memory to generate a large number of documents. Hope this helps you.
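The one-Document-per-thread rule above can be sketched with a standard `ExecutorService`. This is a minimal illustration, not Aspose code: `loadAndExtractSection` is a hypothetical stand-in for loading a fresh `Document` and pulling one section from it; the point is only that each task creates its own instance, so no document object is ever shared across threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExtract {
    // Hypothetical stand-in for per-section work; with Aspose.Words this
    // would be `new Document(path)` followed by the extraction logic.
    // One fresh instance per call, never shared between threads.
    static String loadAndExtractSection(String path, int sectionIndex) {
        return path + "#section-" + sectionIndex;
    }

    public static List<String> extractAll(String path, int sectionCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> futures = new ArrayList<>();
        for (int i = 0; i < sectionCount; i++) {
            final int idx = i;
            // Each submitted task loads its own document instance, so no
            // Document object is ever touched by two threads at once.
            futures.add(pool.submit(() -> loadAndExtractSection(path, idx)));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get()); // collect in submission order
        }
        pool.shutdown();
        return results;
    }
}
```

The trade-off is memory: four threads means four full copies of the document in memory at once, which is why the heap size also has to grow.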
You can use the CompositeNode.selectNodes method to select a list of nodes matching an XPath expression. However, expressions that use attribute names are not supported, so this will not work for your case.
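For comparison, attribute predicates do work in the JDK's own XPath engine, so one possible workaround (an assumption on my part, not an Aspose recommendation) is to save the document to an XML flavor such as WordML and query it with `javax.xml.xpath`. The snippet below uses a simplified inline XML string standing in for such an export; the real element and attribute names in WordML differ, so this only shows the JDK XPath mechanics.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDemo {
    // Simplified XML standing in for a WordML-style export.
    static final String XML =
        "<doc>" +
        "<p style='Heading1'>Results</p>" +
        "<p style='Normal'>Body text</p>" +
        "<p style='Heading1'>Conclusion</p>" +
        "</doc>";

    public static int countHeadings() throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        // Attribute predicates like [@style='Heading1'] are supported here,
        // unlike in CompositeNode.selectNodes.
        NodeList hits = (NodeList) xp.evaluate(
                "//p[@style='Heading1']", dom, XPathConstants.NODESET);
        return hits.getLength();
    }
}
```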
As per your suggestion, we have implemented multi-threaded code to read different sections on different threads. This improved performance: previously the process took around 20 minutes, now it takes 8 minutes.
For example, I ran my code 4 times on the same file, Source.docx (size: 4,204 KB, pages: 1,271, total sections: 12). Processing times for the different scenarios were:
Scenario 1: Reading the first 8 sections (1-8) takes 2 min 17 sec.
Scenario 2: Reading the last 8 sections (5-12) takes 5 min 50 sec.
Scenario 3: Reading 8 sections (1-4 and 9-12) takes 5 min 16 sec.
Scenario 4: Reading all 12 sections takes 7 min 10 sec.
I just want to know why extraction time increases as we go into deeper sections.
Is this the expected behaviour of the Aspose API, or are we missing something?
Note: the sections are our logical sections based on bookmarks (paragraphs with style Heading 1), as mentioned in the previous query.
Thanks for your inquiry. You are seeing the expected behavior. We suggest you split your input document into smaller documents and then use your code to extract the content.
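A toy model can show why deeper sections cost progressively more. Assuming (my assumption, not confirmed Aspose internals) that extracting section k re-walks the node tree from the document start up to the end of section k, the total work over all sections grows quadratically, which matches the timings reported above. Plain Java, no Aspose calls:

```java
public class ScanCost {
    // Model: extracting section k visits every node from the start of the
    // document to the end of section k, so the per-section cost grows
    // linearly with depth and the cumulative cost grows quadratically.
    static long nodesVisited(int[] sectionSizes, int upToSection) {
        long visited = 0;
        long end = 0;
        for (int k = 0; k <= upToSection; k++) {
            end += sectionSizes[k]; // node index where section k ends
            visited += end;         // each extraction re-scans from node 0
        }
        return visited;
    }
}
```

With ten sections of 100 nodes each, extracting only the first section visits 100 nodes, while extracting all ten visits 5,500 in total; splitting the file into small documents removes the long prefix each deep extraction has to re-traverse.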
You may save your document as HTML documents and join them again later. The following code example shows how to split the document at heading level 1; it takes around one minute to export the HTML documents. Hope this helps you.
// Load the source document and split the HTML output at every
// level-1 heading; Aspose.Words numbers the output files automatically.
Document doc = new Document(MyDir + "Input_Sample.doc");
HtmlSaveOptions options = new HtmlSaveOptions();
options.setDocumentSplitHeadingLevel(1);
options.setDocumentSplitCriteria(DocumentSplitCriteria.HEADING_PARAGRAPH);
doc.save(MyDir + "out.html", options);
Thanks for your inquiry. We have logged a feature request as WORDSNET-16966 to split a Word document according to DocumentSplitCriteria.HeadingParagraph. We will inform you via this forum thread once any update is available on this feature. We apologize for the inconvenience.
Please use the following code example with the latest version of Aspose.Words for Java 18.6. It takes around one minute for a 13.5 MB document and generates 7,678 documents.
public static void extractDocument(Document doc) throws Exception {
    // Collect every heading paragraph in the document.
    NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
    ArrayList<Paragraph> headings = new ArrayList<>();
    for (Paragraph para : (Iterable<Paragraph>) nodes) {
        if (para.getParagraphFormat().isHeading()) {
            headings.add(para);
        }
    }
    System.out.println("Number of heading paragraphs: " + headings.size());

    // Extract the content between each pair of consecutive headings
    // and save it as a separate WordML document.
    ArrayList<Node> extractedNodes;
    for (int i = 1; i < headings.size() - 1; i++) {
        extractedNodes = ExtractContents.extractContent(headings.get(i), headings.get(i + 1), false);
        ExtractContents.generateDocument(doc, extractedNodes).save(
                MyDir + "out/out" + i + ".xml", SaveFormat.WORD_ML);
        System.out.println("Saving " + i);
    }
}