Processing time is too slow

Hi Team

As per our requirement, we need to extract a DOCX file into XML by paragraph or bookmark, and then generate a DOCX file again from the extracted XML.

It works fine with small files of 1 or 2 MB, but when we try to extract a large file of 10 MB, it takes around 20 to 30 minutes.

Is there any optimization that would reduce the extraction and building time?

Please find attached the Java code and input document.
For aspose.zip (709.9 KB)

Thank you
Purushottam Sadh

@purusadh2003,

Thanks for your inquiry. Please note that performance and memory usage depend on the complexity and size of the documents you are generating.

We are investigating your issue and will get back to you soon.

@purusadh2003,

Thanks for your patience. You are creating around 7680 documents. Performance and memory usage depend on the complexity and size of the documents you are generating. In your case, we suggest using several small input documents instead of one big document.

Hi Tahir,
As per your comment, performance is the issue with large documents. I found that the slowdown appears once we parse a long document and extract its sections (the content under each Heading 1 paragraph, located by the heading text with style Heading 1) one by one.

The initial sections in the example below (1, 2, 3, 4) take no more than 1 minute each, but once we go into deeper sections such as 15, 26 or 88, the time increases to between 5 and 30 minutes.

Ex: Demo sections…

  • Section heading : Extraction time
  • InformationModel : 20 seconds
  • Motivation for learning English : 30 seconds
  • What is necessary to learn English well? : 55 seconds
  • Notes Part1 : 60 seconds
  • What is Notes Part1 : 2 minutes
  • Results : 4 minutes
  • Conclusion : 9 minutes
  • Final conclusion : 20 minutes
  • 88 Read Unlimited : 26 minutes
  • Last Page : 30 minutes

So, I would like to know whether there is any mechanism through which we can extract any section by its heading text within 1 minute.

Does Aspose.Words support an XPath-like mechanism to extract a particular section by heading text?

For more detail, please find the attached input documents and source code.

For aspose.zip (704.9 KB)

Thank you!

@purusadh2003,

Thanks for your inquiry. In your case, we suggest using multi-threading. The only thing you need to make sure of is that each thread always uses its own separate Document instance: one thread should use one Document object. Moreover, you need to increase the available memory to generate a large number of documents. Hope this helps you.
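To illustrate the "separate Document instance per thread" rule, here is a minimal sketch using only the JDK. The class LoadedDocument and its extractSection method are hypothetical stand-ins for a real document object, not the Aspose.Words API; the point is only that every task creates its own instance and nothing is shared between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadDocumentSketch {

    // Hypothetical stand-in for a loaded document; NOT the Aspose.Words API.
    static final class LoadedDocument {
        private final String path;

        LoadedDocument(String path) {
            this.path = path;
        }

        // Placeholder for the real extraction work on one logical section.
        String extractSection(int sectionIndex) {
            return path + "#section-" + sectionIndex;
        }
    }

    // Runs one extraction task per section on a fixed-size thread pool.
    static List<String> extractAll(String path, int sectionCount, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 1; i <= sectionCount; i++) {
                final int section = i;
                Callable<String> task = () -> {
                    // Each task loads its own instance, so no document object is shared.
                    LoadedDocument doc = new LoadedDocument(path);
                    return doc.extractSection(section);
                };
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractAll("Source.docx", 12, 4));
    }
}
```

Loading a fresh instance in every task is the simplest way to guarantee isolation, but it costs extra memory and load time for a large file. Keeping one instance per worker thread (for example in a ThreadLocal) would be a possible alternative.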

You can use the CompositeNode.SelectNodes method to select a list of nodes matching an XPath expression. However, expressions that use attribute names are not supported, so it cannot be used for your case.

Hi Tahir,
Thanks for the reply.

As per your suggestion, we have implemented multithreaded code that reads different sections in different threads. This improved the performance of the process: previously it took around 20 minutes, and now it takes 8 minutes.

For example, I ran my code 4 times on the same file, Source.docx (size: 4204 KB, number of pages: 1271, total sections: 12).

Processing times for the different scenarios are:
Scenario 1: Reading the first 8 sections (1-8) takes 2 min 17 sec.
Scenario 2: Reading the last 8 sections (5-12) takes 5 min 50 sec.
Scenario 3: Reading 8 sections (1-4 and 9-12) takes 5 min 16 sec.
Scenario 4: Reading all 12 sections takes 7 min 10 sec.

So the extraction time increases once we go into the deeper sections.

Is this the expected behaviour of the Aspose API, or are we missing something?

Note: Sections here are our logical sections based on bookmarks (Heading 1 paragraphs), as mentioned in the previous query.

Thank you!

@purusadh2003,

Thanks for your inquiry. This is the expected behavior. We suggest splitting your input document into smaller documents and using your code to extract the content from those.

You may save your document as HTML documents and join them again. The following code example shows how to split the document at heading level 1. It takes around one minute to export the HTML documents. Hope this helps you.

Document doc = new Document(MyDir + "Input_Sample.doc");
HtmlSaveOptions options = new HtmlSaveOptions();
options.setDocumentSplitHeadingLevel(1);
options.setDocumentSplitCriteria(DocumentSplitCriteria.HEADING_PARAGRAPH);

doc.save(MyDir + "out.html", options);

Hi Tahir,

I have applied the suggested approach: saving the document as HTML and then rebuilding the DOCX from the HTML.

I am facing a lot of styling issues when generating the DOCX from HTML.

  1. Formulas are saved as images, so after building the DOCX we can't modify them.
  2. Bullet points are repeated for all heading types (Heading 1, Heading 2).
  3. The table of contents is not populated properly.
  4. Left and right paragraph margins do not come out properly.
  5. Page orientation (landscape and portrait) is not maintained.

For aspose html to docx.zip (750.5 KB)

Please find attached java code with source and output documents.

Thanks
Purushottam

@purusadh2003,

Thanks for your inquiry. Please try the following code example. Hope this helps you.

Document doc = new Document(MyDir + "source.docx");
HtmlSaveOptions options = new HtmlSaveOptions();
options.setDocumentSplitHeadingLevel(1);
options.setDocumentSplitCriteria(DocumentSplitCriteria.HEADING_PARAGRAPH);
options.setExportPageSetup(true);
options.setExportPageMargins(true);
options.setOfficeMathOutputMode(HtmlOfficeMathOutputMode.MATH_ML);

doc.save(MyDir + "out.html", options);

In this case, we suggest removing the TOC field from the source document and inserting it again at the same location in the final document.

Hi Tahir,

Thanks for the update.

I made the code changes you suggested. The issue related to formulas has been fixed, but the other issues remain the same.

  1. Bullet points are repeated for all heading types (Heading 1, Heading 2).
  2. Left and right paragraph margins do not match the source document.
  3. Page orientation (landscape and portrait) is not maintained.
  4. There are alignment issues at multiple locations.

Could you please suggest a proper solution?

Bullet issue.docx_Word.png (43.1 KB)
alignment issue.png (59.0 KB)

Thank you!
Purushottam

@purusadh2003,

Thanks for your inquiry. We have logged a feature request as WORDSNET-16966 to split a Word document according to DocumentSplitCriteria.HeadingParagraph. We will inform you via this forum thread once there is any update available on this feature. We apologize for the inconvenience.

@purusadh2003,

Thanks for your patience.

Please use the following code example with the latest version of Aspose.Words for Java 18.6. It takes around one minute for a 13.5 MB document and generates 7678 documents.

import java.util.ArrayList;
import com.aspose.words.*;

// ExtractContents is the helper class used earlier in this thread; it is not defined here.
public static void extractDocument(Document doc) throws Exception {
    // Collect all paragraphs in the document, including nested ones.
    NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);

    // Keep only the heading paragraphs; they mark the section boundaries.
    ArrayList<Paragraph> headings = new ArrayList<Paragraph>();
    for (Paragraph para : (Iterable<Paragraph>) nodes) {
        if (para.getParagraphFormat().isHeading()) {
            headings.add(para);
        }
    }

    System.out.println("Number of heading paragraphs: " + headings.size());

    // Extract the content between each pair of consecutive headings and save it as WordML.
    for (int i = 1; i < headings.size() - 1; i++) {
        ArrayList extractedNodes = ExtractContents.extractContent(headings.get(i), headings.get(i + 1), false);
        ExtractContents.generateDocument(doc, extractedNodes)
                .save(MyDir + "out//out" + i + ".xml", SaveFormat.WORD_ML);
        System.out.println("Saving " + i);
    }
}
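
The idea behind this approach (find all headings in a single pass, then cut the content between consecutive headings) can be sketched on a plain list without Aspose. The Para class below is a hypothetical model of a paragraph, not an Aspose.Words type, and unlike the loop above this sketch also covers the first heading and the content after the last one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeadingRangeSketch {

    // Hypothetical paragraph model; NOT an Aspose.Words type.
    static final class Para {
        final boolean heading;
        final String text;

        Para(boolean heading, String text) {
            this.heading = heading;
            this.text = text;
        }
    }

    // Returns, for each heading, the texts of the paragraphs that follow it
    // up to (but not including) the next heading.
    static List<List<String>> splitByHeading(List<Para> paras) {
        // Single pass: remember the index of every heading.
        List<Integer> headingIndexes = new ArrayList<>();
        for (int i = 0; i < paras.size(); i++) {
            if (paras.get(i).heading) {
                headingIndexes.add(i);
            }
        }

        List<List<String>> chunks = new ArrayList<>();
        for (int h = 0; h < headingIndexes.size(); h++) {
            int start = headingIndexes.get(h) + 1;
            // The last section runs to the end of the document.
            int end = (h + 1 < headingIndexes.size()) ? headingIndexes.get(h + 1) : paras.size();
            List<String> body = new ArrayList<>();
            for (int i = start; i < end; i++) {
                body.add(paras.get(i).text);
            }
            chunks.add(body);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Para> paras = Arrays.asList(
                new Para(true, "Results"),
                new Para(false, "First result paragraph"),
                new Para(false, "Second result paragraph"),
                new Para(true, "Conclusion"),
                new Para(false, "Closing paragraph"));
        System.out.println(splitByHeading(paras));
    }
}
```

Because the heading positions are computed once, each section is located by index instead of being searched for from the start of the document every time.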