Splitting docx into multiple HTML documents

karuneshbsarad · January 25, 2017, 5:53am

How can I split a document into multiple HTML files using Aspose.Words for Java?

tahir.manzoor · January 26, 2017, 12:21am

Hi there,

Thanks for your inquiry. You can extract content from the document and save it in HTML format. Please refer to the following article:
Extract Selected Content Between Nodes

If you want to extract a specific page or pages from a Word document, you can do this with the PageSplitter example project. The PageSplitter code is in the Aspose.Words for Java examples repository on GitHub. Please see the code example below for reference.

If this does not help, please share more detail about your requirement along with the input and expected output documents. We will then provide more information, along with code.

import com.aspose.words.Document;
import com.aspose.words.LayoutCollector;

// DocumentPageSplitter comes from the PageSplitter example project, not from the library itself.
// MyDir is a placeholder for your working folder.
// Load the document
Document doc = new Document(MyDir + "in.docx");
// Create and attach collector to the document before page layout is built.
LayoutCollector layoutCollector = new LayoutCollector(doc);
// Split nodes in the document into separate pages.
DocumentPageSplitter splitter = new DocumentPageSplitter(layoutCollector);
// Get the first page of the document and save it to HTML
Document newDoc = splitter.getDocumentOfPage(1);
newDoc.save(MyDir + "Out.html");

karuneshbsarad · January 26, 2017, 4:40am

We want to split the file into multiple HTML files at the Heading 1/Heading 2 paragraphs in the document. Can you please guide us on this?

tahir.manzoor · January 27, 2017, 12:02am

Hi there,

Thanks for sharing the details. In this case, we suggest using the HtmlSaveOptions.DocumentSplitCriteria property with the value DocumentSplitCriteria.HEADING_PARAGRAPH. This property specifies how the document should be split when saving to HTML or EPUB format.

You can use the HtmlSaveOptions.DocumentSplitHeadingLevel property to specify the maximum heading level at which to split the document. The default value is 2. When DocumentSplitCriteria includes HEADING_PARAGRAPH and this property is set to a value from 1 to 9, the document is split at paragraphs formatted with the Heading 1, Heading 2, Heading 3, etc. styles, up to the specified heading level.

By default, only Heading 1 and Heading 2 paragraphs cause the document to be split. Setting this property to zero will cause the document not to be split at heading paragraphs at all.

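import com.aspose.words.Document;
import com.aspose.words.DocumentSplitCriteria;
import com.aspose.words.HtmlSaveOptions;

Document doc = new Document(MyDir + "in.docx");
HtmlSaveOptions options = new HtmlSaveOptions();
// Start a new HTML file at heading paragraphs.
options.setDocumentSplitCriteria(DocumentSplitCriteria.HEADING_PARAGRAPH);
// Split at Heading 1 only; use 2 (the default) to also split at Heading 2.
options.setDocumentSplitHeadingLevel(1);
doc.save(MyDir + "Out v17.1.0.html", options);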
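
To make the heading-level rule described above easier to picture, here is a small stand-alone sketch in plain Java. It does not use Aspose.Words and is not how the library is implemented; HeadingSplitSketch, Para and split are hypothetical names made up for this illustration. It assumes that a new part starts at every paragraph whose heading level is between 1 and the chosen split level, and that a split level of 0 disables splitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of the heading-level rule; not Aspose.Words code.
class HeadingSplitSketch {

    // A paragraph with its heading level: 1-9 for "Heading N" styles, 0 for body text.
    static final class Para {
        final int headingLevel;
        final String text;

        Para(int headingLevel, String text) {
            this.headingLevel = headingLevel;
            this.text = text;
        }
    }

    // Groups paragraphs into parts, starting a new part at every heading whose
    // level is between 1 and splitLevel. A splitLevel of 0 never splits.
    static List<List<Para>> split(List<Para> paras, int splitLevel) {
        List<List<Para>> parts = new ArrayList<>();
        List<Para> current = new ArrayList<>();
        for (Para p : paras) {
            boolean startsNewPart = p.headingLevel >= 1 && p.headingLevel <= splitLevel;
            if (startsNewPart && !current.isEmpty()) {
                parts.add(current);
                current = new ArrayList<>();
            }
            current.add(p);
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Para> doc = Arrays.asList(
            new Para(1, "Chapter 1"),
            new Para(0, "Intro text"),
            new Para(2, "Section 1.1"),
            new Para(0, "Details"),
            new Para(1, "Chapter 2"));
        // Show how many parts each split level produces for the sample above.
        for (int level = 0; level <= 2; level++) {
            System.out.println("split level " + level + " -> "
                + split(doc, level).size() + " part(s)");
        }
    }
}
```

For this sample, the model gives 1 part at split level 0, 2 parts at level 1 and 3 parts at level 2. Please treat it only as an aid for choosing the DocumentSplitHeadingLevel value; the actual output files should be checked against your own document.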