Issue splitting a multi-page Word document into HTML pages when saving as EPUB

I have an issue converting a Word document (24 pages) to EPUB. The requirement is to have a separate HTML page for each page of the Word document, so ideally I should get 24 HTML pages. I am using the code below:

String dataDir = "/home/nirmalap/workspace-word/wordSample1/";
FileInputStream fstream = new FileInputStream(dataDir + "Aspose.Words.lic"); 
License license = new License();	
license.setLicense(fstream);  

Document doc = new Document(dataDir + "01 Keown_Text_MS-10e_Ch01_vim_AJK.docx");
HtmlSaveOptions saveOptions = new HtmlSaveOptions(); 
saveOptions.setEncoding(Charset.forName("UTF-8")); 
saveOptions.setDocumentSplitCriteria(DocumentSplitCriteria.PAGE_BREAK); 
saveOptions.setExportDocumentProperties(true); 
saveOptions.setSaveFormat(SaveFormat.EPUB); 
doc.save(dataDir + "Document.EpubConversion_out.epub", saveOptions); 

But when I look at the EPUB, it has only two HTML files. Could you let me know if I am missing anything here?

Appreciate your earliest response.

@nirmalap

You are using HtmlSaveOptions correctly. When you call setDocumentSplitCriteria(DocumentSplitCriteria.PAGE_BREAK), the document is split into parts at explicit page breaks. To achieve your requirement, you need to insert an explicit page break at the end of each page.
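To illustrate why a 24-page document can produce only two HTML files, here is a toy model in plain Java. It is a sketch, not Aspose.Words code: the class `SplitAtExplicitBreaks` and the method `countParts` are hypothetical, and a form feed character stands in for an explicit page break. Pages that arise only because text flows onto the next page contain no such marker, so they do not create a new part.

```java
class SplitAtExplicitBreaks {
    // Assumption: a form feed (U+000C) represents one explicit page break in this toy model.
    static final char PAGE_BREAK = '\f';

    // Returns the number of parts produced when the text is cut at every explicit break.
    static int countParts(String text) {
        int parts = 1;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == PAGE_BREAK) {
                parts++;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // One explicit break, however many pages the remaining text flows across.
        String oneBreak = "title page" + PAGE_BREAK + "many pages of flowing text";
        System.out.println(countParts(oneBreak)); // prints 2
    }
}
```

In this model the number of parts is always the number of explicit breaks plus one, which matches the two HTML files reported above if the document contains a single explicit page break.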

Moreover, you can insert page breaks into the document using the Aspose.Words API. Please move the cursor to the last node of each page and insert a page break using the DocumentBuilder.insertBreak method. You can get the last node of a page using the layout API of Aspose.Words.

Thanks, Tahir. I tried to find sample code for looking up the last node of a page in a Word document, but was unable to. Could you point me to a couple of examples of how to do that?

@nirmalap

Please use the following code example to achieve your requirement. Hope this helps you.

    Document doc = new Document(MyDir + "in.docx");
    DocumentBuilder builder = new DocumentBuilder(doc);
    // LayoutCollector maps document nodes to the pages they are laid out on.
    LayoutCollector collector = new LayoutCollector(doc);
    int page = 1;
    for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
    {
        // Insert a page break before any paragraph whose start page differs from `page`.
        if (collector.getStartPageIndex(paragraph) != page)
        {
            builder.moveTo(paragraph);
            builder.insertBreak(BreakType.PAGE_BREAK);
        }
    }

    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.setEncoding(Charset.forName("UTF-8"));
    saveOptions.setDocumentSplitCriteria(DocumentSplitCriteria.PAGE_BREAK);
    saveOptions.setExportDocumentProperties(true);
    saveOptions.setSaveFormat(SaveFormat.EPUB);
    doc.save(MyDir + "Document.EpubConversion_out.epub", saveOptions);

Thanks, Tahir. I have tried out the code, but it doesn't meet my requirement: when I open the Word document in MS Word it shows only 23 pages, but the code added a total of 321 page breaks. Would it be possible to add a single page break for each physical page in the Word document?
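A likely reason for the large count is that the posted loop never changes `page`, so every paragraph that starts anywhere after page 1 receives a break. The plain-Java sketch below shows the effect on made-up data; it does not use Aspose.Words. The class `PageBreakCount`, both method names, and the `startPages` array (standing in for the values `getStartPageIndex` would return per paragraph) are hypothetical.

```java
class PageBreakCount {
    // Mirrors the posted loop: `page` stays at 1 for the whole traversal.
    static int countWithFixedPage(int[] startPages) {
        int page = 1;
        int breaks = 0;
        for (int startPage : startPages) {
            if (startPage != page) {
                breaks++;
            }
        }
        return breaks;
    }

    // Remembers the last page seen, so only the first paragraph of each new page counts.
    static int countPerPage(int[] startPages) {
        int page = 1;
        int breaks = 0;
        for (int startPage : startPages) {
            if (startPage > page) {
                breaks++;
                page = startPage;
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Hypothetical start pages of seven paragraphs spread over three pages.
        int[] startPages = {1, 1, 2, 2, 2, 3, 3};
        System.out.println(countWithFixedPage(startPages)); // prints 5
        System.out.println(countPerPage(startPages));       // prints 2
    }
}
```

With a fixed counter the number of breaks grows with the number of paragraphs rather than the number of pages, which is consistent with 321 breaks for a document of about 23 pages.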

@nirmalap

Please ZIP and attach your input Word document and expected EPUB file here for testing. We will investigate the issue and provide you more information on it.

Let me investigate further what options we have, and I will let you know the exact requirement. Thanks for the support given.

@nirmalap

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Just to add a few more things to what we discussed above. We have two requirements:

  1. Convert a Word document (book) into EPUB
  2. Convert a PDF document (book) into EPUB

Does Aspose.Words support both of these requirements? In the Aspose API documentation, the LoadFormat class lists PDF and the SaveFormat class lists EPUB. Can we use these features to meet the two requirements above?

For licensing and costing, it would be better if we could get both features in a single product. Can you advise?

@nirmalap

Yes, you can import Word and PDF documents into Aspose.Words’ DOM and save the document to EPUB.

But when I tried this, I got the error below:

Exception in thread "main" com.aspose.words.UnsupportedFileFormatException: Pdf format is not supported on this platform. Use .NET Standard or .NET 4.6.1 version of Aspose.Words for loading Pdf documents.
	at com.aspose.words.zzZ34.zzLs(Unknown Source)
	at com.aspose.words.Document.zzY(Unknown Source)
	at com.aspose.words.Document.zzZ(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at word.sample.PdfSample.main(PdfSample.java:19)

The code I am using is as follows:

String dataDir = "/home/nirmalap/workspace-word/wordSample1/";
FileInputStream fstream = new FileInputStream(dataDir + "Aspose.Words.lic"); 
License license = new License();	
license.setLicense(fstream);  
		
Document doc = new Document(dataDir + "1621_ladder.pdf");		
PdfSaveOptions saveOptions = new PdfSaveOptions();
saveOptions.setDisplayDocTitle(true);		
doc.save(dataDir + "Test File.Pdf",saveOptions);
saveOptions.setSaveFormat(SaveFormat.EPUB);		
doc.save(dataDir + "Document.EpubConversion_out.epub", saveOptions);

Is something wrong in the code?

@nirmalap

Please use the latest version of Aspose.Words for .NET 20.9. If you still face a problem, please ZIP and attach your input document here for testing. We will investigate the issue and provide you with more information on it.

I am using Aspose.Words for Java version 20.6 (aspose-words-20.6-jdk17.jar), not the .NET version, but I am getting the .NET error above. Please advise.

I have updated Aspose.Words for Java to version 20.9 (aspose-words-20.9-jdk17.jar) but am still getting the same error. The file is attached for you to investigate further.
CVR_BERK3809_05_SE_BEP.pdf (120.3 KB)

@nirmalap

Please accept my apologies for your inconvenience. Unfortunately, this feature is not available in Aspose.Words for Java. We logged this feature request as WORDSJAVA-2366 in our issue tracking system. You will be notified via this forum thread once this feature is available.

Thanks. Do you have a timeline for when this feature will be available?

@nirmalap

Unfortunately, there is no ETA available for this feature at the moment. We will inform you via this forum thread once there is an update available on it.

I have attached the source code and the Word document for you to investigate. My requirements and observations are:

  • A page break should be added at the end of each physical page.
  • DocumentSplitCriteria.PAGE_BREAK should then split the document into exactly that number of HTML files.
  • The attached code reports 83 as the number of physical pages.
  • The code adds 1645 page breaks instead of 83.
  • I hope doc.getPageCount() gives the correct number of physical pages.

sample.zip (253.5 KB)

Please advise.

@nirmalap

We have modified your code example to get the desired output. Hope this helps you.

    Document doc = new Document(MyDir + "Bozarth_ch05_ed.doc");
    doc.acceptAllRevisions();
    System.out.println("Actual page count : "+doc.getPageCount());

    addPageBreaks(doc);
    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.setEncoding(Charset.forName("UTF-8"));
    saveOptions.setDocumentSplitCriteria(DocumentSplitCriteria.PAGE_BREAK);
    saveOptions.setExportDocumentProperties(true);
    saveOptions.setSaveFormat(SaveFormat.EPUB);
    doc.save(MyDir + "20.9.epub", saveOptions);

    private static void addPageBreaks(Document doc) throws Exception {
        DocumentBuilder builder = new DocumentBuilder(doc);
        // LayoutCollector maps document nodes to the pages they are laid out on.
        LayoutCollector collector = new LayoutCollector(doc);
        int page = 1;   // page the loop currently expects paragraphs to start on
        int count = 1;  // running number of inserted breaks, for logging only

        for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
        {
            // A start page different from the expected one triggers a break.
            if (collector.getStartPageIndex(paragraph) != page)
            {
                try {
                    builder.moveTo(paragraph);
                    builder.insertBreak(BreakType.PAGE_BREAK);
                    System.out.println("page break : " + count++);
                    page++;
                } catch (Exception ex) {
                    System.out.println(ex.getMessage());
                }
            }
        }
    }

Thanks, Tahir. But now I am getting 680 HTML pages in the EPUB. Ideally it should be 83 HTML pages, equal to the value returned by doc.getPageCount(). I have attached the Word document, the updated code, and the EPUB for you to investigate. It still doesn't meet my requirement. Please advise.

sample2.zip (896.0 KB)
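One possible contributor to the overshoot, offered as an assumption that this thread does not confirm: the loop compares with `!=` and then advances the counter by exactly one. If any paragraph reports a start page that is not the counter or the counter plus one (for example a paragraph whose page cannot be resolved, represented here by 0), the counter and the real page index fall out of step and most later paragraphs trigger a break. The plain-Java sketch below reproduces that drift on made-up data; `PageCounterDrift`, both method names, and the `startPages` values are hypothetical and no Aspose.Words call is involved.

```java
import java.util.ArrayList;
import java.util.List;

class PageCounterDrift {
    // Mirrors the posted loop: break whenever the index differs, then step the counter by one.
    static List<Integer> breaksWithIncrement(int[] startPages) {
        List<Integer> breakPositions = new ArrayList<>();
        int page = 1;
        for (int i = 0; i < startPages.length; i++) {
            if (startPages[i] != page) {
                breakPositions.add(i);
                page++;
            }
        }
        return breakPositions;
    }

    // Alternative: break only when a later page is reached, then adopt that page index.
    static List<Integer> breaksWithAssignment(int[] startPages) {
        List<Integer> breakPositions = new ArrayList<>();
        int page = 1;
        for (int i = 0; i < startPages.length; i++) {
            if (startPages[i] > page) {
                breakPositions.add(i);
                page = startPages[i];
            }
        }
        return breakPositions;
    }

    public static void main(String[] args) {
        // Hypothetical start pages; 0 stands for a paragraph with no resolvable page.
        int[] startPages = {1, 0, 1, 1, 2, 2, 3};
        System.out.println(breaksWithIncrement(startPages));  // prints [1, 2, 3, 4, 5, 6]
        System.out.println(breaksWithAssignment(startPages)); // prints [4, 6]
    }
}
```

In the real loop the equivalent change would be to store `collector.getStartPageIndex(paragraph)` in a local variable, compare it with `>` instead of `!=`, and assign it to `page` instead of incrementing. Whether that alone yields 83 HTML files for the attached document has not been tested here.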