Word - convert doc to epub - XMLStreamException: No open start element- when trying to write end element

ollav · August 6, 2012, 2:34pm

I just converted 5,000 Word DOC files into ePubs using Aspose, and for the most part the results were great, but for a few hundred files doc.save threw an XMLStreamException. This happens when I use DocumentSplitCriteria=PAGE_BREAK; when I switch to HEADING_PARAGRAPH the issue does not occur. PAGE_BREAK seems to work better for getting our eBooks to open quickly (and is more consistent with chapter breaks). As you can see, most of the code is there to make sure the document contains a reasonable number of breaks.

I run the conversions in ColdFusion. I would be happy to provide source documents, but I cannot share them publicly. If you can provide me a way to send them (email?) then I will happily send examples.

Here is the part of my code that uses Aspose. (TitleData is query data.)

<cfset LicFile=CreateObject("java", "java.io.FileInputStream").init(JavaCast("string","/myhouse/java/aspose/words/Aspose.Total.Java.lic"))>
<cfset CreateObject("java", "com.aspose.words.License").setLicense(LicFile)>

<cfset doc=CreateObject("java", "com.aspose.words.Document").init(arguments.sourceDoc)>

<cfset paragraphs=doc.getChildNodes(CreateObject("java", "com.aspose.words.NodeType").PARAGRAPH, true)>

<cfset ControlChar=CreateObject("java", "com.aspose.words.ControlChar").page_break>

<cfif isDefined("curNode")>
<cfif curNode.getParagraphFormat().getPageBreakBefore()>

<cfif isDefined("curRun") and curRun.Text.Contains(ControlChar)>

<cfif totalOriginalPageBreaks gt (maxKeys/500)>

<cfif isDefined("curNode")>
<!---
check the previous and next node to be reasonably sure we're not within the table of contents
also, if a normal paragraph begins with the word chapter, hopefully it is less than 100 chars
--->
<cfif
curkey lt maxKeys
AND len(trim(curNode.getText())) lt 100
AND compareNoCase(left(trim(curNode.getText()), len("chapter")), "chapter") is 0
AND compareNoCase(left(trim(paragraphs.get(curkey+1).getText()), len("chapter")), "chapter") neq 0
AND compareNoCase(left(trim(paragraphs.get(curkey-1).getText()), len("chapter")), "chapter") neq 0
>

<cfset curNode.getParagraphFormat().setPageBreakBefore(true)>

<cfset curNode.getParagraphFormat().setPageBreakBefore(false)>

<cfif isDefined("curNode")>
<cfset curNode.getParagraphFormat().setPageBreakBefore(true)>

<cfset SaveFormat=CreateObject("java", "com.aspose.words.SaveFormat")>
<cfset saveOptions=CreateObject("java", "com.aspose.words.HtmlSaveOptions").init(SaveFormat.ePub)>

<cfset saveOptions.setDocumentSplitCriteria(CreateObject("java", "com.aspose.words.DocumentSplitCriteria").PAGE_BREAK)>
<cfset saveOptions.setDocumentSplitHeadingLevel(9)>

<cfset saveOptions.setSaveFormat(SaveFormat.EPUB)>
<cfset saveOptions.setEncoding(CreateObject("java", "java.nio.charset.Charset").forName("UTF-8"))>

<cfset doc.save(arguments.ePubPath & "/" & arguments.ePubFileName, saveOptions)>

As a side note, I was looking at Calibre, and it seems to split documents based on byte count instead of the kinds of criteria that Aspose provides. It would be useful if Aspose let you set a byte count and simply broke the files at that size without treating each piece as a chapter (which would change how the NCX and OPF are written).
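To make the byte-count idea concrete, here is a rough sketch in plain Java. It is not Aspose or Calibre code: the class ByteBudgetSplitter and the method splitPoints are made-up names, and it assumes the document has already been reduced to a list of paragraph strings. It only decides where a new file would start, based on UTF-8 size.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.Words or Calibre.
public class ByteBudgetSplitter {

    /**
     * Returns the zero-based indexes of paragraphs that should start a new part,
     * so that each part holds at most maxBytes of UTF-8 text where possible.
     * A single paragraph larger than maxBytes ends up in a part of its own.
     */
    public static List<Integer> splitPoints(List<String> paragraphs, int maxBytes) {
        List<Integer> starts = new ArrayList<>();
        int used = 0;
        for (int i = 0; i < paragraphs.size(); i++) {
            int size = paragraphs.get(i).getBytes(StandardCharsets.UTF_8).length;
            // Start a new part when this paragraph would push the current part
            // over the budget, unless the current part is still empty.
            if (used > 0 && used + size > maxBytes) {
                starts.add(i);
                used = 0;
            }
            used += size;
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("aaaa", "bbbb", "cccc", "dd");
        // With an 8-byte budget the third paragraph (index 2) starts a new part.
        System.out.println(splitPoints(paragraphs, 8));
    }
}
```

The split points are independent of chapter structure, which is the point of the request: the parts would exist only to keep each file small, so they would not need their own entries in the table of contents.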

tilal.ahmad · August 7, 2012, 12:47pm

Hi,

Thanks for your request. Could you please share sample Java code along with a sample document? We will test it on our end and advise you accordingly.

Sharing documents over the forum is safe, as only the owner and Aspose staff have access to them. However, you can also share the documents via email using the contact tab in the post.

Please feel free to contact us for any further assistance.
Best Regards,

ollav · August 9, 2012, 5:36pm

Did you receive the Word files I sent via email?

If I write a stand alone code sample in ColdFusion will you be able to run that or do you need a Java version?

tilal.ahmad · August 10, 2012, 12:56am

Hi,

Thanks, I have received the sample documents attached to your email. I would appreciate it if you could provide a Java code sample so that we can investigate the issue on our side.

Best Regards,

wrood · August 15, 2012, 3:57pm

Here is a Java version of the code that produces the error.

import java.io.FileInputStream;
import java.nio.charset.Charset;

import com.aspose.words.*;

public class epub {

    /**
     * @param args command-line arguments (unused)
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        
        // This will produce an epub
        //String wordDocFileName = "aspose_test_case.doc";
        //String epubFilename = "aspose_test_case.epub";
        
        // This will cause the exception error
        String wordDocFileName = "aspose_error_case.doc";
        String epubFilename = "aspose_error_case.epub";

        String sPath = "U://eclipse//aspose//src//";

        String strLicense = "Aspose.Total.Java.lic";
        //File file = new File(sPath + strLicense);
        //FileInputStream licFile;

        License asposeLic;
        System.out.println("Hello step 1");
        try {
            System.out.println("Hello step 2");
            asposeLic = new License();
            asposeLic.setLicense(new FileInputStream(sPath + strLicense));
            System.out.println("Hello step 3");
        } catch (Throwable t) {
            System.out.println("Error reading License File");
            System.exit(0);
        };

        int iTotalOrgPageBreaks = 0;
        int iRuns = 0;

        boolean insertedBreak = false;

        System.out.println("Hello step 4");

        //try {
        // asposeLic.setLicense(strLicense); 
        //} catch (Throwable t) {
        // System.out.println("Error reading Aspose License file.");
        // System.exit(0);
        //} 

        try {

            System.out.println("Hello step 6");

            Document wordDoc = new Document(sPath + wordDocFileName);
            wordDoc.getBuiltInDocumentProperties().setAuthor("Random House, Inc");
            wordDoc.getBuiltInDocumentProperties().setTitle("Random House Test Document - all rights reserved");

            System.out.println("Hello step 7");

            NodeCollection paragraphs = wordDoc.getChildNodes(NodeType.PARAGRAPH, true);

            String sPageBreak = ControlChar.PAGE_BREAK;
            int maxKeys = paragraphs.getCount();
            for (int i=2; i <= maxKeys; i++) {
                Paragraph curNode = (Paragraph) paragraphs.get(i);
                System.out.println("Hello step 7 [" + i + "]");
                if ( curNode != null ) {
                    if (curNode.getParagraphFormat().getPageBreakBefore()) {
                        System.out.println("Hello step 7 [" + i + "] Page Break");
                        iTotalOrgPageBreaks++;
                    };
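                    // Run indexes are zero-based, so start at 0 and stop before getCount().
                    for (int j = 0; j < curNode.getRuns().getCount(); j++) {
                        Run curRun = curNode.getRuns().get(j);
                        System.out.println("Hello step 7 [" + i + "] [" + j + "]");
                        if (curRun != null) {
                            // Check the run's text (as the ColdFusion version does), not its toString() value.
                            if (curRun.getText().contains(sPageBreak)) {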
                                System.out.println("Hello step 7 [" + i + "] [" + j + "] Page Break");
                                iTotalOrgPageBreaks++;
                            };
                        };
                    }
                };

            };

            System.out.println("Hello step 8");

            if (iTotalOrgPageBreaks > (maxKeys/500) ) {
                insertedBreak = true;
            };

            if ( !insertedBreak ) {
                for (int i = 2; i <= maxKeys; i++) {
                    Paragraph curNode = (Paragraph) paragraphs.get(i);
                    System.out.println("Hello step 8 [" + i + "]");
                    if ( curNode != null ) {
                        String sCheck = "chapter";
                        //if (i == 503) {
                        // System.out.println("[" + curNode.getText().trim().toLowerCase() + "]");
                        // System.out.println(curNode.getText().trim().length());
                        // System.out.println("");
                        // System.out.println("[" + paragraphs.get(i + 1).getText().trim().toLowerCase() + "]");
                        // System.out.println("");
                        // System.out.println("[" + paragraphs.get(i - 1).getText().trim().toLowerCase() + "]" );
                        // System.out.println("********************");
                        // System.exit(0);
                        // };
                        if ( (i < maxKeys) && ( curNode.getText().trim().length() < 100 )
                                && ( curNode.getText().trim().toLowerCase().startsWith(sCheck) )
                                && ( paragraphs.get(i + 1).getText().trim().toLowerCase().startsWith(sCheck) == false)
                                && ( paragraphs.get(i - 1).getText().trim().toLowerCase().startsWith(sCheck) == false )
                        ) {
                            insertedBreak = true;
                            curNode.getParagraphFormat().setPageBreakBefore(true);
                            System.out.println("Hello step 8 [chapter] [" + i + "] True" );
                        } else {
                            curNode.getParagraphFormat().setPageBreakBefore(false);
                            System.out.println("Hello step 8 [chapter] [" + i + "] False" );
                        };
                    };
                };
            };

            System.out.println("paragraphs.getCount() = " + paragraphs.getCount() + " insertedBreak = " + insertedBreak);
            System.out.println("Hello step 9");
            if ( !insertedBreak ) {
                for ( int i=1; i<= paragraphs.getCount(); i += 500) {
                    Paragraph curNode = (Paragraph) paragraphs.get(i);
                    System.out.println("Hello step 9 [" + i + "]");
                    if ( curNode != null ) {
                        curNode.getParagraphFormat().setPageBreakBefore(true);
                        System.out.println("Hello step 9 [" + i + "] True" );
                    };
                };
            };

            //SaveFormat saveFormat = new SaveFormat().;

            HtmlSaveOptions saveOption = new HtmlSaveOptions(SaveFormat.EPUB);
            saveOption.setDocumentSplitCriteria(DocumentSplitCriteria.PAGE_BREAK);
            saveOption.setDocumentSplitHeadingLevel(9);
            saveOption.setSaveFormat(SaveFormat.EPUB);
            saveOption.setEncoding(Charset.forName("UTF-8"));

            System.out.println("Hello step 10");

            wordDoc.save(sPath + epubFilename, saveOption);
            System.out.println("Done. epub created.");

        } catch (Throwable t) {

            t.printStackTrace();
            System.exit(0);
        }
    }
}
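For readers who want the break-placement logic without the Aspose calls, here is a simplified sketch of the same three steps: skip everything if the document already has enough breaks, otherwise break before lone "chapter" headings, otherwise break every N paragraphs. It is an illustration, not the code that produced the error. BreakPlanner, planBreaks and looksLikeChapterHeading are made-up names, paragraphs are plain strings, indexes are zero-based, and neighbours are only checked when they exist. Unlike the listing above, it returns the chosen indexes instead of changing paragraph formatting, and the fallback starts at index N rather than at the first paragraphs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that models paragraphs as plain strings; no Aspose types involved.
public class BreakPlanner {

    public static boolean startsWithChapter(String text) {
        return text.trim().toLowerCase(Locale.ROOT).startsWith("chapter");
    }

    /** True when paragraph i looks like a lone chapter heading rather than a table-of-contents row. */
    public static boolean looksLikeChapterHeading(List<String> paragraphs, int i) {
        String text = paragraphs.get(i).trim();
        if (text.length() >= 100 || !startsWithChapter(text)) {
            return false;
        }
        // Neighbours are inspected only when they exist, so the first and last
        // paragraphs cannot trigger an out-of-range lookup.
        boolean prevIsChapter = i > 0 && startsWithChapter(paragraphs.get(i - 1));
        boolean nextIsChapter = i + 1 < paragraphs.size() && startsWithChapter(paragraphs.get(i + 1));
        return !prevIsChapter && !nextIsChapter;
    }

    /**
     * Returns zero-based indexes of paragraphs that should get a page break before them.
     * existingBreaks is the number of page breaks already found in the document.
     */
    public static List<Integer> planBreaks(List<String> paragraphs, int existingBreaks, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<Integer> breaks = new ArrayList<>();
        // Step 1: the document already has enough breaks, so add none.
        if (existingBreaks > paragraphs.size() / interval) {
            return breaks;
        }
        // Step 2: break before paragraphs that look like chapter headings.
        for (int i = 0; i < paragraphs.size(); i++) {
            if (looksLikeChapterHeading(paragraphs, i)) {
                breaks.add(i);
            }
        }
        // Step 3: no headings found, so fall back to a break every `interval` paragraphs.
        if (breaks.isEmpty()) {
            for (int i = interval; i < paragraphs.size(); i += interval) {
                breaks.add(i);
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        List<String> book = Arrays.asList("Intro", "Chapter 1", "Text", "Chapter 2", "More");
        System.out.println(planBreaks(book, 0, 500));
    }
}
```

With real documents, the returned indexes would be applied with setPageBreakBefore(true) on the matching paragraphs, as the listing above does.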

tilal.ahmad · August 16, 2012, 1:48am

Hi there,

Sorry for the inconvenience. I have managed to reproduce this issue on my side using the latest version of Aspose.Words, v11.6.0, and have logged it in our bug tracking system as WORDSNET-6763. I have linked your request to this issue, and you will be notified as soon as it is resolved.
Moreover, could you please share a sample document that works fine with the PAGE_BREAK DocumentSplitCriteria?
Please feel free to contact us for any further assistance.
Best Regards,

aspose.notifier · October 6, 2012, 10:29pm

The issues you have found earlier (filed as WORDSNET-6763) have been fixed in this .NET update and this Java update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.

wrood · October 8, 2012, 5:44pm

Thanks so much. This resolves my issues from converting from a DOC file. It appears the same issue still exists for conversion from an RTF file though. Was the change only relevant to DOC files or is it just coincidence that I see it occurring for RTF and not DOC now?

awais.hafeez · October 9, 2012, 11:55am

Hi William,

Thanks for your inquiry. Yes, the change was relevant to that particular DOC file. Please attach your RTF document here for testing. I will investigate the issue on my side and provide you more information.

Best Regards,

awais.hafeez · October 11, 2012, 10:52am

Hi,

Thanks for sharing your document with us via email. I was unable to reproduce this exception using Aspose.Words v11.8.0 on my side. Could you please also share your code so that I can reproduce the issue?

Best Regards,

awais.hafeez · October 12, 2012, 4:59am

Hi,

Thanks for sharing your code with me via email. I have tested the scenario and managed to reproduce the same exception on my side. I have logged this problem as WORDSNET-7071 in our issue tracking system so that it can be fixed. We will look further into the details and keep you updated on the status of the fix. We apologize for the inconvenience.

Best Regards,

aspose.notifier · November 2, 2012, 11:13pm

The issues you have found earlier (filed as WORDSNET-7071) have been fixed in this .NET update and this Java update.
