I just converted 5,000 Word DOC files to ePub using Aspose, and for the most part the results were great, but for a few hundred files I got an XMLStreamException when executing doc.save. This happens when I use DocumentSplitCriteria=PAGE_BREAK; when I switch to HEADING_PARAGRAPH the issue does not occur. PAGE_BREAK seems to work better for making our eBooks open quickly (and is more consistent with chapter breaks). As you can see, most of the code revolves around making sure there is a reasonable number of breaks in the document.
I run the conversions in ColdFusion. I would be happy to provide source documents, but I cannot share them publicly. If you can provide me a way to send them (email?) then I will happily send examples.
Here is the part of my code that uses Aspose. (TitleData is query data.)
<cfset LicFile=CreateObject("java", "java.io.FileInputStream").init(JavaCast("string","/myhouse/java/aspose/words/Aspose.Total.Java.lic"))>
<cfset CreateObject("java", "com.aspose.words.License").setLicense(LicFile)>
<cfset doc=CreateObject("java", "com.aspose.words.Document").init(arguments.sourceDoc)>
<cfset paragraphs=doc.getChildNodes(CreateObject("java", "com.aspose.words.NodeType").PARAGRAPH, true)>
<cfset ControlChar=CreateObject("java", "com.aspose.words.ControlChar").PAGE_BREAK>
<cfif isDefined("curNode")>
<cfif curNode.getParagraphFormat().getPageBreakBefore()>
<cfif isDefined("curRun") and curRun.Text.Contains(ControlChar)>
<cfif totalOriginalPageBreaks gt (maxKeys/500)>
<cfif isDefined("curNode")>
<!---
check the previous and next node to be reasonably sure we’re not within the table of contents
also, if a normal paragraph begins with the word chapter, hopefully it is less than 100 chars
--->
<cfif
curkey lt maxKeys
AND len(trim(curNode.getText())) lt 100
AND compareNoCase(left(trim(curNode.getText()), len("chapter")), "chapter") is 0
AND compareNoCase(left(trim(paragraphs.get(curkey+1).getText()), len("chapter")), "chapter") neq 0
AND compareNoCase(left(trim(paragraphs.get(curkey-1).getText()), len("chapter")), "chapter") neq 0
>
<cfset curNode.getParagraphFormat().setPageBreakBefore(true)>
<cfset curNode.getParagraphFormat().setPageBreakBefore(false)>
<cfif isDefined("curNode")>
<cfset curNode.getParagraphFormat().setPageBreakBefore(true)>
<cfset SaveFormat=CreateObject("java", "com.aspose.words.SaveFormat")>
<cfset saveOptions=CreateObject("java", "com.aspose.words.HtmlSaveOptions").init(SaveFormat.EPUB)>
<cfset saveOptions.setDocumentSplitCriteria(CreateObject("java", "com.aspose.words.DocumentSplitCriteria").PAGE_BREAK)>
<cfset saveOptions.setDocumentSplitHeadingLevel(9)>
<cfset saveOptions.setSaveFormat(SaveFormat.EPUB)>
<cfset saveOptions.setEncoding(CreateObject("java", "java.nio.charset.Charset").forName("UTF-8"))>
<cfset doc.save(arguments.ePubPath & "/" & arguments.ePubFileName, saveOptions)>
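Since a Java sample may be easier to test, here is a minimal sketch of the chapter-heading check from the code above, restated in plain Java. It assumes the paragraph texts have already been copied into a list of strings, so it does not touch Aspose at all. The class and method names (ChapterHeuristic, isLikelyChapterHeading) are hypothetical and not part of Aspose.Words. Unlike the CFML condition, it also rejects the first paragraph so that index - 1 is always valid.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not an Aspose.Words class): mirrors the CFML
// "chapter" condition on plain paragraph strings.
final class ChapterHeuristic {
    private static final String PREFIX = "chapter";
    private static final int MAX_HEADING_LENGTH = 100;

    private ChapterHeuristic() {}

    // True when the trimmed text begins with "chapter", ignoring case.
    static boolean startsWithChapter(String text) {
        String trimmed = text.trim();
        return trimmed.regionMatches(true, 0, PREFIX, 0, PREFIX.length());
    }

    // A paragraph is treated as a chapter heading when it is short, starts
    // with "chapter", and neither neighbour does (neighbours that also start
    // with "chapter" suggest a table of contents).
    static boolean isLikelyChapterHeading(List<String> paragraphs, int index) {
        // Assumption added in this sketch: require a neighbour on both sides.
        if (index <= 0 || index >= paragraphs.size() - 1) {
            return false;
        }
        String current = paragraphs.get(index).trim();
        return current.length() < MAX_HEADING_LENGTH
                && startsWithChapter(current)
                && !startsWithChapter(paragraphs.get(index + 1))
                && !startsWithChapter(paragraphs.get(index - 1));
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList(
                "End of the previous section.", "Chapter 5", "It was a dark night.");
        System.out.println(isLikelyChapterHeading(body, 1));
    }
}
```

In the real conversion, a paragraph that passes this check would get setPageBreakBefore(true), as in the CFML.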
As a side note, I was looking at Calibre and it seems to split the documents based on byte count instead of the types of criteria that Aspose provides. It would be useful if Aspose allowed you to set a byte count and just broke the files, but did not assign them as chapters (change how the NCX and OPF is written).
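To make the byte-count idea concrete, here is a rough sketch in plain Java of what I have in mind. It is not an existing Aspose feature and is not meant to describe Calibre's actual algorithm. It assumes the document is already available as a list of HTML or text fragments, and the names (ByteCountSplitter, split, maxBytes) are made up for illustration. Fragments are grouped greedily so that each output part stays within a UTF-8 byte budget; a single fragment larger than the budget still gets a part of its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not an Aspose.Words or Calibre API): groups
// fragments into parts by UTF-8 byte size rather than by chapter.
final class ByteCountSplitter {
    private ByteCountSplitter() {}

    static List<List<String>> split(List<String> fragments, int maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentBytes = 0;
        for (String fragment : fragments) {
            int size = fragment.getBytes(StandardCharsets.UTF_8).length;
            // Close the current part when this fragment would exceed the budget.
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                parts.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(fragment);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("aaaa", "bbbb", "cc", "dddddddddd");
        System.out.println(split(fragments, 8).size());
    }
}
```

Each resulting part would become one file in the ePub, while the NCX and OPF would list only the real chapters.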
Thanks for your request. Could you please share sample Java code along with a sample document? We will test it on our end and advise you accordingly.
Sharing documents over the forum is also safe, as only the owner and Aspose staff have access to them. However, you can also share the documents via email using the Contact tab in the post.
Please feel free to contact us for any further assistance.
Best Regards,
Thanks. I have received the sample documents attached to your email. I would appreciate it if you could provide a Java code sample so that we can investigate the issue on our side.
Sorry for the inconvenience. Using the latest version of Aspose.Words, v11.6.0, I have managed to reproduce this issue on my side. I have logged it in our bug tracking system as WORDSNET-6763. I have linked your request to this issue, and you will be notified as soon as it is resolved.
Moreover, could you please share a sample document that works fine with the PAGE_BREAK DocumentSplitCriteria?
Please feel free to contact us for any further assistance.
Best Regards,
Thanks so much. This resolves my issues when converting from a DOC file. It appears the same issue still exists for conversion from an RTF file, though. Was the change only relevant to DOC files, or is it just a coincidence that I now see it occurring for RTF and not DOC?
Thanks for your inquiry. Yes, the change was relevant to that particular DOC file. Please attach your RTF document here for testing. I will investigate the issue on my side and provide you with more information.
Thanks for sharing your document with us via email. I was unable to reproduce this exception using Aspose.Words v11.8.0 on my side. Could you please also share the code so that I can reproduce the same issue on my side?
Thanks for sharing your code with me via email. I have tested the scenario and have managed to reproduce the same exception on my side. I have logged this problem as WORDSNET-7071 in our issue tracking system so that it can be corrected. We will look further into the details of this problem and keep you updated on its status. We apologize for the inconvenience.