Creating ePUB with Aspose.Words Express 1.1

I’m currently evaluating Aspose.Words Express 1.1 as part of an ebook publication workflow, converting DOCX to ePUB. Let me first congratulate you on your fine piece of software. It is one of the few converters to produce XHTML-compliant ePUB files. Most other converters don’t do that, and therefore their output does not validate.
I’ve stumbled upon a few minor issues though, that I’d like to share with you.
a) As part of the conversion process, Aspose.Words Express creates a table of contents, derived from the document structure. Additionally, it (sometimes?) creates an extra TOC element pointing to the first page, using the DOCX’s metadata title. When the DOCX’s metadata title contains non-ASCII characters, they are omitted when creating that TOC entry. Non-ASCII characters in all other TOC entries are fine.
b) Some of the ePUBs created were missing the and metatags, while others have them. I have yet to find a pattern when that happens.
c) ePUB HTML files should be split into multiple parts whenever possible, because this greatly improves the perceived speed of eReaders. This does, however, create “page breaks” in most eBook readers, so it has an effect on the layout. Is there any chance for the ePUB conversion process to use the DOCX’s paragraph format info and manual page breaks to determine where a split (and thus a page break) would reflect the layout of the DOCX?
d) Some of the XHTML files contain useless “…” tags. I have not found a pattern when those appear. It’s not a big deal, and they do not have any effect on the result, but I’m still wondering why those exist.
e) Is there any chance for Aspose.Words Express to save its settings?
Thank you for offering this amazing converter for free. Even with those issues it is far better than anything else I’ve tested.

Hi
Thank you for your interest in Aspose.Words Express.
a) Could you please attach a sample document that will allow us to reproduce the problem? We will check the issue and provide you more information.
b) It would also be great if you could attach sample documents. We will check what the difference is and let you know how to resolve it.
c) We will consider providing an option to specify such split criteria in a future version. We will let you know once this option is available.
d) This can occur because the text in your document consists of multiple Runs. This usually happens when you edit a document multiple times in MS Word. We will consider joining runs with the same formatting before converting to EPUB.
e) Thank you for your suggestion. We will consider adding such a feature in a future version.
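
To illustrate what "joining runs with the same formatting" in d) means, here is a minimal Python sketch. It is not Aspose.Words code: the `Run` class and the `join_runs` function are hypothetical stand-ins, and formatting is reduced to two flags purely for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for a Word run: a piece of text plus its formatting."""
    text: str
    bold: bool = False
    italic: bool = False


def join_runs(runs):
    """Merge neighbouring runs whose formatting flags are identical."""
    merged = []
    # groupby only groups consecutive items, which is what we want here:
    # runs that are not adjacent must stay separate.
    for (bold, italic), group in groupby(runs, key=lambda r: (r.bold, r.italic)):
        merged.append(Run("".join(r.text for r in group), bold, italic))
    return merged


# Four fragments, as Word might store them after repeated editing.
runs = [Run("Hel"), Run("lo "), Run("wor", bold=True), Run("ld", bold=True)]
print(join_runs(runs))
```

With the sample input, the four fragments collapse into two runs, one plain and one bold, so an exporter would need only one wrapping tag per formatting change.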
Best regards,

Thanks for your prompt response.
a) Test case attached (test1.docx and test1.epub, which I renamed to test1.zip, because this forum wouldn’t let me upload epubs). Notice the missing “Ü” in the first TOC entry.
b) I’ve experienced it multiple times, but have yet to come up with a reproducible scenario. Sometimes it works, and then it doesn’t. When it doesn’t work, both and are missing. When it does work, both are present.
c) Thanks.
d) Is there any way I can (manually) join those runs in MS Word prior to saving the DOCX?
e) Again, thanks.

Hi there,
Thanks for your feedback.
a) I managed to reproduce the issue on my side. I have linked your request to the appropriate issue. I also noticed that the last two headings in the TOC did include “U” but not the umlaut. I have logged this as another issue.
b) I’m afraid I can’t reproduce this issue on my side either, however it does sound very interesting. Please let us know if you are able to reproduce it so we can take a further look into it.
d) I’m afraid there is no easy way to join runs in MS Word. I will have this issue fixed in the next version of Express instead (This issue was actually already logged in our database for implementation).
I will try to implement all of your requests in the next version of Express. The next release is planned for the beginning of October.
However, the bug with the characters in the TOC will need to be fixed in Aspose.Words itself, so I’m afraid the fix for this issue may not make it into the next release.
If we can help with anything else, please feel free to ask.
Thanks,

Hi there,
Regarding the second issue I found, the one with the missing umlauts: I have taken a closer look into this, as the chance that the character was converted to “A” was slim. The internal files contain the correct characters, so the issue must be related to the way my EPUB viewer (ADE) displays the document and is not a bug in the converter. The first issue is still a bug and will be fixed as soon as possible.
Thanks,

Hi Adam.
I’m not sure how your last response relates to the bug I reported. I may have been unclear in what the bug is, exactly, so I’ll try to be more specific.
Please look at the DOCX and the ZIP I uploaded:
The DOCX has a meta-title and three levels of headers. All four have an Umlaut: Ü (that is Uuml in HTML).
Now look at the NCX inside the ZIP. It has a total of four navPoint elements: the first reflects the meta-title, the other three reflect the three headings. The first navLabel is missing the “Ü”, the other three navLabels are fine. Please compare the docTitle with the first navPoint’s navLabel to see that the first character is missing.
Just to make sure, what I’m NOT talking about: When the DOCX is converted, the meta-title is used as base for the filenames inside the ePUB-ZIP. For this purpose, all non-ascii characters are stripped, so the files inside the ePUB start with the word “ber”, instead of the original “Über”. I assume that is for compatibility’s sake, and that’s fine.
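
For clarity, the filename behaviour described above (which is not the bug) can be reproduced with a short Python sketch. The function name `strip_non_ascii` is hypothetical, and this only mimics the observed result, not how Aspose.Words Express implements it.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


print(strip_non_ascii("Über"))  # prints "ber"
```

Note that this applies to the precomposed “Ü” (a single code point). A decomposed form, “U” followed by a combining diaeresis, would keep the plain “U”.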
I’m not sure where you saw capital "U"s (without the Umlaut) in the TOC, there is not a single such character in my entire NCX. (yes, there is one, but that’s part of the DOCTYPE) There are a total of four “Ü” (capital U umlaut) in the file, one in the docTitle, and three in the properly converted navLabels.
Also I never said any character was mangled, or converted to anything else. The one I’m talking about is simply missing.
I’m using Sigil 0.4.1 to check the contents of the ePUB, and it displays the navPoints exactly as they are in the NCX, so missing one “Ü”. I haven’t tried, but I assume the bug to be visible in any eReader, as it already exists in the XML. If ADE displays "A"s, "U"s or any other nonsense, I guess it’s not capable of displaying all UTF-8 content properly, but that would then be an entirely different problem. I do not have ADE installed at the moment, so I cannot describe how it would handle the file.
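
The comparison described above (docTitle versus the first navPoint’s navLabel) can also be scripted. The following is a minimal Python sketch using only the standard library. The embedded NCX is a made-up sample that mimics the reported symptom, not the contents of test1.zip, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by NCX files in ePUB 2.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Made-up sample that mimics the reported symptom (first label lacks the "Ü").
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <docTitle><text>Über Beispiel</text></docTitle>
  <navMap>
    <navPoint id="p1" playOrder="1">
      <navLabel><text>ber Beispiel</text></navLabel>
      <content src="ber.html"/>
    </navPoint>
    <navPoint id="p2" playOrder="2">
      <navLabel><text>Überschrift 1</text></navLabel>
      <content src="ber.html#h1"/>
    </navPoint>
  </navMap>
</ncx>"""


def compare_title_and_first_label(ncx_xml):
    """Return (docTitle, first navLabel, whether they are equal)."""
    root = ET.fromstring(ncx_xml)
    title = root.findtext("ncx:docTitle/ncx:text", namespaces=NCX_NS)
    first_label = root.findtext(
        "ncx:navMap/ncx:navPoint/ncx:navLabel/ncx:text", namespaces=NCX_NS
    )
    return title, first_label, title == first_label


print(compare_title_and_first_label(SAMPLE_NCX))
```

For a real file, the NCX would first have to be read out of the ePUB (a ZIP archive), for example with Python’s `zipfile` module.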

I’ve just run into an additional little bug.
Aspose.Words Express 1.1 currently exports this as part of the metadata in the OPF:
Aspose
In short, every ePUB produced will have the same identifier: “Aspose”. According to the ePUB specs, the OPF file has to contain a unique identifier. The term “unique” in this case is meant across books, so every ePUB created should have a different identifier. (Source: http://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.1)
One way to achieve this would be to include the ISBN as unique identifier, but since that isn’t always available, I’d suggest creating a UUID for that purpose, as outlined here: http://www.ibm.com/developerworks/xml/tutorials/x-epubtut/section3.html The result would look similar to this:
urn:uuid:0cc33cbd-94e2-49c1-909a-72ae16bc2658
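
As a rough illustration of the suggestion above, a random (version 4) UUID in URN form can be generated with a few lines of Python. The function names and the `BookId` element id are hypothetical and chosen only for this sketch. In OPF 2.0, the `id` of the `dc:identifier` element is expected to match the `unique-identifier` attribute of the `package` element.

```python
import uuid


def new_epub_identifier():
    """Return a fresh random identifier in URN form, e.g. urn:uuid:..."""
    return f"urn:uuid:{uuid.uuid4()}"


def identifier_element(identifier, element_id="BookId"):
    """Format the identifier as an OPF metadata element (element_id is an assumption)."""
    return f'<dc:identifier id="{element_id}">{identifier}</dc:identifier>'


identifier = new_epub_identifier()
print(identifier_element(identifier))
```

Because the UUID is random, every call yields a different value, which is what the “unique across books” requirement asks for.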
By the way, should I keep posting issues like this in this thread or should I create a new thread for each? Creating new threads feels a bit like spamming, so I stuck to this thread this time.

Hi there,
Thanks for your inquiry.
Yes we are talking about the same bug - the missing character from the first navigation heading. The second issue I brought up was what I thought was a bug, but upon further investigation I found the source files were correct and it was just ADE that was not displaying things properly. Please see the screenshot below.
https://forum.aspose.com/c/words/8
Regarding the unique identifier, I see what you mean. Thank you for reporting this to us. I have logged a request for this as well and passed on the details. Please feel free to keep posting issues in this thread if it suits you.
Thanks,

I just installed ADE, and can confirm your observation. This appears to be an ADE bug, but if it is, it’s a rather weird one: I tried creating a similar DOCX, this time with lots of non-ASCII characters, including Cyrillic and Thai characters. The layout was similar, with the same characters repeated as title, h1, h2 and h3. Converting that document to ePUB and displaying it in ADE showed all of them unaltered.
I’ve come across another limitation / bug in Aspose.Words Express 1.1:
If you add multiple authors to the DOCX’s metadata, separating them by semicolon (not sure if all localized Office versions want a semicolon there), they are converted into a single tag:
John Doe; Jane Doe
According to the Dublin Core standard, there should be multiple separate tags, like this:
John Doe
Jane Doe
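
A minimal Python sketch of the splitting suggested above, assuming the tag in question is the Dublin Core `dc:creator` element and that the semicolon is the separator (as noted, this may differ between localized Office versions). The function name `creator_elements` is hypothetical.

```python
from xml.sax.saxutils import escape


def creator_elements(author_field, separator=";"):
    """Turn a combined author field into one dc:creator element per author."""
    names = [name.strip() for name in author_field.split(separator)]
    # Skip empty entries (e.g. from a trailing separator) and escape XML specials.
    return [f"<dc:creator>{escape(name)}</dc:creator>" for name in names if name]


print("\n".join(creator_elements("John Doe; Jane Doe")))
```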
Additionally, I’d like to comment on the DOCX’s “company” meta information, which is converted to the “publisher” tag. While this may be okay, please note that Word has a separate metadata field for the publisher. It may be more appropriate to use that instead, although it’s hidden rather deeply in Word’s user interface.

Hi Jörn,
Thank you for your suggestions. I have logged them into our defect database. We will consider making these improvements in a future version of Aspose.Words.
Best regards,

The issues you have found earlier (filed as WORDSNET-4815) have been fixed in this .NET update and in this Java update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.

The issues you have found earlier (filed as WORDSNET-5217;WORDSNET-5237) have been fixed in this .NET update and in this Java update.


I see in this thread that an issue was logged regarding unique identifiers for created ePUBs. While I’m glad that issue was resolved, unfortunately, I believe it was fixed in a way that makes it very difficult for us to use this library.

We very much prefer to be able to specify what the unique identifier for the ePUB should be, rather than having it generated by the library. In the absence of one, it makes sense to generate a default, but we should have a way to specify the id for the document.

One alternative would be to provide a way to retrieve the identifier that was generated, perhaps as a return value from Document.Save. However, allowing us to provide an id to use is definitely preferable.

We are currently evaluating whether Aspose.Words will meet our needs. Please let me know if there is any way to specify the unique identifier or to easily retrieve it after creation. Thank you.

Hi Shawn,

Thanks for your query. Unfortunately, the requested feature is not available in Aspose.Words at the moment. However, I have logged this feature request as WORDSNET-6535 in our issue tracking system. You will be notified via this forum thread once this feature is available.

We apologize for the inconvenience.

Hi Shawn,

Thanks for your inquiry.

Just in case you didn’t see the updated message, we have taken another look into your request and we think it’s a good idea to implement such a feature in a future version.

Once implemented, it will allow you to specify the UID used for ePUB export. We will keep you informed of any developments.

Thanks,

The issues you have found earlier (filed as WORDSNET-5212) have been fixed in this Aspose.Words for .NET 18.5 update and this Aspose.Words for Java 18.5 update.