I am currently evaluating Aspose.Words as our standard DOC-to-HTML conversion tool. I am using the latest evaluation version (Java) for testing.
I am attaching a DOC file and a link to my converted document. Here is the list of issues we faced:
Section numbers (3.1, 3.1.1) don’t show up.
Images are not scaled properly, and the border around the first image doesn’t render properly.
Any help in figuring out what we are doing wrong is greatly appreciated.
Multilevel list numbering is exported to HTML with some restrictions. It is impossible to set labels such as 1.2.3 on list items natively in HTML. MS Word exports such lists as non-lists, using simple paragraphs and spans. This is not good because that approach loses the document structure. When converting to HTML we try to produce output close to what MS Word produces, but some MS Word features don’t map directly to “non-native” formats (HTML, PDF). This is a known issue, #3701. We have some ideas for improving this, but not in the near future. As a workaround, you can use plain text followed by a tab in place of list labels.
The issue with images is also known. They are placed in floating canvases, but floating shapes are not well supported in our HTML conversion. This was logged as #4488. As a workaround, you can place images inline or, if you need a more complex layout, put them in tables with invisible borders. I see that the boxes below the images are their captions; this can be achieved with a table-driven design.
Please let me know if these workarounds are suitable for you. I can also help you refactor this document to show what I’m suggesting.
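To illustrate the table-driven layout, here is a minimal Java sketch of the kind of HTML one might hope to end up with: an image in one cell and its caption in the cell below, with no visible borders. The class and method names (CaptionTableSketch, imageWithCaption) are hypothetical and are not part of Aspose.Words; the exact markup that Aspose.Words emits for a borderless table may well differ.

```java
// Hypothetical illustration only: builds the target HTML by hand,
// it does not call Aspose.Words.
class CaptionTableSketch {

    // Escape characters that are special in HTML text and attribute values.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // One borderless table: the image in the first row, the caption in the second.
    static String imageWithCaption(String imageUrl, String caption) {
        return "<table style=\"border-collapse:collapse; border:none\">\n"
             + "  <tr><td style=\"border:none\"><img src=\"" + escape(imageUrl)
             + "\" alt=\"" + escape(caption) + "\"></td></tr>\n"
             + "  <tr><td style=\"border:none; text-align:center\">"
             + escape(caption) + "</td></tr>\n"
             + "</table>";
    }

    public static void main(String[] args) {
        System.out.println(imageWithCaption("fig1.png", "Figure 1: Overview"));
    }
}
```

Because the image and its caption sit in ordinary table cells, nothing in this layout depends on floating shapes.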
Neither of these issues is necessarily a showstopper for us, but we would love to see some kind of resolution in Aspose. The biggest issue is that we don’t control our documents: our product will OEM Aspose, and hence we cannot (and do not want to) put restrictions on the type of documents our clients use.
We will continue evaluating the product. Is there a document or page listing all known issues with Word-to-HTML conversion? It might help us set some guidelines for our clients.
Thank you for the clarification. I see the problem. If you don’t control the input documents, we cannot fully avoid issues with them. Here is a spreadsheet showing the level of import and export support for HTML and PDF (link below). However, it only states whether something is supported or partially supported, with minimal detail. We plan to improve this part of the documentation in the future. A better way is to experiment with real documents and discuss particular differences here in the forum. https://downloads.aspose.com/words/net
We maintain a defect database internally, and these numbers are the issue identifiers. We usually provide information on particular issues, but we cannot grant access to this database. Some of the information there is internal or not written in a user-friendly way, and it would be difficult to maintain a publicly accessible database. So it’s better to ask the support staff for the relevant information.
#3701 has high priority (major) since we have collected many requests about these list labels. Currently Aspose.Words relies on HTML ordered lists, so only the last-level number appears in list items: “1.2.3.” becomes simply “3.” We hope to resolve this by the end of 2008. The idea is to output lists in another way, or to make the approach configurable. All of this should also round-trip properly (DOC->HTML->DOC).
#4488 is a lower-priority issue (minor). Floating content is not typical for HTML, though it can be used, and supporting it would apparently require much effort. So I cannot estimate any timeframe, even for myself.
If you plan to use Aspose.Words as an OEM, we can also help with issues and documents coming from your customers. We consider all reported cases and think about improvements.
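To make the difference concrete, here is a small Java sketch, independent of Aspose.Words, that computes full multilevel labels from the nesting level of each list item and shows what remains when only the last-level number is kept. The names ListLabelSketch, fullLabels and lastLevelOnly are hypothetical, and the sketch assumes plain decimal numbering that starts at 1 on every level, which real Word lists need not follow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Aspose.Words.
class ListLabelSketch {

    // levels: zero-based nesting level of each list item, in document order.
    // Assumes decimal numbering starting at 1 on every level.
    static List<String> fullLabels(int[] levels, int maxLevels) {
        int[] counters = new int[maxLevels];
        List<String> labels = new ArrayList<>();
        for (int level : levels) {
            counters[level]++;
            // A new item on this level restarts all deeper levels.
            for (int deeper = level + 1; deeper < maxLevels; deeper++) {
                counters[deeper] = 0;
            }
            StringBuilder label = new StringBuilder();
            for (int i = 0; i <= level; i++) {
                label.append(counters[i]).append('.');
            }
            labels.add(label.toString());
        }
        return labels;
    }

    // What a plain HTML ordered list displays: only the last number.
    static String lastLevelOnly(String fullLabel) {
        String trimmed = fullLabel.substring(0, fullLabel.length() - 1);
        return trimmed.substring(trimmed.lastIndexOf('.') + 1) + ".";
    }

    public static void main(String[] args) {
        int[] levels = {0, 1, 2, 2, 1, 0};
        for (String label : fullLabels(levels, 9)) {
            System.out.println(label + "\tordered list shows: " + lastLevelOnly(label));
        }
    }
}
```

For the levels 0, 1, 2, 2, 1, 0 the full labels are 1., 1.1., 1.1.1., 1.1.2., 1.2. and 2., while the last-level view collapses them to 1., 1., 1., 2., 2. and 2.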
Currently, issue #3701 is a showstopper for us in adopting Aspose.Words for a broader user base. What could we do to make it a higher priority for you to fix?
As an alternative, could you at least provide a solution so that I can fix it myself while waiting for your fix? I wouldn’t mind doing some post-processing to fix the numbering in the HTML.
Thank you for more information. Please see the following code:
private void ReplaceListLabels(Document doc)
{
    // Get the collection of paragraphs from the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
    ListLabelsExtractor extractor = new ListLabelsExtractor(doc.Lists);
    // Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        if (par.IsListItem)
        {
            // Get the label of the list item
            string label = extractor.LabelLists[par.ListFormat.List].GetListLabel(par.ListFormat.ListLevelNumber) + "\t";
            // Create a run that will represent the label in the document
            Run labelRun = new Run(doc, label);
            // Copy the list level indents to the paragraph
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;
            // Remove the list label
            par.ListFormat.RemoveNumbers();
            // Insert the label at the beginning of the paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}
The ListLabelsExtractor class is attached.
List labels now come out correctly in HTML, but there are still problems with indents.
Thanks for the solution. Will it also work with the bullet numbers generated by a table of contents? I’m a little concerned about whether the table of contents links will still work if I change the target bullet numbers to text.
Thanks for your request. It would be great if you could attach such a document for testing. I will test it on my side and modify the code if needed, or provide you with more information.
I found the ListLabelsExtractor class after I signed in. I tested your code on a simple doc, and it worked OK. Then I tested it on a slightly more complicated doc, and it failed: the generated bullet numbering didn’t reset. You can try it using the attached doc (testtoc2.doc).
Thank you for the additional information. I modified the code. Please try using the following:
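One way numbering can fail to reset in a workaround like this is when counter state is shared between separate lists, or when deeper levels are not cleared as a higher level advances. Whether that is what happened with testtoc2.doc is only a guess. The Java sketch below shows the general idea of keeping one set of level counters per list; the class PerListCountersSketch and the integer list IDs are hypothetical and are not the attached ListLabelsExtractor.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the attached ListLabelsExtractor.
class PerListCountersSketch {

    // Word lists have up to nine levels.
    private static final int MAX_LEVELS = 9;

    // One independent counter array per list, keyed by an assumed list ID.
    private final Map<Integer, int[]> countersByList = new HashMap<>();

    // Returns the next label for an item of the given list and zero-based level.
    String nextLabel(int listId, int level) {
        int[] counters = countersByList.computeIfAbsent(listId, id -> new int[MAX_LEVELS]);
        counters[level]++;
        // Clear the deeper levels so they restart under the new parent item.
        Arrays.fill(counters, level + 1, MAX_LEVELS, 0);
        StringBuilder label = new StringBuilder();
        for (int i = 0; i <= level; i++) {
            label.append(counters[i]).append('.');
        }
        return label.toString();
    }

    public static void main(String[] args) {
        PerListCountersSketch tracker = new PerListCountersSketch();
        System.out.println(tracker.nextLabel(1, 0)); // first list, top level
        System.out.println(tracker.nextLabel(1, 1)); // first list, second level
        System.out.println(tracker.nextLabel(2, 0)); // second list starts over
        System.out.println(tracker.nextLabel(1, 0)); // first list continues
    }
}
```

Keying the counters by list means that a second list starts again at 1. while the first list can still continue where it left off.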
private void ReplaceListLabels(Document doc)
{
    // Get the collection of paragraphs from the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
    // Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        if (par.IsListItem)
        {
            ListLabelsExtractor extractor = ListLabelsExtractor.GetLabelExtractor(par.ListFormat.List);
            // Get the label of the list item
            string label = extractor.GetListLabel(par.ListFormat.ListLevelNumber) + "\t";
            // Create a run that will represent the label in the document
            Run labelRun = new Run(doc, label);
            // Copy the list level indents to the paragraph
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;
            // Debug output: the label followed by the paragraph text
            Console.WriteLine(label + "\t" + par.ToTxt());
            // Remove the list label
            par.ListFormat.RemoveNumbers();
            // Insert the label at the beginning of the paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}
The ListLabelsExtractor class is attached. Please let me know in case of any issues.
Thank you for the additional information. I spent some time on this, and the code now works correctly with all your documents and with some other test documents. Please check it on your side and let me know in case of any issues.
I tried the new ListLabelsExtractor class. But I got the same result. Do I need to change anything in ReplaceListLabels? I attached the generated HTML file from testtoc3.doc.
private void ReplaceListLabels(Document doc)
{
    // Get the collection of paragraphs from the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
    // Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        // Skip list items that have no child nodes
        if (par.IsListItem && par.HasChildNodes)
        {
            ListLabelsExtractor extractor = ListLabelsExtractor.GetLabelExtractor(par.ListFormat.List);
            // Get the label of the list item
            string label = extractor.GetListLabel(par.ListFormat.ListLevelNumber) + "\t";
            // Create a run that will represent the label in the document
            Run labelRun = new Run(doc, label);
            // Copy the list level indents to the paragraph
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;
            // Debug output: the label followed by the paragraph text
            Console.WriteLine(label + "\t" + par.ToTxt());
            // Remove the list label
            par.ListFormat.RemoveNumbers();
            // Insert the label at the beginning of the paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}