Free Support Forum - aspose.com

Scale Images, Show Section Numbers during DOC to HTML Conversion using Java API

Hi,

I am currently evaluating Aspose.Words as our standard DOC to HTML conversion application. I am using the latest evaluation version (Java) for testing.

I am attaching a doc and a link to my converted document. Here is the list of issues we faced:

1. Section numbers (3.1, 3.1.1) don’t show up.
2. Images are not scaled properly, and the border around the first image doesn’t render properly.

Any help in figuring out what we are doing wrong is greatly appreciated.

Here is the link to the converted document

https://70.91.117.53/hylite/webpage/out-doc/ItalianCharacters.doc/ItalianCharacters.doc.html

Thanks
Kamal

Hello!

Thank you for your inquiry.

I have reproduced both of these issues.

1. Multilevel list numbering is output to HTML with some restrictions. It is impossible to set 1.2.3 as list labels natively in HTML. MS Word exports such lists as non-lists, using simple paragraphs and spans. This is not good because that approach loses document structure. When converting to HTML, we try to produce output close to what MS Word produces, but some MS Word features don’t map directly to “non-native” formats (HTML, PDF). This is known issue #3701. We have some ideas for improving this, but not in the near future. As a workaround, you can use plain text followed by a tab in place of list labels.

2. The issue with images is also known. They are placed in floating canvases, but floating shapes are not well supported in our HTML conversion. This was logged as #4488. As a workaround, you can place images inline or, if you need a more complex layout, put them in tables with invisible borders. I see that the boxes below the images are their captions. This could be achieved with a table-driven design.

Please let me know if these workarounds are suitable for you. I can also help you refactor this document to show what I’m suggesting.

Regards,

Hello,

Though neither of these issues may be a showstopper for us, we would love to see some kind of resolution in Aspose. The biggest issue with our documents is that we don’t control them: our product will OEM Aspose, and hence we cannot (and do not want to) put restrictions on our clients regarding the type of documents.

We will continue evaluating the product. Is there a document or page where we can find all known issues with Word to HTML conversion? This might help us set some guidelines for our clients.

Thank you for the clarification. I see the problem. If you don’t control the input documents, we cannot fully avoid issues with them. Here is a spreadsheet showing the level of import and export support for HTML and PDF (link below). But it states only whether something is supported or partially supported, with minimal detail. We plan to improve this part of the documentation in the future. The better way is to experiment with real documents and discuss particular differences here in the forum.

http://www.aspose.com/community/files/51/file-format-components/aspose.words/entry108980.aspx

We maintain a defect database internally. These numbers are the issue identifiers. We usually provide information on particular issues, but we cannot grant access to this database. Some of the information there is internal or is not written in a user-friendly manner, and it would be difficult to maintain a publicly accessible database. So it’s better to ask the support team for the relevant information.

#3701 has high priority (major) since we have collected many requests about these list labels. Currently Aspose.Words relies on HTML ordered lists, and only the last level number appears in list items. So “1.2.3.” becomes simply “3.” Hopefully we’ll get it right by the end of 2008. The idea is to output lists another way, or to make these approaches configurable. All this should also round-trip properly (DOC->HTML->DOC).

#4488 has lower priority (minor). Floating content is not typical for HTML, though it can be used. But it seemingly requires a lot of effort, so I cannot give any timeframe.

If you plan to use Aspose.Words in an OEM product, we can also help with issues and documents coming from your customers. We consider all reported cases and think about improvements.

Best regards,

Any update on issue #3701? It doesn't seem to be fixed in the 6.0 release. Any ETA?

Thanks,

-Bin

Hi

Thanks for your request. Unfortunately, this issue is still unresolved. Currently I can’t provide an estimate.

As a workaround, you can try replacing list numbering with plain text.

Best regards.

Currently, issue #3701 is a showstopper for us in adopting Aspose.Words for a broader user base. What could we do to make it a higher priority for you to fix?

As an alternative, could you at least provide me with a solution so that I can fix it myself while waiting for your fix? I wouldn't mind doing some post-processing to fix the numbering in the HTML.

Thanks,

-Bin
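One possible post-processing route for the request above is sketched below. It assumes the exported HTML represents multilevel lists as properly nested `<ol>` elements, which has not been verified against Aspose.Words output, and it relies on standard CSS counters rather than on any Aspose API. The class name `CssCounterInjector` and the counter name `item` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: injects a CSS rule set so that nested <ol> items
// are labelled 1, 1.1, 1.2, ... by the browser instead of 1, 1, 2, ...
class CssCounterInjector {
    // Standard CSS counters: each <ol> starts a new "item" counter,
    // and counters(item, ".") joins the counters of all enclosing lists.
    static final String STYLE = "<style>\n"
        + "ol { counter-reset: item; }\n"
        + "ol > li { display: block; }\n"
        + "ol > li:before { content: counters(item, \".\") \" \"; counter-increment: item; }\n"
        + "</style>\n";

    private static final Pattern HEAD_CLOSE =
        Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);

    // Inserts the style block just before </head>; if the document has
    // no closing head tag, the style block is prepended instead.
    static String inject(String html) {
        Matcher m = HEAD_CLOSE.matcher(html);
        if (!m.find()) {
            return STYLE + html;
        }
        return html.substring(0, m.start()) + STYLE + html.substring(m.start());
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title></head><body>"
            + "<ol><li>Line 1<ol><li>Line 2</li><li>Line 3</li></ol></li></ol>"
            + "</body></html>";
        System.out.println(inject(html));
    }
}
```

This only changes how a browser renders the labels; the numbers are still not part of the document text, so it would not help if the HTML is converted back to DOC or consumed as plain text.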

Hi

Thanks for your request. I will try to create code to work around this problem. Please expect a reply before Monday.

Best regards.

Hi

Could you please attach, for testing, more of the documents that you need to convert to HTML?

Best regards.

Can you try the attached document? The 2nd level numbering should look like the following:

1.1 Line 2

1.2 Line 3

But after conversion, they became:

1. Line 2

2. Line 3

Thanks,

-Bin

Hi

Thank you for more information. Please see the following code:

public void Test023()
{
    // Open the document
    Document doc = new Document(@"Test023\TestNumbering.doc");

    // Replace list labels with plain text
    ReplaceListLabels(doc);

    // Save the output document
    doc.Save(@"Test023\out.html");
}

private void ReplaceListLabels(Document doc)
{
    // Get the collection of all paragraphs in the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
    ListLabelsExtractor extractor = new ListLabelsExtractor(doc.Lists);

    // Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        if (par.IsListItem)
        {
            // Get the label of the list item, followed by a tab
            string label = extractor.LabelLists[par.ListFormat.List].GetListLabel(par.ListFormat.ListLevelNumber) + "\t";

            // Create the run that will represent the label in the document
            Run labelRun = new Run(doc, label);

            // Copy the list level indents to the paragraph before the numbering is removed
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;

            // Remove the list label
            par.ListFormat.RemoveNumbers();

            // Insert the label at the beginning of the paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}

The ListLabelsExtractor class is attached.

List labels come out correctly in HTML, but there are still problems with indents.

Hope this could be useful for you.

Best regards.
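The indent problems mentioned above may come from how the two list level positions are copied to the paragraph. The sketch below assumes, as the Aspose.Words API reference describes it, that NumberPosition corresponds to LeftIndent plus FirstLineIndent and that TextPosition corresponds to LeftIndent. Under that assumption the first-line indent is the difference of the two values, not NumberPosition itself. It is plain arithmetic in Java and calls no Aspose API; the class and method names are hypothetical.

```java
// Hypothetical helper: derives paragraph indents (in points) from the
// number and text positions of a list level.
class HangingIndent {
    // Returns { leftIndent, firstLineIndent }.
    // Assumption: numberPosition = leftIndent + firstLineIndent
    //             textPosition   = leftIndent
    static double[] fromListLevel(double numberPosition, double textPosition) {
        double leftIndent = textPosition;
        // Negative when the label hangs to the left of the wrapped text.
        double firstLineIndent = numberPosition - textPosition;
        return new double[] { leftIndent, firstLineIndent };
    }

    public static void main(String[] args) {
        // Example values: label at 18 pt, wrapped text at 36 pt.
        double[] indents = fromListLevel(18.0, 36.0);
        System.out.println("LeftIndent = " + indents[0]);      // 36.0
        System.out.println("FirstLineIndent = " + indents[1]); // -18.0
    }
}
```

Whether this accounts for all of the indent differences in the HTML output is not established in this thread.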

Thanks for the solution. Will it also work with the bullet numbers generated by a table of contents? I'm a little concerned about whether the table of contents links will still work if I change the target bullet numbers to text.

Thanks,

-Bin

Hi

Thanks for your request. It would be great if you could attach such a document for testing. I will test it on my side and, if needed, modify the code or provide you with more information.

Best regards.

Can you test the attached document?

Also, I was going to try your code, but I found that the ListLabelsExtractor class was not attached.

Thanks,

-Bin

I found the ListLabelsExtractor class after I signed in. I tested your code on a simple doc, and it worked OK. Then I tested it on a slightly more complicated doc, and it failed: the generated bullet numbering didn't get reset. You can try it using the attached doc (testtoc2.doc).

Thanks,

-Bin
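The reset behaviour described above can be illustrated with a minimal sketch of multilevel label generation. It is not the attached ListLabelsExtractor class, whose source is not shown in this thread. It assumes decimal numbering that starts at 1 on every level and labels joined with dots; the class name `MultiLevelNumbering` is hypothetical.

```java
// Hypothetical sketch of multilevel numbering: one counter per level,
// with deeper counters reset whenever a shallower level advances.
class MultiLevelNumbering {
    // Word lists have at most nine levels.
    private final int[] counters = new int[9];

    // Advances the counter at the given 0-based level and returns the
    // full label, for example "1.2" for the second item on level 1.
    String next(int level) {
        counters[level]++;
        // Reset all deeper levels so their numbering restarts at 1.
        for (int i = level + 1; i < counters.length; i++) {
            counters[i] = 0;
        }
        StringBuilder label = new StringBuilder();
        for (int i = 0; i <= level; i++) {
            // Assumption: a parent level that was skipped counts as 1.
            if (counters[i] == 0) {
                counters[i] = 1;
            }
            if (i > 0) {
                label.append('.');
            }
            label.append(counters[i]);
        }
        return label.toString();
    }

    public static void main(String[] args) {
        MultiLevelNumbering numbering = new MultiLevelNumbering();
        int[] levels = { 0, 1, 1, 0, 1 };
        for (int level : levels) {
            System.out.println(numbering.next(level));
        }
        // Prints 1, 1.1, 1.2, 2, 2.1 on separate lines.
    }
}
```

Without the reset loop, the last item would be labelled 2.3 instead of 2.1, which matches the symptom reported for testtoc2.doc.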

Hi

Thank you for the additional information. I modified the code. Please try the following:

public void Test023()
{
    // Open the document
    Document doc = new Document(@"Test023\number_error.doc");

    // Replace list labels with plain text
    ReplaceListLabels(doc);

    // Save the output document
    doc.Save(@"Test023\out.html");
}

private void ReplaceListLabels(Document doc)
{
    // Get the collection of all paragraphs in the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

    // Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        if (par.IsListItem)
        {
            // Get the label extractor for the list this paragraph belongs to
            ListLabelsExtractor extractor = ListLabelsExtractor.GetLabelExtractor(par.ListFormat.List);

            // Get the label of the list item, followed by a tab
            string label = extractor.GetListLabel(par.ListFormat.ListLevelNumber) + "\t";

            // Create the run that will represent the label in the document
            Run labelRun = new Run(doc, label);

            // Copy the list level indents to the paragraph before the numbering is removed
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;

            // Debug output: the label and the paragraph text
            Console.WriteLine(label + "\t" + par.ToTxt());

            // Remove the list label
            par.ListFormat.RemoveNumbers();

            // Insert the label at the beginning of the paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}

The ListLabelsExtractor class is attached. Please let me know if there are any issues.

Best regards.

It's getting better, but I still found a problem: the numbering is increased twice for some headings. Can you try the attached doc?

Thanks,

-Bin

Hi

Thank you for the additional information. I spent some time on this, and now the code works correctly with all your documents and with some other test documents. Please check it on your side and let me know if there are any issues.

Best regards.

I tried the new ListLabelsExtractor class. But I got the same result. Do I need to change anything in ReplaceListLabels? I attached the generated HTML file from testtoc3.doc.

Thanks,

-Bin

Hi

Sorry. I missed this. Please try this code:

private void ReplaceListLabels(Document doc)
{
    // Get the collection of all paragraphs in the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

    // Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        // Skip list items that have no child nodes (empty paragraphs)
        if (par.IsListItem && par.HasChildNodes)
        {
            // Get the label extractor for the list this paragraph belongs to
            ListLabelsExtractor extractor = ListLabelsExtractor.GetLabelExtractor(par.ListFormat.List);

            // Get the label of the list item, followed by a tab
            string label = extractor.GetListLabel(par.ListFormat.ListLevelNumber) + "\t";

            // Create the run that will represent the label in the document
            Run labelRun = new Run(doc, label);

            // Copy the list level indents to the paragraph before the numbering is removed
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;

            // Debug output: the label and the paragraph text
            Console.WriteLine(label + "\t" + par.ToTxt());

            // Remove the list label
            par.ListFormat.RemoveNumbers();

            // Insert the label at the beginning of the paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}

Best regards.