Scale Images, Show Section Numbers during DOC to HTML Conversion using Java API

mksatish · June 23, 2008, 12:56pm

Hi,

I am currently evaluating Aspose.Words as our standard DOC to HTML conversion tool. I am using the latest evaluation version (Java) for testing.

I am attaching a doc and the link to my converted document. Here is the list of issues we faced:

Section numbers (3.1, 3.1.1) don’t show up.
Images are not scaled properly, and the border around the first image is not rendered properly.

Any help in figuring out what we are doing wrong is greatly appreciated.

Here is the link to the converted document

Thanks
Kamal

Klepus · June 24, 2008, 6:15am

Hello!

Thank you for your inquiry.

I have reproduced both of these issues.

Multilevel list numbering is output to HTML with some restrictions. HTML has no native way to use labels such as 1.2.3 for list items. MS Word exports such lists as non-lists, using simple paragraphs and spans. This is not good because that approach loses the document structure. When converting to HTML we try to produce output close to what MS Word produces, but some MS Word features don’t map directly to “non-native” formats (HTML, PDF). This is known issue #3701. We have some ideas for improving this, but not in the near future. As a workaround, you can replace the list labels with plain text followed by a tab.
The issue with images is also known. The images are placed in floating canvases, and floating shapes are not well supported in our HTML conversion. This was logged as #4488. As a workaround, you can place the images inline or, if you need a more complex layout, put them in tables with invisible borders. I see that the boxes below the images are their captions; this could be achieved with a table-driven design.

Please let me know if these workarounds are suitable for you. I can also help you refactor this document to show what I’m suggesting.

Regards,

mksatish · June 24, 2008, 12:05pm

Hello,

Though neither of these issues may be a showstopper for us, we would love to see some kind of resolution in Aspose. The biggest issue with our documents is that we don’t control them: our product will OEM Aspose, and hence we cannot (and do not want to) put restrictions on our clients regarding the type of documents.

We will continue evaluating the product. Is there a document or page where we can find all known issues with Word to HTML conversion? This might help us set some guidelines for our clients.

Klepus · June 24, 2008, 1:59pm

Thank you for the clarification. I see the problem. If you don’t control the input documents, we cannot fully avoid issues with them. Here is a spreadsheet showing the level of support for import and export to HTML and PDF (link below). However, it states only whether something is supported or partially supported, with minimal detail. We plan to improve this part of the documentation in the future. A better way is to experiment with real documents and discuss particular differences here in the forum.
https://downloads.aspose.com/words/net

We maintain a defect database internally; these numbers are the issue identifiers. We usually provide information on particular issues, but we cannot grant access to this database. Some of the information there is internal or is not written in a user-friendly manner, and it would be difficult to maintain a publicly accessible database. So it’s better to ask the support team for the relevant information.

#3701 has high priority (major) since we have collected many requests about these list labels. Currently Aspose.Words relies on HTML ordered lists, and only the last level number appears in list items, so “1.2.3.” becomes simply “3.” Hopefully we’ll get this right by the end of 2008. The idea is to output lists in another way, or to make the approach configurable. All of this should also round-trip properly (DOC->HTML->DOC).
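
To make the difference concrete, here is a minimal Java sketch. It is not Aspose.Words code: the class `ListLabelDemo` and its methods are hypothetical names used only for illustration. It assumes the per-level counters of a list item are already known, and compares the label a plain HTML ordered list would show with the full multilevel label Word displays.

```java
import java.util.StringJoiner;

// Hypothetical illustration only; not part of the Aspose.Words API.
final class ListLabelDemo {

    // What a plain HTML <ol> shows: only the counter of the item's own level.
    // Assumes a non-empty array of counters, outermost level first.
    static String lastLevelLabel(int[] counters) {
        return counters[counters.length - 1] + ".";
    }

    // What Word shows for a multilevel list: all counters joined with dots.
    static String fullLabel(int[] counters) {
        StringJoiner joiner = new StringJoiner(".", "", ".");
        for (int counter : counters) {
            joiner.add(Integer.toString(counter));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        int[] counters = {1, 2, 3};
        System.out.println(lastLevelLabel(counters)); // prints "3."
        System.out.println(fullLabel(counters));      // prints "1.2.3."
    }
}
```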

#4488 is a lower-priority issue (minor). Floating content is not typical for HTML, though it can be used. But it seems to require a lot of effort, so I cannot estimate a timeframe, even for myself.

If you plan to use Aspose.Words as an OEM, we can also help with issues and documents coming from your customers. We consider all reported cases and think about improvements.

Best regards,

datahsd · December 2, 2008, 1:06pm

Any update on issue #3701? It doesn’t seem to be fixed in the 6.0 release. Any ETA?

Thanks,

-Bin

alexey.noskov · December 2, 2008, 3:02pm

Hi

Thanks for your request. Unfortunately this issue is still unresolved. Currently I can’t provide you any estimate.

As a workaround, you can try replacing the list numbering with plain text.

Best regards.

datahsd · December 5, 2008, 1:18pm

Currently, issue #3701 is a showstopper for us in adopting Aspose.Words for a broader user base. What could we do to make it a higher priority for you to fix?

As an alternative, could you at least provide a solution so that I can fix it myself while waiting for your fix? I wouldn’t mind doing some post-processing to fix the numbering in the HTML.

Thanks,

-Bin

alexey.noskov · December 5, 2008, 3:57pm

Hi

Thanks for your request. I will try to create code to work around this problem. Please expect a reply before Monday.

Best regards.

alexey.noskov · December 8, 2008, 6:58am

Hi

Could you please attach more of the documents that you need to convert to HTML, for testing?

Best regards.

datahsd · December 8, 2008, 4:12pm

Can you try the attached document? The 2nd level numbering should look like the following:

1.1 Line 2
1.2 Line 3

But after conversion, they became:

1. Line 2
2. Line 3

Thanks,

-Bin

alexey.noskov · December 9, 2008, 3:22am

Hi

Thank you for more information. Please see the following code:

private void ReplaceListLabels(Document doc)
{
    //Get collection of Paragraphs from the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
    ListLabelsExtractor extractor = new ListLabelsExtractor(doc.Lists);

    //Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        if (par.IsListItem)
        {
            //Get label of list item
            string label = extractor.LabelLists[par.ListFormat.List].GetListLabel(par.ListFormat.ListLevelNumber) + "\t";
            //Create run that will represent label in the document
            Run labelRun = new Run(doc, label);

            //We should import paragraph indents
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;

            //Remove list label
            par.ListFormat.RemoveNumbers();
            //Insert label at the beginning of paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}

The ListLabelsExtractor class is attached.

List labels now come out correctly in HTML, but there are still problems with indents.

Hope this could be useful for you.

Best regards.

datahsd · December 11, 2008, 2:39pm

Thanks for the solution. Will it also work with the numbers generated by a table of contents? I’m a little concerned about whether the table of contents links will still work if I change the target numbers to text.

Thanks,

-Bin

alexey.noskov · December 11, 2008, 3:21pm

Hi

Thanks for your request. It would be great if you could attach such a document for testing. I will test it on my side and either modify the code if needed or provide you with more information.

Best regards.

datahsd · December 15, 2008, 12:39am

Can you test the attached document?

Also, I was going to try your code, but I found that the ListLabelsExtractor class was not attached.

Thanks,

-Bin

datahsd · December 15, 2008, 1:14am

I found the ListLabelsExtractor class after I signed in. I tested your code on a simple doc, and it worked OK. Then I tested it on a slightly more complicated doc, and it failed: the generated numbering didn’t get reset. You can try it using the attached doc (testtoc2.doc).
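
For reference, the reset behaviour in question can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `ListLabelTracker` is a hypothetical helper, it does not reproduce the attached ListLabelsExtractor class, and it assumes zero-based levels, purely numeric labels, and no skipped levels. The key step is that numbering an item at one level resets the counters of all deeper levels.

```java
// Hypothetical helper; it does not reproduce the attached ListLabelsExtractor class.
final class ListLabelTracker {
    private final int[] counters;

    // maxLevels is the number of nesting levels to track (Word lists use up to 9).
    ListLabelTracker(int maxLevels) {
        counters = new int[maxLevels];
    }

    // Returns the label for the next list item at the given zero-based level.
    String next(int level) {
        counters[level]++;
        // Reset every deeper level so numbering restarts under the new parent item.
        for (int i = level + 1; i < counters.length; i++) {
            counters[i] = 0;
        }
        StringBuilder label = new StringBuilder();
        for (int i = 0; i <= level; i++) {
            if (i > 0) {
                label.append('.');
            }
            label.append(counters[i]);
        }
        return label.toString();
    }

    public static void main(String[] args) {
        ListLabelTracker tracker = new ListLabelTracker(9);
        int[] levels = {0, 1, 1, 0, 1};
        for (int level : levels) {
            System.out.println(tracker.next(level)); // prints 1, 1.1, 1.2, 2, 2.1
        }
    }
}
```

Without the reset loop, the last item in this example would come out as 2.3 instead of 2.1, which matches the symptom described above.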

Thanks,

-Bin

alexey.noskov · December 15, 2008, 5:56am

Hi

Thank you for additional information. I modified the code. Please try using the following:

private void ReplaceListLabels(Document doc)
{
    //Get collection of Paragraphs from the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

    //Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        if (par.IsListItem)
        {
            ListLabelsExtractor extractor = ListLabelsExtractor.GetLabelExtractor(par.ListFormat.List);
            //Get label of list item
            string label = extractor.GetListLabel(par.ListFormat.ListLevelNumber) + "\t";
            //Create run that will represent label in the document
            Run labelRun = new Run(doc, label);

            //We should import paragraph indents
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;

            Console.WriteLine(label + "\t" + par.ToTxt());

            //Remove list label
            par.ListFormat.RemoveNumbers();
            //Insert label at the beginning of paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}

The ListLabelsExtractor class is attached. Please let me know in case of any issues.

Best regards.

datahsd · December 15, 2008, 12:42pm

It’s getting better, but I still found a problem: it increased the numbering twice for some headings. Can you try the attached doc?

Thanks,

-Bin

alexey.noskov · December 16, 2008, 7:34am

Hi

Thank you for the additional information. I spent some time on this, and the code now works correctly with all your documents and with some other test documents. Please check it on your side and let me know in case of any issues.

Best regards.

datahsd · December 16, 2008, 12:25pm

I tried the new ListLabelsExtractor class. But I got the same result. Do I need to change anything in ReplaceListLabels? I attached the generated HTML file from testtoc3.doc.

Thanks,

-Bin

alexey.noskov · December 16, 2008, 2:21pm

Hi

Sorry. I missed this. Please try this code:

private void ReplaceListLabels(Document doc)
{
    //Get collection of Paragraphs from the document
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

    //Loop through all paragraphs
    foreach (Paragraph par in paragraphs)
    {
        if (par.IsListItem && par.HasChildNodes)
        {
            ListLabelsExtractor extractor = ListLabelsExtractor.GetLabelExtractor(par.ListFormat.List);
            //Get label of list item
            string label = extractor.GetListLabel(par.ListFormat.ListLevelNumber) + "\t";
            //Create run that will represent label in the document
            Run labelRun = new Run(doc, label);

            //We should import paragraph indents
            par.ParagraphFormat.LeftIndent = par.ListFormat.ListLevel.TextPosition;
            par.ParagraphFormat.FirstLineIndent = par.ListFormat.ListLevel.NumberPosition;

            Console.WriteLine(label + "\t" + par.ToTxt());

            //Remove list label
            par.ListFormat.RemoveNumbers();
            //Insert label at the beginning of paragraph
            par.ChildNodes.Insert(0, labelRun);
        }
    }
}

Best regards.