Issues when converting from Word documents to PDF and HTML

ThomasKnight · March 31, 2008, 11:21pm

Hello all,

We are experiencing issues with file conversion when using Aspose.Words from .NET (C#).

The integration was relatively seamless and initial testing showed excellent results. However, after deployment the client has noticed that many of their documents do not convert to what we would call an acceptable standard.

When converting to PDF I have noted:
Images are overrun by text, as are text-based tables.
Table of contents entries are renumbered with incorrect page numbers.
Some images are missing completely.
I am aware that the client is experiencing other issues as well; this is only what I have found in the sample document they provided.

HTML conversions are mostly flawless, with the exception of the Table of Contents aligning to the right for certain sections. Otherwise the result is acceptable.

My main query is this: does Aspose support a more direct conversion method that we are not using? Printing directly to PDF gives much better results, but it does not integrate well with our requirements.

Kind Regards,

Thomas Knight - Lead Developer.

Sample Images (Converted version on right):
http://www.objectify.com.au/temp2/asposeSamples/2.jpg - Overrun field area
http://www.objectify.com.au/temp2/asposeSamples/3.jpg - Incorrect page numbers
http://www.objectify.com.au/temp2/asposeSamples/4.jpg - More incorrect page numbers in appendixes
http://www.objectify.com.au/temp2/asposeSamples/5.jpg - Overrun image
http://www.objectify.com.au/temp2/asposeSamples/HTML_TOC.jpg - HTML TOC sample (Conversion only, refer to 3.jpg for original)

Sample Code Snippet:

void init()
{
    License.InitLicenses();
}

…
…
…

public static byte[] ConvertToPDF(byte[] originalDoc)
{
    // Load the Word document from the incoming bytes.
    MemoryStream ms = new MemoryStream(originalDoc);
    Aspose.Words.Document doc = new Aspose.Words.Document(ms);

    // Export to the intermediate Aspose.Pdf XML format; images go to a temp folder.
    MemoryStream stream = new MemoryStream();
    doc.SaveOptions.ExportImagesFolder = getTempFolder().FullName;
    doc.Save(stream, SaveFormat.AsposePdf);

    // Rewind and read the intermediate XML back in.
    stream.Seek(0, SeekOrigin.Begin);
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(stream);

    // Bind the XML to Aspose.Pdf and render the final PDF.
    Aspose.Pdf.Pdf pdf = new Aspose.Pdf.Pdf();
    pdf.IsImagesInXmlDeleteNeeded = true;
    pdf.BindXML(xmlDoc, null);
    pdf.IsTruetypeFontMapCached = false;

    byte[] output = pdf.GetBuffer();
    return output;
}

Klepus · April 1, 2008, 4:22am

Hello Thomas!
Thank you for your inquiry.
I can see the issues in the linked images, but to investigate them we need the source DOC files. Please attach them to the forum. You can attach multiple files to one post by putting them in an archive; ZIP or RAR is fine.
Conversions to PDF and HTML are not perfect. Many of the PDF problems are consequences of restrictions in Aspose.Pdf, which does not support everything that can appear in a Word document. If you attach the files I can suggest something: either changes in the Aspose components or workarounds.
Please let me know whether it is acceptable to change documents manually before conversion. If it is not, we’ll try to find programmatic solutions instead of manual rework.
Aspose.Words currently supports conversion to PDF only together with Aspose.Pdf. Direct conversion will be available in the future, but not soon. We are developing a rendering engine that will be able to paginate and lay out documents for output in print-aware formats. It is the main part of Aspose.Words.Viewer, which is currently in beta.
Regards,

ThomasKnight · April 1, 2008, 6:34pm

Thanks for the speedy response.

Unfortunately, manual modification of documents is not acceptable for the client. The system is a fully automatic integration between their existing DMS and our product, and the DMS contains far too many entries for manual conversion at this time.

The archive at the URL below contains the original document, the Aspose-converted document and a sample produced by the Microsoft Office 2007 convert-to-PDF tool.

Thanks again for your assistance.

Regards,
Thomas Knight.

Please fetch from http://www.objectify.com.au/temp2/asposeSamples/PDF.zip (18 Meg)

Klepus · April 2, 2008, 12:56pm

Hello!
Thank you for the additional materials. I’ll investigate all the cases and provide you with more information.
Regards,

Klepus · April 3, 2008, 11:34am

Hi Thomas!
First I should note that conversion to non-Word formats cannot, in principle, be perfect. We have a map that shows which features are supported in PDF export and HTML import/export:
https://releases.aspose.com/words/net
Your document is a really great test case! I’ll go through the original list of images you referenced in this thread. I found that all of the PDF issues need to be looked at together with the Aspose.Pdf team, and I will ask for their assistance. Here are the details.

http://www.objectify.com.au/temp2/asposeSamples/2.jpg
The floating text box is assigned a wrapping mode that is not supported. This is known issue #1001. Fixing it needs collaboration with the Aspose.Pdf team. As far as I know they do not plan to provide this improvement in the foreseeable future. That’s disappointing for both you and me. I’ll ask them for comments.
As a workaround we would change the alignment and wrap mode if we were able to change the source documents. If we are not allowed to do so, we can try to do the same programmatically before the conversion. That is more difficult, because the case has to be defined as precisely and narrowly as possible; this approach cannot solve the problem in general. We would need statistics on how frequent the problem is and how much of it we could cover.
http://www.objectify.com.au/temp2/asposeSamples/3.jpg
Pages in the TOC are numbered incorrectly. We don’t output page numbers directly; REFPAGE macros are inserted instead, and Aspose.Pdf is responsible for expanding them. I’ll consult with the Aspose.Pdf team to provide you with more information or to fix the issue.
This is the beginning of the TOC in the intermediate Aspose.Pdf XML file I have generated. Here we can see that the first TOC entry refers to the page containing the paragraph marked with ID=“paraId_2”. That’s right: it’s the first paragraph in the section containing the heading “SECTION ONE: CONTEXT AND BACKGROUND”. This page is numbered “1” in the document, since numbering restarts here.
Table of contents
SECTION ONE: CONTEXT AND BACKGROUND
#$TAB
#$REFPAGE(paraId_2)
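To make the mechanism concrete, here is a minimal sketch of what expanding such a macro involves. It is an illustration only and not Aspose.Pdf's actual code: the class name `RefPageSketch`, the method `Expand` and the `pageLabels` map are all hypothetical. The point it shows is that the expander needs the page label as it is numbered in the document (restarts included), not the physical page index.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative sketch only; NOT the Aspose.Pdf implementation.
public static class RefPageSketch
{
    // Matches macros of the form #$REFPAGE(paraId_2), as quoted above.
    private static readonly Regex Macro =
        new Regex(@"#\$REFPAGE\((?<id>[^)]+)\)");

    // pageLabels is a hypothetical map from paragraph ID to the page label
    // that paragraph ends up on after layout (e.g. "1" where numbering restarts).
    public static string Expand(string text, IDictionary<string, string> pageLabels)
    {
        return Macro.Replace(text, m =>
        {
            string id = m.Groups["id"].Value;
            string label;
            // Unknown targets are left unexpanded so the gap stays visible.
            return pageLabels.TryGetValue(id, out label) ? label : m.Value;
        });
    }
}
```

For example, with a map that sends `paraId_2` to `"1"`, the entry text `SECTION ONE #$REFPAGE(paraId_2)` would expand to `SECTION ONE 1`. If the map held physical page indexes instead of restarted labels, the TOC would show the kind of wrong numbers seen in 3.jpg.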
http://www.objectify.com.au/temp2/asposeSamples/4.jpg
Wrong page numbering starting from “APPENDIX 2: REGIONAL PLANS, STRATEGIES AND INVESTIGATIONS.” This TOC element correctly refers to the paragraph marked with ID=“paraId_161”. This is the same problem as in 3.jpg above.
http://www.objectify.com.au/temp2/asposeSamples/5.jpg
The same as 2.jpg above. This mode of text wrapping is not supported. I don’t see any workaround, sorry.
http://www.objectify.com.au/temp2/asposeSamples/HTML_TOC.jpg
Unintended right alignment of some TOC elements. I found that only first- and second-level elements are aligned incorrectly. For example, “1.2.1” is okay but “SECTION ONE” and “1.1” are not.
This happens because the corresponding MS Word style named “TOC1” has right alignment. To see this, open the document in MS Word, select “Format → Show Formatting” from the menu and place the cursor on any of the lines that get right-aligned. In MS Word the layout is not spoiled because Word treats tabs in a TOC differently. Note that if you convert this document to HTML using MS Word itself you will get the same result. I have just checked.
Here we can try a programmatic workaround. If you remove the right alignment from the “TOC1” style, the document will be converted without this issue. But I’m not sure this will be suitable for other documents. Since you are building an automatic system, any patch in the code should be applied with care so that it does not affect other documents.
If you are absolutely sure this doesn’t break anything, you can do it like this before converting to HTML:

doc.Styles["TOC1"].ParagraphFormat.Alignment = ParagraphAlignment.Left;

When I converted your document to HTML I found other issues, mainly caused by floating content. In the current implementation floating images are placed inline. That’s okay if the result is acceptable to you.

Please attach a document in which images disappear completely, and any others in which further issues show up. I’ll investigate them and provide some feedback.
Best regards,

Hans.firefox · April 3, 2008, 1:55pm

Hi Thomas,

I am a member of the Aspose.Pdf team.
For problems 1 and 4 (2.jpg and 5.jpg): these are caused by a known issue, namely that text wrapping modes for shapes are not supported in Aspose.Pdf at present. We plan to support this feature in a future release. According to our investigation it is complicated enough that I am afraid it will take us about 6 months. Sorry for the inconvenience.
For the other issues listed above, I can’t give you an immediate answer. We will ask Viktor to share the corresponding materials so that we can investigate them. Then I will come back to you and advise.
Thanks.

Best regards.