Formatting failures when converting to ODT

PjCouldBe · May 6, 2016, 3:10am

Good day!

I’m using the Aspose.Words (version 16.3.0) and Aspose.PDF (version 11.4.0) Java libraries. I need to convert documents from different formats (particularly DOCX, TXT and PDF) to ODT. However, some conversion results contain several formatting and appearance issues.

The most important issue is that numbering and bullet markers are shifted arbitrarily, and extra spaces are sometimes added. I have attached some source documents (in source.zip) and the corresponding conversion results (results.zip) to demonstrate the problems.

Can I solve these issues through SaveOptions settings or by some other means, or is this a bug?

Here are the code snippets I use to convert from .docx to .odt:

protected Document to(Document src, int saveFormat)
{
    SaveOptions opts = getSaveOptions(saveFormat);
    byte[] bytes = new byte[0];
    ILogger logger = LoggerSystem.getLogger(this.getClass().getName());
    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
        src.save(bos, opts);
        bytes = bos.toByteArray();
    } catch (Exception e)
    {
        logger.warn("Fail converting document because of:\n" + ExceptionUtils.getFullStackTrace(e));
        return null;
    }
    try (ByteArrayInputStream bis = new ByteArrayInputStream(bytes)) {
        return new com.aspose.words.Document(bis);
    } catch (Exception e)
    {
        logger.warn("Fail converting document because of:\n" + ExceptionUtils.getFullStackTrace(e));
        return null;
    }
}

private SaveOptions getSaveOptions(int saveFormat)
{
    if (saveFormat < 0) return null;
    switch (saveFormat)
    {
        // … conversions for other formats are omitted here
        case DOC:
        case DOCX:
        case DOCM:
        case DOT:
        case DOTM:
        case DOTX:
            DocSaveOptions dso = new DocSaveOptions();
            dso.setUpdateFields(false);
            dso.setSaveFormat(saveFormat);
            return dso;
        case ODT:
        case OPEN_XPS:
        case OTT:
            OdtSaveOptions oso = new OdtSaveOptions(saveFormat);
            oso.setUseHighQualityRendering(true);
            oso.setPrettyFormat(true);
            return oso;
        default:
            return null;
    }
}

As for conversion from PDF, I first convert from PDF to DOCX and then from DOCX to ODT.
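The save-to-memory-then-reload pattern used in the snippet above can be sketched in isolation as follows. This is a minimal, library-independent sketch: `RoundTrip`, `Saver` and `Loader` are hypothetical names introduced for illustration and are not part of Aspose. It assumes Java 9 or later (for `InputStream.readAllBytes`) and returns an `Optional` instead of `null`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper (not an Aspose class): save to memory, then reload from it. */
final class RoundTrip {

    /** Writes some content to a stream; may throw any exception. */
    @FunctionalInterface
    interface Saver {
        void saveTo(OutputStream out) throws Exception;
    }

    /** Builds a result of type T from a stream; may throw any exception. */
    @FunctionalInterface
    interface Loader<T> {
        T loadFrom(InputStream in) throws Exception;
    }

    private RoundTrip() { }

    /** Returns the reloaded object, or an empty Optional if saving or loading failed. */
    static <T> Optional<T> roundTrip(Saver saver, Loader<T> loader) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            saver.saveTo(bos);
            byte[] bytes = bos.toByteArray();
            if (bytes.length == 0) {
                // Nothing was written, so there is nothing to reload.
                return Optional.empty();
            }
            try (InputStream in = new ByteArrayInputStream(bytes)) {
                return Optional.ofNullable(loader.loadFrom(in));
            }
        } catch (Exception e) {
            // One place to report the failure instead of two separate catch blocks.
            System.err.println("Conversion failed: " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Toy saver and loader standing in for a real document library.
        Optional<String> result = roundTrip(
                out -> out.write("hello".getBytes(StandardCharsets.UTF_8)),
                in -> new String(in.readAllBytes(), StandardCharsets.UTF_8));
        System.out.println(result.orElse("<failed>"));
    }
}
```

With the calls shown in the snippet above, the saver would presumably wrap `src.save(out, opts)` and the loader would wrap `new com.aspose.words.Document(in)`.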

awais.hafeez · May 8, 2016, 10:44pm

Hi Dmitry,

Thanks for your inquiry.

First off, please note that the DOCX/TXT formats may contain features that are not compatible with the ODT format. For more information on incompatible features and on changing file formats, please check Microsoft's support documentation, e.g. "Differences between the OpenDocument Text (.odt) format and the Word (.docx) format".

Secondly, I have generated some ODT files using Aspose.Words 16.3.0 and MS Word 2016 and attached them here for your reference. Please create comparison screenshots that highlight (circle) the problematic areas in the ODT files generated by Aspose.Words and attach them here. We will investigate the highlighted issues on our end and provide you with more information. Thanks for your cooperation.

Moreover, you can use the “Aspose.Pdf for .NET API” to convert PDF to Word format and then the “Aspose.Words for .NET API” to convert Word format to ODT, etc. Hope this helps.

Best regards,

PjCouldBe · May 9, 2016, 5:21am

Hello! Thank you for quick reply!

I’ve made comparison screenshots for your documents (attached in screenshots.rar). The archive contains subfolders named after the original document titles. Each subfolder contains a set of screenshots numbered 1, 2, 3 …, with the following prefixes:
- "CRITICAL FROM SOURCE - " prefix means that the comparison was between the source document and the version converted by Microsoft Office, to identify violations that may be of interest for some reason, e.g. if they might be inherited.

- "COMPARED WITH SOURCE - " prefix means that the comparison was between the source document and the version converted by Aspose-16.3.0.

- no prefix means that the comparison was between the version converted by Microsoft Office and the version converted by Aspose-16.3.0.

In each screenshot, the left side shows the part of the document with the violation and the right side shows the correct region.

Despite all the violations, your conversion with Aspose 16.3.0 is better than our current one. Could you share the code snippet you used?

Thanks a lot!

awais.hafeez · May 10, 2016, 1:48am

Hi Dmitry,

Thanks for the additional information. We are working on your issues and will get back to you soon.

Secondly, I used the following simple code to generate these ODT files on a Windows 10 machine:

Document doc = new Document(getMyDir() + "input.docx");
doc.save(getMyDir() + "output.odt");

Best regards,

PjCouldBe · May 17, 2016, 12:32am

Good day! Thank you! And have you any progress with our issues?

awais.hafeez · May 18, 2016, 12:51am

Hi Dmitry,

Thanks for being patient.

Regarding the screenshots in the 106 and ARC SubLicense Agreement 0315 folders:

After an initial test with Aspose.Words for .NET 16.4.0, I was unable to reproduce these issues on my side. I would suggest that you upgrade to the latest version of Aspose.Words. You can download it from the following link. I hope this helps.

https://downloads.aspose.com/words/java

Regarding screenshots contained in 2013 FIC License Agreement folder:

I managed to reproduce the following issue on our end:

WORDSNET-13568: Left indentation of list items is incorrect in generated ODT (against 1.jpg)

However, I was unable to reproduce the issues highlighted in 2.jpg and 3.jpg on my side.

Regarding screenshots contained in EULA28JAN2014 folder:

I managed to reproduce the following issues on our end:

WORDSNET-13569: Left indentation of Paragraph is incorrect in generated ODT (against COMPARED WITH SOURCE - 1.jpg)

WORDSNET-13570: Incorrect left indentation for nested list items in generated ODT (against COMPARED WITH SOURCE - 2.jpg)

WORDSNET-13571: No space between list numbers and list item texts in generated ODT (against CRITICAL FROM SOURCE - 2.jpg)

However, I was unable to reproduce the issues highlighted in CRITICAL FROM SOURCE - 1.jpg and CRITICAL FROM SOURCE - 3.jpg on my side.

Best regards,

PjCouldBe · May 18, 2016, 5:49am

Good day!

Great! Thank you! I’ll look into your reply as soon as I can.

PjCouldBe · May 20, 2016, 3:48am

Good day!

I’ve downloaded the latest Aspose libraries (aspose-words-16.4.0 and aspose-pdf-11.5.0). Things really are much better, and none of the issues that you couldn’t reproduce are reproducible for me either!

But there are some other issues, particularly with PDF to ODT, that I can’t manage by myself. I’ve attached an archive with a PDF folder in it. Every subfolder (named after its document) contains the source PDF document, the ODT document obtained by conversion, and screenshots of the problems (in Screens). Please verify these issues and tell me whether they are bugs or how to avoid them.

Additionally, I discovered a different problem with the EULA28JAN2014 license that you have already verified. Maybe I have done something wrong, so please check the screenshot with the error (in the folder of the same name).

Finally, I have obtained an ODT of the ARC license document (attached in the “ARC SubLicense Agreement 0315” folder), and it contains some empty pages. I understand that this is not a bug, as they appeared because of spaces after, page breaks and so on. Could you provide a code snippet or some advice on how to get rid of them?

Thank you for your help!

awais.hafeez · May 23, 2016, 3:48am

Hi Dmitry,

Thanks for your inquiry. Regarding EULA28JAN2014, we have already logged three issues. Is this a new issue you are reporting?

Regarding the PDF to ODT conversion problems, how are you doing these conversions? Are you using Aspose.Words + Aspose.Pdf for this? Please share the piece of source code you’re using on your end.

Regarding the blank pages issue, we have logged this issue in our bug tracking system. The ID of this issue is WORDSNET-13594. Your request has also been linked to the appropriate issue and you will be notified as soon as it is resolved. Sorry for the inconvenience.

Best regards,

PjCouldBe · May 23, 2016, 11:21am

Good day! Thank you for your reply!

As for EULA28JAN2014: yes, this is a new issue that was not in the previous archive. There are mistakes in the numbering of sublist items.

Regarding PDF conversion: yes, I use a two-step conversion, first from PDF to DOCX and then from DOCX to ODT. The code is as follows:

public com.aspose.words.Document to(com.aspose.pdf.Document src, String to) {
    to = to.startsWith(".") ? to.substring(1) : to;
    com.aspose.pdf.SaveOptions opts = getPdfSaveOptions(getFormatFromString(to, com.aspose.pdf.SaveFormat.class));
    if (opts == null) {
        return throughDocx(src, to);
    }
    return fromPdfStraightlyTo(src, opts);
}

private com.aspose.words.Document throughDocx(com.aspose.pdf.Document src, String to) {
    com.aspose.pdf.DocSaveOptions opts = new com.aspose.pdf.DocSaveOptions();
    // opts.setFormat(DocX);
    opts.setRecognizeBullets(true);
    return new OfficeConvUtils( fromPdfStraightlyTo(src, opts) ).to(to);
}

private com.aspose.words.Document fromPdfStraightlyTo(com.aspose.pdf.Document src, com.aspose.pdf.SaveOptions opts) {
    ILogger logger = LoggerSystem.getLogger(this.getClass().getName());
    byte[] bytes = new byte[0];
    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
        src.save(bos, opts);
        bytes = bos.toByteArray();
    } catch (Exception e) {
        logger.warn("Fail converting document because of:\n" + e.getMessage());
        return null;
    }
    try (ByteArrayInputStream bis = new ByteArrayInputStream(bytes)) {
        return new com.aspose.words.Document(bis);
    } catch (Exception e) {
        logger.warn("Fail converting document because of:\n" + e.getMessage());
        return null;
    }
}

private com.aspose.pdf.SaveOptions getPdfSaveOptions(int saveFormat) {
    if (saveFormat < 0) return null;
    switch (saveFormat) {
        case Doc:
        case DocX:
            com.aspose.pdf.DocSaveOptions saveOptions = new com.aspose.pdf.DocSaveOptions();
            saveOptions.setFormat(saveFormat == DocX
                    ? com.aspose.pdf.DocSaveOptions.DocFormat.DocX
                    : com.aspose.pdf.DocSaveOptions.DocFormat.Doc);
            saveOptions.setRecognizeBullets(true);
            return saveOptions;
        case Excel:
            return new ExcelSaveOptions();
        case Html:
            return new com.aspose.pdf.HtmlSaveOptions();
            //…
        default:
            return null;
    }
}

protected int getFormatFromString(String format, Class<?> saveFormatClass) {
    if (saveFormatClass != SaveFormat.class && saveFormatClass != com.aspose.pdf.SaveFormat.class) {
        throw new IllegalArgumentException(
                " You must give only com.aspose.words.SaveFormat or com.aspose.pdf.SaveFormat integer argument! ");
    }
    String s = format.trim().toLowerCase();
    try {
        for (Field f : saveFormatClass.getFields()) {
            if (s.equals(f.getName().trim().toLowerCase())
                    && Modifier.isStatic(f.getModifiers())
                    && Modifier.isFinal(f.getModifiers()))
            {
                return f.getInt(f);
            }
        }
    } catch (IllegalAccessException e) {
        return -1;
    }
    return -1;
}
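The reflection-based lookup in getFormatFromString() can be illustrated without the Aspose classes. The following is a minimal sketch under stated assumptions: `DemoSaveFormat` is a hypothetical stand-in for a constants class with made-up values, and `FormatLookup` is an illustrative name. Compared with the method above, it checks that a field is a `static final int` before reading it, passes `null` to `Field.getInt` (the argument is ignored for static fields), and returns an `OptionalInt` rather than -1.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Locale;
import java.util.OptionalInt;

/** Hypothetical stand-in for a save-format constants class; the values are made up. */
final class DemoSaveFormat {
    public static final int DOCX = 1;
    public static final int ODT = 2;
    public static final String NOT_AN_INT = "ignored";
    public int instanceField = 99;

    private DemoSaveFormat() { }
}

final class FormatLookup {

    private FormatLookup() { }

    /** Finds a public static final int constant by case-insensitive name. */
    static OptionalInt find(String name, Class<?> constants) {
        String wanted = name.trim().toLowerCase(Locale.ROOT);
        if (wanted.startsWith(".")) {
            wanted = wanted.substring(1); // accept ".odt" as well as "odt"
        }
        for (Field f : constants.getFields()) {
            int mods = f.getModifiers();
            boolean isIntConstant = Modifier.isStatic(mods)
                    && Modifier.isFinal(mods)
                    && f.getType() == int.class;
            if (isIntConstant && f.getName().toLowerCase(Locale.ROOT).equals(wanted)) {
                try {
                    return OptionalInt.of(f.getInt(null)); // null: static field, no instance
                } catch (IllegalAccessException e) {
                    return OptionalInt.empty();
                }
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(find("docx", DemoSaveFormat.class));
        System.out.println(find("pdf", DemoSaveFormat.class));
    }
}
```

Checking the field type up front avoids the `IllegalArgumentException` that `Field.getInt` throws when a matching name belongs to a non-int field.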

Please note that if I leave the “opts.setFormat(DOCX)” line in the throughDocx() method uncommented, empty documents are generated. If I replace DOCX with DOC in that line, all pages of the source document overlap one another on a single page. If I delete the line, normal documents are generated (though with the issues described, of course). Those are the results I got.
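The routing decision made by to() and throughDocx() above (save directly when the PDF library has save options for the target, otherwise go through DOCX) can be sketched as a small planning function. This is only an illustration: `ConversionRoute` is a hypothetical name, and the set of direct targets passed in is an assumption chosen for the example, not a statement about what Aspose.Pdf supports.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical planner: lists the formats a PDF passes through on its way to the target. */
final class ConversionRoute {

    private ConversionRoute() { }

    /**
     * Returns the hops for converting a PDF to the target format.
     * If the target is in directTargets, the PDF is saved straight to it;
     * otherwise DOCX is used as an intermediate format.
     */
    static List<String> plan(String target, Set<String> directTargets) {
        String t = target.trim().toLowerCase(Locale.ROOT);
        if (t.startsWith(".")) {
            t = t.substring(1); // same normalisation as in to(): ".odt" -> "odt"
        }
        if (directTargets.contains(t)) {
            return Arrays.asList("pdf", t);
        }
        return Arrays.asList("pdf", "docx", t);
    }

    public static void main(String[] args) {
        // Assumed example set of formats that can be written directly from a PDF.
        Set<String> direct = new HashSet<>(Arrays.asList("doc", "docx", "excel", "html"));
        System.out.println(plan(".odt", direct));
        System.out.println(plan("docx", direct));
    }
}
```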

muhammad.ijaz · May 24, 2016, 11:51pm

Hi Dmitry,

We are further investigating the issue and will update you soon.

Best Regards,

PjCouldBe · June 10, 2016, 4:41am

Good day! Have you any progress with our issues?

codewarior · June 13, 2016, 9:05am

Hi Dmitriy,

Thanks for your patience.

We are in the process of testing the above-stated scenario and will keep you posted on our findings. Meanwhile, during our testing we found that when converting the DataLicenseAgreement.pdf file to DOCX format using Aspose.Pdf for Java 11.5.0, the only formatting issue is a missing indent for list item 6; the rest of the document appears to be fine. For the sake of correction, we have logged it as PDFJAVA-35882 in our issue tracking system. We are also testing DOCX to ODT and similar scenarios for the other documents and will keep you updated on the results.

We are sorry for this delay and inconvenience.

aspose.notifier · July 12, 2016, 11:12pm

The issues you have found earlier (filed as PDFJAVA-35882) have been fixed in Aspose.Pdf for Java 11.7.0.

PjCouldBe · July 13, 2016, 2:06am

OK, thank you! Have you made any progress with the DOCX to ODT bugs?

awais.hafeez · July 14, 2016, 1:31am

Hi Dmitry,

Thanks for your inquiry. Unfortunately, these issues are not resolved yet. They are currently in the queue, pending analysis. We will inform you via this thread as soon as they are resolved. We apologize for the inconvenience.

Best regards,

PjCouldBe · July 18, 2016, 3:45am

Good day!

I have checked your update of the Aspose.Pdf library (version 11.7.0) in detail. The issue you mentioned is indeed fixed, and not only that one. I’ve checked the whole docs-with-issues.rar archive: the violations in the pdf subfolder and many of the other problems listed there are fixed too. It seems that I can reproduce only 2 issues from there. If you wish, I can update the list of issue screenshots for the PDF documents to leave only the reproducible ones.

But there is bad news too. As you know, I use Aspose.PDF + Aspose.Words to convert PDF to ODT. Until I updated the library (the previous version was 11.5.0), the conversion chain PDF -> DOCX -> ODT gave good results. Now, converting the frame-based DOCX obtained from a PDF to ODT produces documents whose frames are glued together and stacked on top of each other on a single page. This problem occurred before, but only in some cases (such as conversion through DocSaveOptions without specifying the desired format, or something else that I unfortunately don’t remember); now it is reproduced consistently. I have attached some documents in the following layout: each subfolder, named after the source document, contains the source PDF, the converted DOCX version and the ODT version. The conversion code is the same as in my earlier message.

It is worth mentioning that with the conversion chain PDF -> DOCX -> HTML -> ODT, the document is not only converted correctly, but the text is also extracted from the frames, so we get a normal text document like an ordinary editable DOCX (the result we would very much like to get). However, there are some other issues, for example problems with fields, undesirable line breaks between bullets and their related headers, and so on. Could you also suggest a better way to convert a PDF to a normal editable document (DOCX or ODT) other than going through HTML?
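When comparing chains such as PDF -> DOCX -> ODT and PDF -> DOCX -> HTML -> ODT, keeping the output of every hop makes it easier to see which step introduces a defect. The following is a minimal, library-independent sketch: `TracedChain` is a hypothetical name, and the toy converters in `main` only append text markers where real conversion calls would go.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical helper: runs a chain of byte-to-byte converters and keeps each result. */
final class TracedChain {

    // LinkedHashMap keeps the hops in the order they were added.
    private final Map<String, UnaryOperator<byte[]>> hops = new LinkedHashMap<>();

    /** Adds a hop; the label is used as the key of its output. */
    TracedChain hop(String label, UnaryOperator<byte[]> converter) {
        hops.put(label, converter);
        return this;
    }

    /** Runs all hops in order and returns the output of every hop, keyed by label. */
    Map<String, byte[]> run(byte[] input) {
        Map<String, byte[]> outputs = new LinkedHashMap<>();
        byte[] current = input;
        for (Map.Entry<String, UnaryOperator<byte[]>> hop : hops.entrySet()) {
            current = hop.getValue().apply(current);
            if (current == null || current.length == 0) {
                // Stop at the first hop that yields nothing, and name it.
                throw new IllegalStateException("Hop '" + hop.getKey() + "' produced no output");
            }
            outputs.put(hop.getKey(), current);
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Toy converters: each appends a marker instead of really converting.
        UnaryOperator<byte[]> toDocx = b ->
                (new String(b, StandardCharsets.UTF_8) + "|docx").getBytes(StandardCharsets.UTF_8);
        UnaryOperator<byte[]> toOdt = b ->
                (new String(b, StandardCharsets.UTF_8) + "|odt").getBytes(StandardCharsets.UTF_8);

        Map<String, byte[]> outputs = new TracedChain()
                .hop("pdf->docx", toDocx)
                .hop("docx->odt", toOdt)
                .run("pdf".getBytes(StandardCharsets.UTF_8));

        // Each intermediate result could be written to disk here and inspected separately.
        outputs.forEach((label, bytes) ->
                System.out.println(label + ": " + new String(bytes, StandardCharsets.UTF_8)));
    }
}
```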

muhammad.ijaz · July 19, 2016, 6:04am

Hi Dmitry,

We are working on your query. Please share the list of remaining issues you want to be resolved.

Best Regards,

PjCouldBe · July 20, 2016, 3:09am

Good day! I’m sharing an updated archive with the remaining PDF to DOCX conversion issues, which are located in the PDF subfolder. The other subdirectories (with the DOCX to ODT issues) are left unchanged for convenience.

I should also remind you that there is now a bug in converting PDF to ODT through DOCX that is reproduced consistently (since version 11.7.0 or 11.6.0). The essence of the bug is that in the resulting ODT documents, all frames obtained from the PDF are glued together and stacked on top of each other (I shared the relevant documents in my previous reply).

muhammad.ijaz · July 21, 2016, 3:17am

Hi Dmitry,

The issues you mentioned can be reproduced during DOCX to ODT conversion. Aspose.Pdf generates correct output during PDF to DOCX conversion. The DOCX to ODT issues have already been logged. We will keep you updated on them in this thread.

Best Regards,