Word to RTF conversion with absolutely positioned textboxes renders incorrectly

cscarlsson · June 28, 2018, 10:00am

Hi there,

I am attempting to open a Word document and save it as a rich text file. I am using the same code that I have used for months to perform this operation on standard .doc files produced by humans, and it has not failed in around 12,000 documents.

The new file was produced by a ‘word printer’ (basically an export from a medical system) and is machine generated. Absolutely everything in the input document has been placed in textboxes and absolutely positioned. When the file is processed, the output is garbled, as in the files below:

Input.zip (13.1 KB)

Output.zip (30.5 KB)

The code I’m using to create the RTF is a fairly straightforward clone as follows:

            //Aspose Licensing
            License lic = new License();
            lic.SetLicense(new MemoryStream(Properties.Resources.Aspose_Words));

            Document _wordDoc = new Document(Path.Combine(SourceFilePath, FileName), new LoadOptions(LoadFormat.Docx, string.Empty, string.Empty));
            LoadOptions _lo = new LoadOptions();
            _lo.LoadFormat = LoadFormat.Doc;  //***
            MemoryStream _template = new MemoryStream(Properties.Resources.QIRTFTemplate);
            Document _rtfDoc = new Document(_template, _lo);
            DocumentBuilder _db = new DocumentBuilder(_rtfDoc);
            _db.Font.Name = m_StandardFontName;
            _db.Font.Size = m_StandardFontSize;
            Node _insertAfterNode = _db.CurrentParagraph;

            // Make sure that the node is either a paragraph or table.
            if ((!_insertAfterNode.NodeType.Equals(NodeType.Paragraph)) & (!_insertAfterNode.NodeType.Equals(NodeType.Table)))
                throw new ArgumentException("The destination node should be either a paragraph or table.");

            // We will be inserting into the parent of the destination paragraph.
            CompositeNode dstStory = _insertAfterNode.ParentNode;

            // This object will be translating styles and lists during the import.
            NodeImporter _importer = new NodeImporter(_wordDoc, _insertAfterNode.Document, ImportFormatMode.KeepSourceFormatting);
            
            // Loop through all sections in the source document.
            foreach (Section _srcSection in _wordDoc.Sections)
            {
                // Loop through all block level nodes (paragraphs and tables) in the body of the section.
                foreach (Node _srcNode in _srcSection.Body)
                {
                    // This creates a clone of the node, suitable for insertion into the destination document.
                    Node _newNode = _importer.ImportNode(_srcNode, true);

                    // Insert new node after the reference node.
                    dstStory.InsertAfter(_newNode, _insertAfterNode);
                    _insertAfterNode = _newNode;
                }
            }

            //Take the first part of the filename and set the extension to rtf for the new file name
            string _newFileName = string.Format("{0}.rtf", FileName.Split('.')[0]);

            _rtfDoc.Save(Path.Combine(OutputFilePath, _newFileName), SaveFormat.Rtf);

Any help would be great

Thanks
Jason

awais.hafeez · June 28, 2018, 1:48pm

@cscarlsson,

Can you please create a standalone simple console application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing? Thanks for your cooperation.

cscarlsson · June 28, 2018, 2:43pm

@awais.hafeez,

Please find attached an app that demonstrates the issue.

TestWordToRTF.exe

Parameters:
-in folder where the input file is stored
-out folder to create new file in
-file name of input file

TestWordToRTF.zip (4.7 MB)

Thank you for looking at this for me.

Kind regards
Jason

cscarlsson · June 28, 2018, 3:16pm

@awais.hafeez

Sorry, that was an exe. Source code attached. Been a long day.

It seems that I am unable to link to the upload. I got a "saved" message but no link appears. I’ll try with a new message.

Kind regards
Jason

cscarlsson · June 28, 2018, 3:29pm

It was the file size. You will need to add the NuGet package for Aspose.Words 18.6.0.

TestWordToRTF-ConsoleSource.zip (5.0 MB)

awais.hafeez · June 29, 2018, 1:46am

@cscarlsson,

Thanks for the additional information. We tested the scenario and managed to reproduce the same problem on our end. We have logged this problem in our issue tracking system for correction. The ID of this issue is WORDSNET-17075. We will look further into the details of this problem and will keep you updated on the status of the correction. We apologize for the inconvenience.

cscarlsson · June 29, 2018, 8:30am

@awais.hafeez

Thank you for the response. Your help is very much appreciated, and I look forward to hearing from you soon regarding this issue.

In the meantime is there a workaround you can suggest? I’m thinking that there may be a way to handle this with Runs but it feels like a fairly brittle solution as the layouts can vary from file to file and we are likely to be handling several hundred files per day. We are paid up on support so this issue could be raised as a ticket if that would help to expedite the matter any further.

Kind regards
Jason

awais.hafeez · June 29, 2018, 8:37pm

@cscarlsson,

Unfortunately, your issue is not resolved yet. It is currently in the queue, pending analysis. Once the analysis is completed and the root cause is determined, we may be able to share a workaround with you. To increase the priority of this issue, you may post a request in Paid Support Helpdesk. We will keep you posted on further updates. We apologize for any inconvenience.

awais.hafeez · June 16, 2019, 6:44am

@cscarlsson,

Regarding WORDSNET-17075, your code imports the content of every section into a single section body in the destination. There are two issues with the “Input.DOC” document (which is actually in RTF format):

1. The input document has a section break that starts a new section on the next page, so the document has two pages. Your code ignores sections, so the output document has a single page with jumbled content. As a solution, please import the content together with its sections.

2. The import functionality has a tricky setting, “IgnoreTextBoxes” (“true” by default). As a result, the original formatting (SpaceAfter = 0) of the textbox paragraphs is lost. The following code imports the problematic document into the template as expected:

Document srcdoc = new Document(@"Input.rtf");
Document docDst = new Document(@"QIRTFTemplate.rtf");

docDst.Sections.Clear();
ImportFormatOptions options = new ImportFormatOptions();
options.IgnoreTextBoxes = false;

docDst.AppendDocument(srcdoc, ImportFormatMode.KeepSourceFormatting, options);

docDst.Save(@"Input_Aw_out.rtf");

So, please use ImportFormatOptions.IgnoreTextBoxes = false to get the expected output for textboxes. Hope this helps.
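If the template's existing content has to be kept (the code above clears the destination sections first), a section-by-section import along the following lines may also work. This is only a minimal sketch and has not been run against the attached files: it assumes an Aspose.Words version that provides ImportFormatOptions and the NodeImporter constructor that accepts it, it reuses the file names from the example above, and the output file name "Input_sections_out.rtf" is hypothetical.

```csharp
using Aspose.Words;

class ImportWithSections
{
    static void Main()
    {
        // File names reused from the example above; adjust the paths as needed.
        Document srcDoc = new Document(@"Input.rtf");
        Document dstDoc = new Document(@"QIRTFTemplate.rtf");

        // Keep the original paragraph formatting inside textboxes.
        ImportFormatOptions options = new ImportFormatOptions();
        options.IgnoreTextBoxes = false;

        // The importer translates styles and lists from the source to the destination.
        NodeImporter importer = new NodeImporter(
            srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting, options);

        // Import whole sections (not only their body nodes) so that
        // section breaks and page setup are carried over.
        foreach (Section srcSection in srcDoc.Sections)
        {
            Section newSection = (Section)importer.ImportNode(srcSection, true);
            dstDoc.Sections.Add(newSection);
        }

        // Hypothetical output name, for illustration only.
        dstDoc.Save(@"Input_sections_out.rtf", SaveFormat.Rtf);
    }
}
```

Unlike the AppendDocument example, this keeps whatever sections the template already contains and appends the imported sections after them.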

We will keep you posted on any further updates and let you know once this issue is resolved.

aspose.notifier · November 6, 2019, 12:06pm

The issues you found earlier (filed as WORDSNET-17075) have been fixed in the Aspose.Words for .NET 19.11 and Aspose.Words for Java 19.11 updates.