How to compare Formatting?

hwellmann · March 22, 2011, 4:13am

Using document.joinRunsWithSameFormatting() on a Word document (binary *.doc), I still get a whole lot of one-character runs, and due to the obfuscated API and the lack of a useful Font.toString() method, I don’t understand why.

The difference seems to be specific to the binary Word format. When I save the document as *.docx and look at the XML, the runs are merged as expected.

Looking at the Font objects in the debugger, I can see two Lists on each object which look like a list of property IDs and property values. The one-character run has two additional property IDs 460 and 470.
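To illustrate the kind of comparison described above, here is a minimal Java sketch that lists the property IDs on which two runs differ. It does not use the Aspose.Words API: the `RunPropertyDiff` class, the map-based model of a run's formatting, and all IDs and values other than 460 and 470 are assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: a run's formatting as a map from numeric property ID to value.
final class RunPropertyDiff {

    /** Returns the IDs that are missing from one map or carry different values in the two maps. */
    static Set<Integer> differingIds(Map<Integer, Object> a, Map<Integer, Object> b) {
        Set<Integer> ids = new TreeSet<>(a.keySet());
        ids.addAll(b.keySet());
        // Drop every ID that both maps contain with an equal value.
        ids.removeIf(id -> a.containsKey(id) && b.containsKey(id)
                && Objects.equals(a.get(id), b.get(id)));
        return ids;
    }

    public static void main(String[] args) {
        Map<Integer, Object> normalRun = new LinkedHashMap<>();
        normalRun.put(190, 24);       // made-up ID and value
        normalRun.put(230, "Arial");  // made-up ID and value

        Map<Integer, Object> oneCharRun = new LinkedHashMap<>(normalRun);
        oneCharRun.put(460, 0);       // IDs taken from the post; values are made up
        oneCharRun.put(470, 0);

        System.out.println(differingIds(normalRun, oneCharRun)); // prints [460, 470]
    }
}
```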

Can you shed some light on this? Is there a documented API to inspect these properties?

Best regards,
Harald

adam.skelton · March 22, 2011, 4:42am

Hi Harald,
Thanks for your inquiry.
Could you please attach your document here for testing? We will inspect this further for you.
Thanks,

hwellmann · March 22, 2011, 6:08am

The document I’m working with contains confidential customer information, so I cannot attach it. I can try to extract a snippet and hope the problem will be preserved in the extract.

Best regards,
Harald

adam.skelton · March 22, 2011, 7:32am

Hi Harald,
Thanks for this additional information.
All documents that you attach to the forums can be opened only by you and Aspose staff and are treated confidentially. You can also choose to e-mail me by clicking the Contact button on my forum post and clicking “Send e-mail”. You could also try to reproduce the issue with dummy data if this is possible.
Thanks,

hwellmann · March 22, 2011, 9:54am

Hi Adam,

I was able to isolate the problem. The attached document contains 3 paragraphs, each of which is split into 2 or 3 runs for no obvious reason.

Can you explain why?

Best regards,
Harald

adam.skelton · March 22, 2011, 5:34pm

Hi Harald,
Thanks for this additional information.
I managed to reproduce the issue on my side. I have logged it, and it will be investigated at some point in the future. Please note that some runs would not be merged anyway, e.g. the second block of runs is split by a bookmark, and in the third block the full stop is in a different paragraph.
For now, as you found, you can work around this problem by saving the document in DOCX format first. You can do this in memory. Please see the code below.

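// Load the binary DOC file.
Document doc = new Document("at_a_glance.doc");
// Convert it to DOCX in memory.
MemoryStream stream = new MemoryStream();
doc.Save(stream, SaveFormat.Docx);
// Rewind the stream so the DOCX copy is read from the start.
stream.Position = 0;
Document doc2 = new Document(stream);
// Runs are merged as expected in the DOCX-based document.
doc2.JoinRunsWithSameFormatting();
// dataDir is assumed to hold the output folder path.
doc2.Save(dataDir + "Document out.doc");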

Thanks,

hwellmann · March 25, 2011, 3:52am

Hi Adam,

I’m afraid that’s not a viable solution as the doc to docx transformation loses information.

Since Aspose Words obfuscates all information that could have helped me analyze the problem, I used Apache POI to inspect my document.

It turned out that one of the problematic runs has a sprmCHresi property set, which confirmed my suspicion that the root cause had to be related to hyphenation.

It seems that Aspose Words recognizes this property and therefore does not join two adjacent runs whose formatting is otherwise identical.

If this is the case, then this property should be accessible via the public API. The second best solution would be to completely ignore this property and silently merge the runs.
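The second option, merging runs while ignoring a given property, could look roughly like the following Java sketch. It is only a model of the idea, not Aspose.Words code: `RunMergeSketch`, `SimpleRun`, and the map of property IDs are hypothetical, and the thread does not establish that IDs 460 and 470 correspond to sprmCHresi, so they serve purely as placeholder IDs here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RunMergeSketch {

    /** Hypothetical stand-in for a text run: its text plus formatting keyed by property ID. */
    static final class SimpleRun {
        final String text;
        final Map<Integer, Object> props;

        SimpleRun(String text, Map<Integer, Object> props) {
            this.text = text;
            this.props = props;
        }
    }

    /** Returns a copy of the formatting without the ignored property IDs. */
    static Map<Integer, Object> significant(Map<Integer, Object> props, Set<Integer> ignored) {
        Map<Integer, Object> copy = new HashMap<>(props);
        copy.keySet().removeAll(ignored);
        return copy;
    }

    /** Joins adjacent runs whose formatting is equal once the ignored IDs are left out. */
    static List<SimpleRun> joinRuns(List<SimpleRun> runs, Set<Integer> ignored) {
        List<SimpleRun> out = new ArrayList<>();
        for (SimpleRun run : runs) {
            SimpleRun last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null
                    && significant(last.props, ignored).equals(significant(run.props, ignored))) {
                // Append the text to the previous run and keep that run's formatting.
                out.set(out.size() - 1, new SimpleRun(last.text + run.text, last.props));
            } else {
                out.add(run);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Object> base = new HashMap<>();
        base.put(1, "Arial");                 // made-up ID and value
        Map<Integer, Object> withExtra = new HashMap<>(base);
        withExtra.put(460, 0);                // placeholder IDs to be ignored
        withExtra.put(470, 0);
        Map<Integer, Object> bold = new HashMap<>(base);
        bold.put(2, true);                    // made-up ID: a real formatting difference

        List<SimpleRun> runs = Arrays.asList(
                new SimpleRun("Hel", base),
                new SimpleRun("l", withExtra),
                new SimpleRun("o", base),
                new SimpleRun("!", bold));

        Set<Integer> ignored = new HashSet<>(Arrays.asList(460, 470));
        for (SimpleRun run : joinRuns(runs, ignored)) {
            System.out.println(run.text);     // prints "Hello", then "!"
        }
    }
}
```

In this model the merged run keeps the formatting of the first run in each group, so the ignored property is dropped silently if only later runs carried it.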

Best regards,
Harald

adam.skelton · March 25, 2011, 4:38am

Hi Harald,
Thanks for this additional information. I have added your findings to the issue description. In the meantime, you can use the workaround code below. Call this method on the document instead of JoinRunsWithSameFormatting.

using System.Collections.Generic;
using System.IO;
using Aspose.Words;

public static void ManualJoinRunsWithSameFormatting(Document doc)
{
    // Store links between the original document and the document saved in DOCX format.
    Dictionary<Run, Run> nodeLookupList = new Dictionary<Run, Run>();
    Dictionary<Run, Run> reverseLookupList = new Dictionary<Run, Run>();
    // Save the document in DOCX format so JoinRunsWithSameFormatting will work as expected.
    MemoryStream stream = new MemoryStream();
    doc.Save(stream, SaveFormat.Docx);
    // Rewind the stream so the DOCX copy is read from the start.
    stream.Position = 0;
    Document cloneDoc = new Document(stream);
    // Add links of nodes between documents. This relies on both documents
    // containing the same runs in the same order.
    NodeCollection docNodes = doc.GetChildNodes(NodeType.Run, true);
    NodeCollection cloneNodes = cloneDoc.GetChildNodes(NodeType.Run, true);
    for (int i = 0; i < docNodes.Count; i++)
    {
        nodeLookupList.Add((Run)docNodes[i], (Run)cloneNodes[i]);
        reverseLookupList.Add((Run)cloneNodes[i], (Run)docNodes[i]);
    }
    // Join runs with same formatting.
    cloneDoc.JoinRunsWithSameFormatting();
    // Check the result in the DOCX document. For each node which was removed (ParentNode == null)
    // we can assume that the formatting is really the same in the DOC document.
    // We can also assume the previous sibling is the one it's joined to and that the node is a Run.
    foreach (Run run in nodeLookupList.Values)
    {
        if (run.ParentNode == null)
        {
            Run docRun = reverseLookupList[run];
            JoinRuns((Run)docRun.PreviousSibling, docRun);
        }
    }
}

/// <summary>
/// Joins the source run to the base run.
/// </summary>
public static void JoinRuns(Run baseRun, Run sourceRun)
{
    // Append the source text to the base run, then drop the source run from the document.
    baseRun.Text = baseRun.Text + sourceRun.Text;
    sourceRun.Remove();
}

Thanks,

aspose.notifier · December 31, 2012, 1:26am

The issues you have found earlier (filed as WORDSNET-4544) have been fixed in this .NET update and this Java update.
