Couple of Questions

Hi,

I am currently evaluating Aspose.Word as an addition to Aspose.Excel that we’re already using.

In order to stress Aspose.Word I have set up a sample application that mimics what our current production application does using RCW on MS Word.

NOTE: I will email a sample app and word documents to the word@aspose.com email address

This is the application’s background:

Generate a Word document with a specific area (the only area formatted with the Standard style; all other areas are formatted using a special DO NOT TRANSLATE style) in which translators may enter their translations. Above this area there is some maintenance information. Also, add a couple of custom properties to identify the document once it gets sent back by translation services.

This is what the app does:

1 Load in a Template
2 Populate some maintenance field values in a top-level table
3 Populate the original (to be translated text) into two fields. This text may wrap multiple lines
4 Set custom properties so that the document may be identified when received back from the translation service.
5 Save the document in a Word 97 compatible format

This is where the first problem arises: I can do 1-4 fine, and even 5 is fine, but only for Word 2002. When I load the generated document into Word 97, the table looks distorted and the background colors are mixed up for some fields.

Now the document will be sent to a translation service, the one field will be changed/overwritten with the translation (in Standard style) and the document will be sent back to us. What the app needs to do now is

1 Read in the custom properties
2 Read/Concatenate all the text that is of style Standard

This is my second problem: I can read the properties without problems, but traversing paragraphs or finding paragraphs of a specific style seems impossible.

This is how I do it using RCW:

private string GetTranslationText(Word.Document document)
{
    object what = Word.WdGoToItem.wdGoToLine;
    object which = Word.WdGoToDirection.wdGoToFirst;
    Word.Range range = document.GoTo(ref what, ref which, ref _missing, ref _missing);
    Word.Find find;
    object findText = "";
    object forward = true;
    object format = true;
    object style = "Normal";
    string retVal = null;

    try
    {
        if (range == null)
        {
            return (retVal);
        }
        range.Select();

        find = document.Content.Find;
        find.ClearFormatting();
        find.set_Style(ref style);

        string text = "";
        while (find.Execute(ref findText, ref _missing, ref _missing, ref _missing, ref _missing, ref _missing,
            ref forward, ref _missing, ref format, ref _missing, ref _missing, ref _missing, ref _missing, ref _missing,
            ref _missing) == true)
        {
            object unit = Word.WdUnits.wdParagraph;
            object count = 1;

            range = (Word.Range)find.Parent;

            if (Convert.ToString(range.Text).Equals("\r") == true)
            {
                break;
            }

            text += Convert.ToString(range.Text);
            range.Move(ref unit, ref count);
        }

        // Convert the text to the current code page
        Encoding unicode = Encoding.UTF8;
        Encoding ascii = Encoding.Default;
        retVal = ascii.GetString(Encoding.Convert(unicode, ascii, unicode.GetBytes(text)));

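        // Set the return value from the converted text (not from the raw text,
        // which would discard the code page conversion above)
        retVal = retVal.Replace(Convert.ToChar(8217), '"');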
    }
    catch
    {
    }

    return (retVal);
}

If these two issues could be solved soon, I'd think about licensing Aspose.Report Corporate, as we might also require Aspose.Pdf before long and would also extend our existing Aspose.Excel license.

Any help would be highly appreciated

Regards

Kai Iske
DWS Holding & Services

Hi,

It is technically very difficult for us to implement an object model similar to MS Word's because of the many live collections (the paragraphs of a range, in your case).

But we might be able to do something using IDocumentWriter when it becomes available.

To answer your question: we will think about what we can do, but it’s a new dimension to our existing plans so I cannot yet promise when it will be implemented.

Roman,

thanks for coming back to me. I assume that you are thinking about implementing the text/paragraph extraction using new interfaces, because setting up the document works fine for me. I would only need an extension for extracting the paragraphs based on the Standard style.

But what about the Word document not being correctly displayed within Word 97 after it has been loaded/saved through Aspose.Word?

Regards

Kai Iske

Yes, we will try to fix the distorted tables issue ASAP.

Please check the latest Aspose.Word 1.5.1 hotfix - it addresses the issue with tables in Word 97.

We are still thinking about how to allow you to enumerate paragraphs of a specific style. I will keep you posted.

Roman,

I just wanted to take the chance and get back to a question I have posted earlier this year regarding Aspose.Word.

You may find my initial question here: https://forum.aspose.com/t/130361

The question now is: Is this possible using Aspose.Word now, i.e. finding paragraphs of text loaded into Aspose.Word? This is the only bit that is missing for me to make Aspose.Word really work for my solution.

We’ve been working on the whole thing with Range, Find and Replace, but there is nothing to release yet.

By the way, I checked your requirements once again and it looks like you don't need to replace text in the document, right? You only need to enumerate through it and do your own thing.

If that’s the case I’m pretty sure we can speed up delivery of this enumeration feature as we really want to see you among the users of Aspose.Word.

Roman,

this sounds like great news to me. Yes, I’d only need to be able to extract paragraphs formatted in a given style, say Standard/Normal. There might be paragraphs formatted in styles I’m not interested in so I’d either need to be able to enumerate paragraphs and read their style or find paragraphs given a style name. Then from the found paragraph I need to extract the plain text.

Could you possibly keep me posted on the progress?

TIA

Kai

Hi Kai,

Aspose.Word 1.8.5 has new Document.Accept and IDocumentVisitor to allow enumeration over document content.

Here is sample code:

[TestFixture]
public class TestModelEnumerator
{
    [Test]
    public void TestExtractNormal()
    {
        Document doc = TestUtil.Open(@"Model\TestModelEnumerator.doc");
        NormalExtractingVisitor visitor = new NormalExtractingVisitor();
        doc.Accept(visitor);
        Assert.AreEqual(
        "Normal line 1.\r" +
        "Normal line 2.\r" +
        "\x000c",
        visitor.GetExtractedText());
    }
}

/// <summary>
/// Sample class that shows how to implement IDocumentVisitor
/// to extract text from all paragraphs of Normal style.
/// </summary>
public class NormalExtractingVisitor : IDocumentVisitor
{
    void IDocumentVisitor.DocumentStart(Document doc)
    {
        extractedText = new System.Text.StringBuilder();
    }

    void IDocumentVisitor.DocumentEnd()
    {
        //Do nothing.
    }

    void IDocumentVisitor.SectionStart(PageSetup pageSetup)
    {
        //Do nothing.
    }

    void IDocumentVisitor.SectionEnd()
    {
        //Do nothing.
    }

    void IDocumentVisitor.StoryStart(StoryType storyType)
    {
        //Do nothing.
    }

    void IDocumentVisitor.StoryEnd()
    {
        //Do nothing.
    }

    void IDocumentVisitor.ParagraphStart(ParagraphFormat paragraphFormat)
    {
        isExtracting = (paragraphFormat.StyleName == "Normal");
    }
        
    void IDocumentVisitor.ParagraphEnd()
    {
        isExtracting = false;
    }

    void IDocumentVisitor.RunOfText(Font font, string text)
    {
        if (isExtracting)
            extractedText.Append(text);
    }

    public string GetExtractedText()
    {
        return extractedText.ToString();
    }

    private bool isExtracting;
    private System.Text.StringBuilder extractedText;
}

Roman,

thanks for this update. This sounds like what I am looking for.
However, I have run a quick test against it and results are close but not yet what I need.

I will send you an email with further details (code, template etc.)

Regards

Kai

Roman,

as stated in my forum post, I am attaching a sample application that performs all the steps that are required by my application. The final test that it is running is extracting a translation text using your Visitor pattern implementation.

However, I get the following paragraph text:

\r\r\rEvaluation Only. Created with Aspose.Word. Copyright 2003-2004 Aspose Pty Ltd.\r\a\a\a\a\a\a\a\a FORMTEXT ’14Original Text\a\a

Of course I am getting the Eval Text, which is fine, but all the other control characters and FORMTEXT specifiers shouldn't be there, should they? All I was expecting was the clear text Eval Notice and the text "Original Text".

Hi Kai,

We did not promise your original code that works for MS Word will work without any changes. Let’s work together to finalize the solution.

\r - this is end of paragraph character

\a - this is \x0007 and means end of cell or end of row character

\x000c - end of section character

Then, your “Original Text” is actually a value inside a text input form field. In MS Word documents, fields are stored as text just inside the document. Basically, what you have is:

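\x0013 - field start character

\x0014 - field separator

\x0015 - field end character

The field code sits between the field start character and the field separator, and the field value sits between the field separator and the field end character:

\x0013 FORMTEXT \x0014Original Text\x0015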

As you can see, it is easy for you to filter out \r or \a or other unwanted characters; just use string.Replace for that.

To filter out field codes and field characters, one needs to implement a simple state machine that could work like this:

if got field start character
    isExtracting = false

if got field separator
    isExtracting = true

To avoid searching for field start/end/separator characters inside the strings, I can add special methods to IDocumentVisitor such as FieldStart, FieldSeparator and FieldEnd. But this will be merely cosmetic and you will still need a simple state machine to skip field codes.
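A minimal sketch of such a state machine, assuming the extracted text still contains the raw field characters listed above. FieldTextFilter and StripFieldCodes are hypothetical names for illustration, not part of Aspose.Word, and nested fields are not handled:

```csharp
using System.Text;

// Hypothetical helper (not an Aspose.Word type): removes field codes and
// field characters from extracted text, keeping only the field values.
public static class FieldTextFilter
{
    private const char FieldStart = '\u0013';
    private const char FieldSeparator = '\u0014';
    private const char FieldEnd = '\u0015';

    public static string StripFieldCodes(string text)
    {
        StringBuilder result = new StringBuilder(text.Length);
        bool isExtracting = true;

        foreach (char c in text)
        {
            switch (c)
            {
                case FieldStart:
                    // Everything up to the separator is the field code: skip it.
                    isExtracting = false;
                    break;
                case FieldSeparator:
                    // Everything after the separator is the field value: keep it.
                    isExtracting = true;
                    break;
                case FieldEnd:
                    // Back to ordinary text (also covers fields without a separator).
                    isExtracting = true;
                    break;
                default:
                    if (isExtracting)
                        result.Append(c);
                    break;
            }
        }

        return result.ToString();
    }
}
```

It could be called on the string collected by the visitor, for example FieldTextFilter.StripFieldCodes(visitor.GetExtractedText()).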

Roman,

that’s perfectly fine for me. I just didn’t know how to interpret the various control sequences and characters. Now that I know I’ll try and dig my way through to the actual text that I’m interested in.

Thanks very much again.

I’ll do some further testing (also with regards to Aspose.Pdf) and we’ll see what comes next

Regards

Kai

Roman,

yes, of course, I was able to correctly extract the text from the field et al.
It would be really nice to have some sort of class that holds commonly used control characters/sequences, so that searches can be based on constants rather than the "cryptic" hex character sequences. Or, following your suggestion, have methods that determine whether a given string contains control characters/sequences of a given kind.
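Something along these lines is what I have in mind. This is only a rough sketch on the caller's side: WordControlChars and its members are hypothetical names, not an existing Aspose.Word class, and the character values are the ones listed earlier in this thread:

```csharp
using System.Text;

// Hypothetical constants class (not part of Aspose.Word) that gives names
// to the control characters found in extracted document text.
public static class WordControlChars
{
    public const char ParagraphEnd = '\r';
    public const char CellOrRowEnd = '\u0007';
    public const char SectionEnd = '\u000c';
    public const char FieldStart = '\u0013';
    public const char FieldSeparator = '\u0014';
    public const char FieldEnd = '\u0015';

    // Returns true if the text contains any field start/separator/end character.
    public static bool ContainsFieldChars(string text)
    {
        return text.IndexOfAny(new char[] { FieldStart, FieldSeparator, FieldEnd }) >= 0;
    }

    // Removes end-of-cell/row and end-of-section characters; paragraph ends are kept.
    public static string RemoveStructuralChars(string text)
    {
        StringBuilder result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c != CellOrRowEnd && c != SectionEnd)
                result.Append(c);
        }
        return result.ToString();
    }
}
```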

However, what I noticed with a generated Word document is the following: When I open the document in Word 2002 and attempt to save it, Word loops forever.

I will send you the Word document in a separate email.

Regards

Kai

What exactly do you do to reproduce the problem when saving the document in MS Word?

I can save it okay. I just click the Save toolbar icon. I’ve also tried Save As command.

Roman,

I simply open the document (note: it has been generated using the test app I sent you earlier) and hit Ctrl-S, the toolbar button, or the menu item. Whatever I do, Word enters an infinite loop, chewing up 100% of the CPU.

I’m using Word 2002 SP 2 German.

Regards

Kai

I've tried it on Word XP SP1 (which is Word 2002, I presume) and SP3 as well (could not get SP2 installed). Also tried on Word 2000. It's English of course, not German. Not sure if I have a German version handy, but I'll have a look. Sorry I could not solve this quickly.

In general, the way we try to resolve such things is to gradually remove some elements from the document (in your case from the original template) and then see if the problem disappears. This gives an idea what document element causes the problem and then we can work on it.

Roman,

sorry, but I totally forgot that the document contains a macro that gets called when saving the document. There was a bug in the macro that seemed to show up the first time I tried it with the Aspose.Word generated document. I have now implemented a workaround/fix for this bug and it works like a charm.

Thanks for your help. I will now continue evaluating/testing the component

Regards

Kai

Roman,

going further with my testing, I found out that the document opens as an OLE object in Word 97 rather than as a real Word 97 document. Did you drop the native Word 97 support in the meantime?

…or is it just that I'm missing something again?

Regards

Kai

Not sure what you mean by "opens as an OLE object". OK, I'll try opening the document in MS Word 97.