Formatting Text within Custom Tags

dsauter · July 22, 2009, 11:36am

I hope someone can help me with a few questions. We are looking at migrating away from Word COM Automation to avoid installing MS Word on our servers. Currently, we load a text file into a Word document and then look for custom tags to perform various formatting, e.g. inserting paragraphs, bolding, indenting, etc. My questions are:

1. We use Word COM methods such as MoveDown to move the cursor down “x” number of lines to insert text. Is there an equivalent method in Aspose to move the cursor down “x” number of lines?
2. The text file that we load into Word contains the “\n” newline characters. What would be the appropriate way to convert these “\n” newline characters into actual paragraph breaks in the Word document?
3. We use custom tags to delimit text that should be formatted. For example, we use the “|” pipe character around text that should be bolded, e.g. “|This text should be bolded|”. I’m assuming that I would somehow use the Replace method with a ReplaceEvaluator, but I can’t quite figure out how based on the examples that I’ve seen.
Thanks in advance for any help!

Klepus · July 22, 2009, 6:26pm

Hello!
Thank you for considering Aspose.
You can replace Microsoft Word Automation with Aspose.Words with some minor changes. Our team will help you in migration. Here are the answers:

1. In Aspose.Words there is no direct equivalent of the MoveDown method. Aspose.Words deals with logical document entities such as sections and paragraphs, not lines of text. A line is an entity related to layout, not structure. Most probably you can replace MoveDown with navigation among the document nodes. For instance, you can find the X-th paragraph after the current one. If you show me a realistic sample I will try to advise more.
2. This is very easy. You can pass multiline strings to the DocumentBuilder.Write method and it will split them into paragraphs:
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.Write("Some\nmultiline\ncontents");
3. You can locate your custom tags with the Range.Replace overload that takes a ReplaceEvaluator:
https://reference.aspose.com/words/net/aspose.words/range/replace/
But there is no one-call API to apply formatting to the matched pattern. Formatting such as the bold attribute is applied to text runs. Since the matched pattern can span several runs and start or end in the middle of a run, those runs have to be split. Maybe this could be done more easily, I’ll think a bit more.
Please feel free to ask any other questions. Your feedback is very important and much appreciated.
Regards,

dsauter · July 22, 2009, 7:15pm

Thanks for the quick reply. It is much appreciated.
For Item 1), I’ll play around with the “MoveTo” method in conjunction with the Sections and/or Paragraphs to see if I can replicate the functionality. The following Word COM code is basically the code that I want to be able to replicate:

Word.ApplicationClass myAppClass = new Word.ApplicationClass();
object myMissing = Missing.Value;
myAppClass.ActiveWindow.ActivePane.View.SeekView = Word.WdSeekView.wdSeekFirstPageHeader;
object myDowncount = 2;
object myMoveUnit = Word.WdUnits.wdLine;
myAppClass.Selection.MoveDown(ref myMoveUnit, ref myDowncount, ref myMissing);
myAppClass.Selection.InsertAfter("Insert Text Here - 2 lines down from last line in the Header");

For Item 2), when I load the text file that has the “\n” characters already in it, they are not translated to paragraph breaks. In other words, they are being treated as the literal string “\n”. I used the “LoadTxt” example from your Web site (download file is Aspose.Words.NET.Samples.zip). Is there some other way that I need to load the text file in order to have the “\n” translated to a paragraph break? I’m not creating the file from scratch using the Write method but instead loading an existing text file.
For Item 3), I saw the example where text is highlighted, but I’m having a little trouble because I’m not actually searching for text but trying to extract a section of text between certain markers, e.g. “|” or “~”, and apply the formatting just to that section. I think I may have to look into calling the ReplaceEvaluator function twice to get the beginning and ending positions and somehow apply the formatting to the substring of text between my markers, but I’m not sure.
One followup question I have is that when I use the ReplaceEvaluator, it appears from e.MatchOffset that it is searching the document from the bottom to the top. In other words, on successive calls to the ReplaceEvaluator, the e.MatchOffset value is getting smaller and the matches are found in the direction of the beginning of the document. Is that normal or am I doing something wrong?
Thanks again for any suggestions you might have.

Klepus · July 23, 2009, 7:05am

Hello!
Thank you for your thoughtful feedback.
For Item 1), I see the task is placing some text two lines down from the last line in the header. This means you are adding two empty paragraphs and a paragraph with text. Okay, this can be performed with DocumentBuilder class:

private static void Sample()
{
    Document doc = new Document("source.doc");
    // DocumentBuilder is a helper class making easier manipulations with a document.
    // It acts as a cursor, somewhat analogous to Selection in Automation.
    DocumentBuilder builder = new DocumentBuilder(doc);
    // Move to the first page header. If the header doesn't exist it will be created automatically.
    // This will place the cursor at the beginning of the header.
    builder.MoveToHeaderFooter(HeaderFooterType.HeaderFirst);
    // Move to the end of the header (which is the current story).
    builder.MoveTo(builder.CurrentStory.LastParagraph);
    // Add some text here. '\n' characters will be automatically recognized and extra paragraphs added.
    builder.Write("\n\nInsert Text Here - 2 lines down from last line in the Header");
    // This line is needed if your document didn't have the first page header before.
    // Even if DocumentBuilder created it automatically, header will be hidden until this property is true.
    // If all source documents already have first page headers then you don't need it.
    doc.FirstSection.PageSetup.DifferentFirstPageHeaderFooter = true;
    doc.Save("result.doc");
}

For Item 2), you can use the same technique as in 1). Just load your text file and add all its contents with DocumentBuilder.Write. It automatically splits the text between ‘\n’ characters into paragraphs.
For Item 3), the article you mention describes a very similar task. I’m also looking at it and thinking about what to change for your case:
https://docs.aspose.com/words/net/find-and-replace/
The Range.Replace method accepts a Regex, a class representing a regular expression. You can find not only constant text but any fragment matching a pattern expressed as a regular expression. In your case I would try a regex like this:
Regex regex = new Regex(@"\|.*\|");
Here is my favorite site about regular expressions:
http://www.regular-expressions.info/
ReplaceEvaluator can perform simple text replacements natively. But if you need to change formatting then standard replacement should be intercepted. Otherwise node structure with just changed formatting will be modified without control from application code.
In your ReplaceEvaluator you should do one more thing: cut the vertical pipe characters at beginning and end of the phrase – from the first and last collected runs. If any of these runs become empty then just remove it from the document. Other actions are the same: change formatting for all collected runs and return ReplaceAction.Skip instead of ReplaceAction.Replace.
This all looks tricky. We plan to improve usability in the future. I registered this case as a new issue (reference number #9764). If you have any ideas please share with our team.
For Item 4) (new question), This overload has a parameter defining direction:
https://reference.aspose.com/words/net/aspose.words/range/replace/
ReplaceEvaluatorArgs.MatchOffset is not a position within the document. It’s a position from the start of ReplaceEvaluatorArgs.MatchNode.
Regards,

dsauter · July 23, 2009, 8:34am

Thanks again for your quick response. Your comments are really helpful. Sorry to keep asking more questions but I would really like to use your product if I can figure out how to do the same things we are doing with Word COM Automation.
For Item 1), thanks - this looks straightforward enough so I will give that a try.
For Item 4), thanks - I realized I was calling the replace evaluator with the false parameter for the direction so it wasn’t searching forward, duh.
For Item 2), my text file does already contain the “\n” characters in it but it does not seem to be replacing them with actual paragraph marks. The code I tried is as follows:
The code below is from the “LoadTxt” example on your Web Site.

using (StreamReader reader = new StreamReader(dataDir + "sdnlist.txt", Encoding.UTF8))
{
    // Read plain text "lines" and convert them into paragraphs in the document.
    while (true)
    {
        string line = reader.ReadLine();
        if (line != null)
            builder.Writeln(line);
        else
            break;
    }
}

I also tried the following:

using (StreamReader reader = new StreamReader(dataDir + "sdnlist.txt", Encoding.UTF8))
{
    // Read plain text
    while (true)
    {
        string text = reader.ReadToEnd();
        builder.Write(text);
        break;
    }
}

However, neither of these seems to convert the “\n” to a paragraph break. In fact, if you look at the text after it has been loaded using the Document.Range method, I noticed that the “\n” characters have actually been converted to “\\n”, i.e. an additional backslash (\) has been prepended. The only way I seem to be able to get paragraph breaks in the document is to use the following:

sDoc.Range.Replace(new Regex(@"\\n"), new ReplaceEvaluator(ParagraphEvaluator), false);
}
private static ReplaceAction ParagraphEvaluator(object sender, ReplaceEvaluatorArgs e)
{
    e.Replacement = ControlChar.ParagraphBreak;
    return ReplaceAction.Replace;
}

But the document is fairly large and it takes over 5 minutes to go through and put the paragraph breaks in. It would seem that since the “\n” is already in the text document when it is loaded that I shouldn’t need to do this step.
For Item 3), I did try the regular expression that you mentioned, but the problem (and it may be related to Item 2 above) is that it is matching the text from the very first occurrence of the “|” to the very last occurrence of the “|” instead of finding each “pair”. For example, given the following text:
|This should be bolded|This should not be bolded|This should be bolded again|
It is returning the entire string as a match. What I want is actually 2 separate matches, i.e.
|This should be bolded|
|This should be bolded again|
but not the text “This should not be bolded”
One other interesting thing that may be related to the paragraph problem is that when I get into the ReplaceEvaluator method and look at the value of e.MatchNode, it is returning the entire document. Is that because the document is just one huge paragraph? I’m thinking that if I can get the document actually broken into paragraphs before I start matching text then it might help things.
I’m also a little unclear on the concept of the “Run”, since my text is all basically formatted, it seems that I only have 1 run, which is the entire document. It seems like I need to be able to extract text between my markers, make that a “run”, and then format that run. Is that correct?
Thanks again for all of your help. Your product looks very nice and I hope we can get past these last few issues.

Klepus · July 23, 2009, 6:01pm

Hi again!
It is very strange that the lines from the input file are not split into paragraphs. I’ve tried the code below with the attached file and it all works fine: I get a document with three paragraphs. If you cannot get this to work, please attach your text file here.
The code:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
using (StreamReader reader = new StreamReader("test.txt"))
{
    string text = reader.ReadToEnd();
    builder.Write(text);
}
doc.Save("result.doc");

Regarding replacements in large documents, you are right: they are very slow. The more replacements occur in the document, the slower the whole procedure is. This is a known issue logged as #3028. We’ll notify you when search/replace performance improvements are available.
If my regular expression doesn’t find multiple occurrences correctly, you can try the so-called “lazy star”. Lazy wildcards capture the least possible text. You can read more about this on the site I referenced previously. Here is the sample I experimented with:

const string text = "|This should be bolded|This should not be bolded|This should be bolded again|";
Regex regex = new Regex(@"\|.*?\|");
MatchCollection matches = regex.Matches(text);
string m1 = matches[0].Value;
string m2 = matches[1].Value;

Regarding MatchNode, from my experience it’s usually a node of class Run. Maybe we should first get the other things right, e.g. properly inserting and finding text. The rest should then be clearer. If you look at the sample with text highlighting you can see that MatchNode references a Run instance on every call. The code doesn’t even check this, it just casts to Run.
I’ll explain in more detail what a run (class Run) is. Basically, a run is a part of the text in a paragraph that has its own set of attributes. For instance, one paragraph can contain text parts that are formatted differently or belong to different revisions. They can also be divided by other document nodes such as bookmark start/end or smart tags. This is illustrated in the same article about highlighting text. A run cannot occupy the whole document. If you have some text then the document has at least one section. That section has a single body node. The body has one paragraph. And that paragraph has one run. Why do we need all this magic with finding and splitting runs? Suppose your document contains the following text in one run:
“|This should be bolded|This should not be bolded|This should be bolded again|”
After you make the delimited parts bold there will be three runs, not one: every part with distinct formatting occupies at least one run. Fortunately we have a working sample. All we need to add is cutting the vertical pipe characters from the start and end runs of the sequence. This is not difficult. The Run class has a Text property that can be read and set. So we just remove the pipe from a starting or ending run containing more than one character. Outermost runs containing only one pipe character should be removed from the document completely.
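The idea that each differently formatted part ends up in its own run can also be approached from the other side: split the plain text into bold and non-bold segments before it ever reaches the document. The following is only a sketch of that alternative, not the run-splitting technique from the highlighting article. The class name PipeMarkup and the method Split are hypothetical helpers made up for illustration; only the lazy pattern comes from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.Words): splits text that uses
// |...| markers into segments flagged as bold or not bold.
public static class PipeMarkup
{
    // Lazy pattern from the thread, with a capture group around the inner text.
    private static readonly Regex Tagged = new Regex(@"\|(.*?)\|");

    public static List<(string Text, bool Bold)> Split(string text)
    {
        var segments = new List<(string Text, bool Bold)>();
        int position = 0;

        foreach (Match match in Tagged.Matches(text))
        {
            // Text between the previous match and this one is not bold.
            if (match.Index > position)
                segments.Add((text.Substring(position, match.Index - position), false));

            // Text between the two pipes is bold; an empty "||" pair is skipped.
            if (match.Groups[1].Length > 0)
                segments.Add((match.Groups[1].Value, true));

            position = match.Index + match.Length;
        }

        // Whatever follows the last closing pipe is not bold.
        if (position < text.Length)
            segments.Add((text.Substring(position), false));

        return segments;
    }
}
```

Assuming the text is inserted with DocumentBuilder, each segment could then be written by setting builder.Font.Bold to the segment's flag before calling builder.Write with the segment's text, so no search and replace over the finished document would be needed. Note that "." does not match a line feed by default, so a pair of markers is only recognized within a single line.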
Please let me know if I forgot anything.
Have a nice day,

dsauter · July 27, 2009, 12:35pm

Thanks again for all of your help. Your suggestions have given me enough to get it (mostly) working. I appreciate your detailed responses.
For some strange reason, I’m still having trouble with the “\n” characters in the source document. I’m attaching the input and output files that are produced with the following code:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
using (StreamReader reader = new StreamReader(dataDir + "List.txt"))
{
   string text = reader.ReadToEnd();
   builder.Write(text);
}
builder.Write("\nThis is a break\nThis is another break");
builder.Document.Save(dataDir + "List.doc");

The contents of List.Txt = “This is some text\nThere should be a paragraph break\nWe will see what happens\nWill it recognize the line feeds\n”
When I look at the builder.Document.ToTxt() result right before saving the document, the “\n” characters that were loaded from the List.txt input document have another backslash “\” prepended to them. However, the “\n” characters that I manually added to the document after it was loaded were converted to “\r\n”. See the text below:
builder.Document.ToText()
----------------------------
"Evaluation Only. Created with Aspose.Words. Copyright 2003-2008 Aspose Pty Ltd.\r\nThis is some text\\nThere should be a paragraph break\\nWe will see what happens\\nWill it recognize the line feeds\\n\r\nThis is a break\r\nThis is another break\r\n"
I’m not sure why that is or if it is causing a problem, but the “\n” characters that I manually put in have “\r\n” carriage return-linefeed combinations substituted for them, and they appear as paragraph breaks in the resulting Word document. It’s just the ones inside the text file that are not converted.
The output file List.doc is attached. You will notice that the “\n” characters from the file do not come through as paragraph marks but as the string “\n”. However, the text that I manually added using builder.Write does translate the “\n” into a carriage return-linefeed.
What am I doing wrong?
Thanks again for your help.

alexey.noskov · July 27, 2009, 1:48pm

Hi

Thanks for your request. Actually, ‘\n’ is a special character in source code, but in your case, when you type ‘\n’ into a TXT document, it is just a sequence of two characters, ‘\’ and ‘n’. You should just insert real line breaks in your TXT file. I attached a sample document.
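If the text file cannot be regenerated with real line breaks, one possible workaround is to convert the two-character sequence in memory before handing the string to DocumentBuilder.Write. This is a minimal sketch under that assumption. EscapedNewlines and Unescape are hypothetical names made up for illustration, not part of Aspose.Words.

```csharp
using System;

// Hypothetical helper (not part of Aspose.Words): turns the literal
// two-character sequence backslash + 'n' into a real line feed.
public static class EscapedNewlines
{
    public static string Unescape(string raw)
    {
        if (raw == null) throw new ArgumentNullException(nameof(raw));

        // "\\n" in C# source is backslash followed by 'n';
        // "\n" in C# source is a single line feed character.
        return raw.Replace("\\n", "\n");
    }
}
```

With the code from this thread, the call would become builder.Write(EscapedNewlines.Unescape(text)). Because it is a plain string replacement done before the document is built, it avoids running Range.Replace over a large document. It will also convert a backslash followed by 'n' that is meant literally (for example inside a Windows path), so it only fits files where that sequence always stands for a line break.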
Best regards.