How to delete text next to specific tags using .NET

brm · June 15, 2020, 7:58am

I’ve inherited a mail merge system that I need to rewrite from Word automation to Aspose.
Both systems need to run in parallel (for a time at least), so I need to mimic the old functionality.
One problem I’m running into is that the old system removed a tag and the next character.

I have a document that is used as a template.
The template contains a number of blocks, and logic elsewhere decides which of them are included in the generated document.
The blocks are marked by: “<sblok1>contents</sblok1>” (and any other numbers of course).
I’m able to find and extract the contents by using the ReplacingCallback and inserting bookmarks which are later used to (optionally) insert the contents.

My problem is there might be a character after the closing > which should be removed.
The old code has comments saying it should be a carriage return; BookmarkEnd:NextPreOrder(doc):getText() reports it as Aspose.Words.ControlChar:ParagraphBreakChar.
It also might be something else entirely, since there is no validation whatsoever.

The old Word automation method of removing these tags used Word’s Range.moveEnd to select the next character (whatever it was) and removed it with range.delete().
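
For reference, the behaviour being mimicked can be sketched on a plain string. This is only a minimal Python illustration of "delete the closing tag plus whatever single character follows it"; it is not Aspose.Words or Word automation code, and the helper name remove_tag_and_next_char is made up for the example. It assumes the closing tags look like </sblokN>, as described above.

```python
import re

# Hypothetical helper for illustration only (plain strings, no Aspose API).
# re.DOTALL lets "." match a line feed as well; ".?" makes the trailing
# character optional so a tag at the very end of the text is still removed.
CLOSING_TAG_AND_NEXT = re.compile(r"</sblok\d+>.?", re.IGNORECASE | re.DOTALL)


def remove_tag_and_next_char(text: str) -> str:
    """Delete every closing block tag together with the one character after it."""
    return CLOSING_TAG_AND_NEXT.sub("", text)
```

Whatever follows the tag (a carriage return, a space, a letter) is removed without validation, which matches the old behaviour described above, including its risk of eating a character that should have stayed.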

In the attached overview, I want to remove the purple ¶ characters: overview.png (28.3 KB)

How do I remove the character following a BookmarkEnd (or a search result)?

tahir.manzoor · June 15, 2020, 4:06pm

@brm

Please note that the Aspose.Words model is quite different from the Microsoft Word Object Model: it represents the document as a tree of objects, much like an XML DOM tree. If you have worked with any XML DOM library, you will find Aspose.Words easy to understand and work with. When you load a Word document into Aspose.Words, it builds its DOM, and all document elements and formatting are loaded into memory. Please read the following article for more information on the DOM:
Aspose.Words Document Object Model

Regarding deleting content, we suggest you please read the following article.
Find and Replace

If you still face problems, please ZIP and attach your input and expected output documents, along with the content that you want to remove. We will then provide you with more information.

brm · June 26, 2020, 12:28pm

Thanks for your reply!
I studied the DOM and I still believe using BookmarkEnd:NextPreOrder() should return the correct/next node, but when I query the text of the node or its range I get seemingly incorrect values.
I also played around with removing the entire following paragraph, but that usually made things even worse.

I’ve attached the requested documents, but I am looking for a more generic approach. Just because this template follows the blocks with a linefeed/empty paragraph does not mean all of the (literally thousands of) others will.

Just in case I’m doing something entirely wrong in a preceding step, here’s my current workflow to convert the original document into one I use to later fill with the data:

1. Search for (?i)<sblok(\d+)> with a ReplacingCallback to insert bookmarks named for the block.
1.a I tried appending ., \s, \r and a slew of other whitespace or line-break control characters, including the Aspose metacharacters &p, &b, &m and &l, but they usually just resulted in no matches.
2. Loop through the inserted bookmarks, copy the contents to a new document and save it, then set the bookmark:text to "".
3. Save the template.

Then, when it’s time to create a document, I find the bookmarked file (set Aspose.Words.SectionStart:Continuous) and loop through the bookmarks. For each one I find the file with that bookmark’s contents, clone() it, replace() its fields, and use builder:MoveToBookmark() and builder:InsertDocument().
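
The workflow above can be sketched on plain strings to show where the stray character comes from. This is a rough Python illustration under stated assumptions, not Aspose.Words code: the [[blokN]] placeholders stand in for the inserted bookmarks, a dictionary stands in for the separately saved block documents, and the names split_template and build_document are hypothetical.

```python
import re

# Plain-string stand-in for the workflow; nothing here calls Aspose.Words.
BLOCK = re.compile(r"<sblok(\d+)>(.*?)</sblok\1>", re.IGNORECASE | re.DOTALL)
PLACEHOLDER = re.compile(r"\[\[blok(\d+)\]\]")


def split_template(text):
    """Return (skeleton, blocks): each block is cut out and replaced by a placeholder."""
    blocks = {}

    def stash(match):
        number, contents = match.group(1), match.group(2)
        blocks[number] = contents          # plays the role of the saved block document
        return "[[blok" + number + "]]"    # plays the role of the inserted bookmark

    return BLOCK.sub(stash, text), blocks


def build_document(skeleton, blocks, wanted):
    """Fill in the wanted blocks and drop the others."""

    def fill(match):
        number = match.group(1)
        return blocks[number] if number in wanted else ""

    return PLACEHOLDER.sub(fill, skeleton)


template = "Dear X,\r<sblok1>Para one.</sblok1>\r<sblok2>Para two.</sblok2>\rBye"
skeleton, blocks = split_template(template)
result = build_document(skeleton, blocks, {"2"})
# result == "Dear X,\r\rPara two.\rBye"
```

Dropping block 1 leaves its trailing "\r" behind, so the result contains an empty paragraph. That leftover is the character following the closing tag that the old system removed.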

deleting-next-character.zip (124.8 KB)

tahir.manzoor · June 26, 2020, 7:14pm

@brm

To ensure a timely and accurate response, please create a standalone console application (source code without compilation errors) that helps us reproduce your problem on our end, and attach it here for testing. We will investigate the issue and provide you with more information.

PS: To attach these resources, please zip and upload them.

brm · August 25, 2020, 10:46am

Sadly I was not in a position to create a standalone console application (I’m using a different language), but I solved my problem and will post my solution below.

I tried some other options and got it to work (for my specific situation) in one of two ways.

1. By not using a regex
Because the block names were simply blok followed by a sequential number, I could just use an exhaustive search.
Without a regex I could use &p to search for an empty paragraph, which gave the correct match.

2. By using a different/wrong regex
By using a quantifier of 1 ({1}) I forced a search for exactly one match of any character (exactly what leaving the quantifier off should do, as far as I know), which gave me the correct results.

i.e. (?i)<\/blok(\d+)>.{1} instead of (?i)<\/blok(\d+)>. (the same pattern ending in a bare dot).
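
The expectation that the two patterns are equivalent can be checked outside Aspose. The sketch below uses Python's re module on plain strings, so it is not the .NET engine or the Aspose.Words find-and-replace, and it says nothing about why the two patterns behaved differently there. The helper name strip_both is hypothetical.

```python
import re

# The two patterns from the post, compiled with Python's re module.
BARE_DOT = re.compile(r"(?i)<\/blok(\d+)>.")
QUANTIFIED = re.compile(r"(?i)<\/blok(\d+)>.{1}")


def strip_both(sample: str):
    """Return the sample with matches removed, once per pattern."""
    return BARE_DOT.sub("", sample), QUANTIFIED.sub("", sample)


# A carriage return after the tag is matched by "." in both patterns.
print(strip_both("a</blok1>\rb"))   # ('ab', 'ab')
# A line feed is not matched by "." (without DOTALL), so neither pattern matches.
print(strip_both("a</blok3>\nb"))   # ('a</blok3>\nb', 'a</blok3>\nb')
```

In this engine both patterns give identical results for every input, so the difference reported above appears to be specific to how the pattern was processed in that setup, not to the regex syntax itself.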

tahir.manzoor · August 25, 2020, 5:27pm

@brm

It is good to hear that you have found a solution. Please confirm whether your problem has been solved.

brm · August 26, 2020, 7:00am

Through a workaround my problem has been solved.