Java Code to Find & Replace Text containing Line Breaks in Word DOCX Document

Hello,

in the most recent update (2020.7) there seems to be a change in behavior of

public int IReplacingCallback.replacing(ReplacingArgs e)

If I set a value containing “\n” or “\r” with

e.setReplacement(value);
return ReplaceAction.REPLACE;

The line breaks no longer end up in the document as they used to (at least not in the PDF I generate).

I will investigate further when I find some time, but I thought I’d let you know, because it’s a real showstopper for us: all sorts of our generated documents contained garbled lines wherever parts were based on strings containing line breaks.

After I went back to 2020.6 the problem disappeared.

Cheers
Dirk

@DirkSteinkamp,

To ensure a timely and accurate response, please ZIP and attach the following resources here for testing:

  • Your simplified input Word document
  • Aspose.Words 20.7 generated output DOCX document showing the undesired behavior
  • Your expected document showing the correct output. You can create the expected document by using the older 20.6 version of Aspose.Words.
  • Please also create a standalone simple Java application (source code without compilation errors) that helps us to reproduce your current problem on our end and attach it here for testing. Please do not include Aspose.Words JAR files in it to reduce the file size.

As soon as you have these pieces of information ready, we will start investigating your issue and provide you with more information.

Hi @awais.hafeez,
it took me a long time to pin it down, but here it is!
Inspired by recent support from @alexey.noskov I was able to work out a test case to demonstrate the problem.

The behaviour is different if the text to replace is inside a table or outside a table:

  • Within regular text, a replacement containing CR or NL generates new paragraph(s).
  • Within a table, a replacement containing CR or NL does not generate proper new paragraph(s).

I think both variants should behave in the same way.
I actually had the phenomenon using an IReplacingCallback, but the effect is the same with a “regular” replace or with an IReplacingCallback.

The attached outcome is as generated with Aspose Words 22.5.
(Actually, with this test document I’m getting the same result with version 20.6, so there may be some other aspect involved in my original business case that triggered this support request.)

@Test
void testReplaceWithNewLine() throws Exception {
	Document doc = new Document("C:\\Temp\\in.docx");

	Pattern p = Pattern.compile("\\{([\\w$]+)\\}");
	try {
		if (doc.getRange().replace(p, "test with CR\rwith NL\nwith CR+NL\r\nwith NL+CR\n\rThe end.") > 0) {
			System.out.println("replaced");
		}
	} catch (Exception e) {
		e.printStackTrace();
	}

	doc.save("C:\\Temp\\out.docx");
}

in.docx (11.8 KB)
out-actual.docx (9.7 KB)
out-expected.docx (12.0 KB)

@awais.hafeez Thank you for the additional information. This is an interesting one. Actually, Aspose.Words behaves the same for content inside a table and outside a table. If you unzip the output document and inspect document.xml, you will see the following:

<w:p w:rsidR="00CC540C" w14:paraId="7CA404BB" w14:textId="1150A7E6">
	<w:r>
		<w:t>test with CR</w:t>
		<w:cr />
		<w:t>
			with NL
			with CR+NL
		</w:t>
		<w:cr />
		<w:t>
			with NL+CR
		</w:t>
		<w:cr />
		<w:t>The end.</w:t>
	</w:r>
</w:p>
<w:p w:rsidR="00677ACE" w14:paraId="5E45DDD2" w14:textId="395BB521" />
<w:tbl>
     .....................
	<w:tr w14:paraId="3EEFA81A" w14:textId="77777777" w:rsidTr="00677ACE">
		..........................
		<w:tc>
			...............................
			<w:p w:rsidR="00677ACE" w14:paraId="6B127A6D" w14:textId="0B424245">
				<w:r>
					<w:t>test with CR</w:t>
					<w:cr />
					<w:t>
						with NL
						with CR+NL
					</w:t>
					<w:cr />
					<w:t>
						with NL+CR
					</w:t>
					<w:cr />
					<w:t>The end.</w:t>
				</w:r>
			</w:p>
		</w:tc>
	</w:tr>
</w:tbl>

As you can see, the internal representation of the content inside and outside the table is identical.
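
For anyone who wants to repeat this inspection without unzipping by hand: a DOCX file is a ZIP archive, so the main part can be read with the Java standard library alone. The following is only a minimal sketch. The class name DocxXmlReader and the method readDocumentXml are hypothetical, and it assumes the main part is stored as word/document.xml and is UTF-8 encoded, which is the usual case but not guaranteed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for illustration; not part of Aspose.Words. */
final class DocxXmlReader {

    /** Returns the raw XML of the main document part of a DOCX package. */
    static String readDocumentXml(String docxPath) throws IOException {
        // A DOCX file is a ZIP archive, so it can be opened with ZipFile.
        try (ZipFile zip = new ZipFile(docxPath)) {
            // Assumption: the main part uses the conventional entry name.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the part is UTF-8 encoded.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

Printing the result of DocxXmlReader.readDocumentXml("C:\\Temp\\out.docx") should show the w:cr elements quoted above, provided the file was produced by the test case from this thread.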

The actual problem is that using '\r' and '\n' characters in the replacement is not a good idea. If you need a line break in the replacement, you can use either the soft line break character '\u000b' or a paragraph break; for the latter, use the special "&p" metacharacter. So you can use either of the following:

doc.getRange().replace(p, "test with CR\u000bwith NL\u000bwith CR+NL\u000bwith NL+CR\u000bThe end.")

or

doc.getRange().replace(p, "test with CR&pwith NL&pwith CR+NL&pwith NL+CR&pThe end.")
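
If the replacement values come from application data rather than from string literals, the line breaks could be normalized before the value is handed to Aspose.Words. Below is a minimal sketch in plain Java. The class name SoftLineBreaks and the method toSoftBreaks are hypothetical, and treating CR+NL and NL+CR as a single break is an assumption about what the caller wants.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Aspose.Words. */
final class SoftLineBreaks {

    // Matches CR+NL, NL+CR, a lone CR, or a lone NL.
    // The two-character forms come first so each pair counts as one break.
    private static final Pattern LINE_BREAK =
            Pattern.compile("\\r\\n|\\n\\r|\\r|\\n");

    /** Replaces every line break in value with the soft line break '\u000b'. */
    static String toSoftBreaks(String value) {
        return LINE_BREAK.matcher(value).replaceAll("\u000b");
    }
}
```

The result can then be passed as the replacement, for example doc.getRange().replace(p, SoftLineBreaks.toSoftBreaks(value)). Whether the same value also behaves this way when set through e.setReplacement(...) in an IReplacingCallback has not been verified in this thread.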

Thank you! This works! 🙂
Am I correct that I should “escape” a regular & as && in the replacement string, especially to prevent accidental effects when my replacement value is something like “alexey&partner”?

PS: I think I saw the original effect with a .doc file (not .docx), and with that I got different results between an older and a newer Aspose version. As I have a workable solution now, and .doc files are not XML files, I’ll not dig into it any further but rejoice in the solution 🙂
(I wonder why the title talks about docx though … mmm … two years is a long time 😉 …)

@DirkSteinkamp

Yes, you are absolutely correct. Also, you can use the metacharacters in both pattern and replacement strings. See the remarks on the Range.replace method for more information.
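
For arbitrary values, the escaping and the paragraph-break conversion could be combined in one step. The following plain-Java sketch is only an illustration under the assumptions discussed in this thread: "&&" yields a literal ampersand and "&p" yields a paragraph break. The class name ReplacementEscaper and the method toReplacement are hypothetical.

```java
/** Hypothetical helper for illustration; not part of Aspose.Words. */
final class ReplacementEscaper {

    /**
     * Prepares an arbitrary value for use as a replacement string:
     * literal ampersands are doubled, line breaks become "&p".
     */
    static String toReplacement(String value) {
        // Escape ampersands first, so the "&p" added below is not doubled.
        String escaped = value.replace("&", "&&");
        // CR+NL and NL+CR are treated as a single break (an assumption).
        return escaped.replaceAll("\\r\\n|\\n\\r|\\r|\\n", "&p");
    }
}
```

With this, a value such as “alexey&partner” becomes “alexey&&partner”, so it is not mistaken for a paragraph break followed by “artner”.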

I think the reason for the problem is the same: MS Word handles CR characters in a Run’s content differently in the main body and in a table cell. Please feel free to ask in case of any further issues; we will be glad to help you.
