How to remove references from the Paragraph and get the plain Text

jainh3 · June 17, 2024, 7:32am

Hi,

I need to find unpaired punctuation in each paragraph of a document. But when I get a paragraph and its text, the text includes both the reference (the field code) and the reference text. So the same text appears twice: once inside the reference and once as the displayed text.

Example- TC “ REF _Ref435444913 \r 2 DEMISE, RENTS AND OTHER PAYMENTS” \l1 Demise, Rents and Other Payments

The displayed text doesn’t have quotes in it, but the reference does, so it is getting picked up.

Is there any way to get the paragraph as plain text, or to remove the references?

alexey.noskov · June 17, 2024, 7:37am

@jainh3 You can use the Document.UnlinkFields method to replace all fields in the document with simple text.
Also, you can unlink only particular types of fields. For example, the following code unlinks all fields except PAGE fields:

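using System.Linq;
using Aspose.Words;
using Aspose.Words.Fields;

Document doc = new Document(@"C:\Temp\in.docx");

// Unlink every field except PAGE fields, replacing each one with its displayed text.
doc.Range.Fields.Where(f => f.Type != FieldType.FieldPage).ToList()
    .ForEach(f => f.Unlink());

doc.Save(@"C:\Temp\out.docx");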

If your goal is to get the plain text of a particular paragraph, simply use the ToString method:

string paraText = para.ToString(SaveFormat.Text).Trim();

In this case there will not be field codes in the returned string, only the field displayed text.
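
To illustrate the difference between raw paragraph text and displayed text, here is a minimal sketch in plain Java that does not use Aspose.Words at all. It assumes the raw text marks fields with the control characters U+0013 (field start), U+0014 (field separator) and U+0015 (field end), which is how Word-style field characters are commonly represented; the class name FieldCodeStripper and the method stripFieldCodes are hypothetical names chosen for this example. It keeps only field results and drops field codes, including nested ones.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldCodeStripper {
    // Assumed control characters delimiting fields in raw paragraph text.
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /** Returns the text with field codes removed and field results kept. */
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are in its code part,
        // false once its separator has been seen (result part).
        Deque<Boolean> inCode = new ArrayDeque<>();
        int hidden = 0; // how many enclosing fields are still in their code part
        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                inCode.push(true);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                if (!inCode.isEmpty() && inCode.peek()) {
                    inCode.pop();
                    inCode.push(false);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty() && inCode.pop()) {
                    hidden--; // field had no separator, so it had no result
                }
            } else if (hidden == 0) {
                out.append(c); // visible only if no enclosing field is in its code part
            }
        }
        return out.toString();
    }
}
```

In practice the ToString(SaveFormat.Text) call above already gives this result; the sketch is only meant to show why the field code text disappears while the displayed text stays.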

jainh3 · June 17, 2024, 8:51am

@alexey.noskov Thanks, it works.
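
For the original goal of finding unpaired punctuation, a minimal sketch of one possible check on the plain paragraph text follows. It is plain Java and independent of Aspose.Words; the class name UnpairedPunctuation, the method findUnpaired, and the chosen set of pairs (round, square and curly brackets plus curly double quotes) are assumptions made for this example, not part of the thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

final class UnpairedPunctuation {
    // Opening and closing characters at the same index form a pair.
    private static final String OPENERS = "([{\u201C";
    private static final String CLOSERS = ")]}\u201D";

    /** Returns the sorted indices of punctuation characters that have no partner. */
    static List<Integer> findUnpaired(String text) {
        List<Integer> unpaired = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>(); // indices of openers not yet closed
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (OPENERS.indexOf(c) >= 0) {
                open.push(i);
                continue;
            }
            int kind = CLOSERS.indexOf(c);
            if (kind < 0) {
                continue; // not a tracked punctuation character
            }
            if (!open.isEmpty() && text.charAt(open.peek()) == OPENERS.charAt(kind)) {
                open.pop(); // closer matches the most recent opener
            } else {
                unpaired.add(i); // closer without a matching opener
            }
        }
        unpaired.addAll(open); // openers that were never closed
        Collections.sort(unpaired);
        return unpaired;
    }
}
```

Passing the string returned by ToString(SaveFormat.Text) to findUnpaired checks only the displayed text, so quotes that occur solely inside a field code are not reported.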

Crane · March 25, 2025, 12:12pm

@alexey.noskov

Why did I still get this after ‘unlink’?

The expected text is 图 1.1

alexey.noskov · March 25, 2025, 12:18pm

@Crane Could you please attach your input document here for our reference? We will check the issue and provide you more information.

Crane · March 25, 2025, 1:01pm

Document doc = new Document("table title.docx");
FieldCollection fields = doc.getRange().getFields();
for (Field field : fields) {
    field.unlink();
}
Paragraph lastParagraph = doc.getFirstSection().getBody().getLastParagraph();
Assertions.assertEquals("表 一.1 表格", lastParagraph.toString(SaveFormat.TEXT));

table title.docx (17.1 KB)

@alexey.noskov

alexey.noskov · March 25, 2025, 1:14pm

@Crane Thank you for the additional information. As far as I can see, the fields are unlinked properly. You can save the output to DOCX to check.

Crane · March 25, 2025, 1:17pm

“表一.1 表格”

I just want to get this text from the paragraph, not the field code.

alexey.noskov · March 25, 2025, 1:20pm

@Crane There is no field code in the output. Even if you do not unlink the fields, there is no field code in the string returned by the following code:

Document doc = new Document("C:\\Temp\\in.docx");
        
Paragraph lastParagraph = doc.getFirstSection().getBody().getLastParagraph();
String paraText = lastParagraph.toString(SaveFormat.TEXT).trim(); // Returns "表 一1 表格"

Crane · March 25, 2025, 1:30pm

image.png (1.9 KB)

@alexey.noskov

The decimal point is garbled.
I need the plain text as it appears visually, because I have to run a matching operation on it.

alexey.noskov · March 25, 2025, 1:32pm

@Crane There is no decimal point in the input document. There is a non-breaking hyphen:

<w:noBreakHyphen/>

Crane · March 25, 2025, 1:34pm

Is there a way to get the same content and get this test case to pass?

@Test
void test5() throws Exception {
    Document doc = new Document("table title.docx");
    Paragraph lastParagraph = doc.getFirstSection().getBody().getLastParagraph();
    String paraText = lastParagraph.toString(SaveFormat.TEXT).trim();
    Assertions.assertEquals("表 一.1 表格", paraText);
}

alexey.noskov · March 25, 2025, 1:37pm

@Crane You can replace the non-breaking hyphen with a period:

String paraText = lastParagraph.toString(SaveFormat.TEXT).trim().replace(ControlChar.NON_BREAKING_HYPHEN_CHAR, '.');
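
The same replacement can be tried on a plain string without a document. This is a minimal sketch under the assumption that ControlChar.NON_BREAKING_HYPHEN_CHAR has the value U+001E (character 30); the class name ExtractedTextNormalizer and the method normalize are hypothetical names used only for this example.

```java
final class ExtractedTextNormalizer {
    // Assumed value of Aspose.Words ControlChar.NON_BREAKING_HYPHEN_CHAR.
    static final char NON_BREAKING_HYPHEN = '\u001E';

    /**
     * Replaces every non-breaking hyphen control character with the given
     * visible character and trims surrounding whitespace.
     */
    static String normalize(String text, char hyphenReplacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(c == NON_BREAKING_HYPHEN ? hyphenReplacement : c);
        }
        return out.toString().trim();
    }
}
```

Calling normalize(paraText, '.') yields text that can be compared against a string containing a period. Whether a period or a hyphen is the right replacement depends on what the matching operation expects, since the character stored in the document is a non-breaking hyphen.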

Crane · March 25, 2025, 1:39pm

So there’s no way to get the right text directly? In that case I’ll have to enumerate these characters myself in order to match.

alexey.noskov · March 25, 2025, 1:41pm

@Crane Aspose.Words returns the text that is in the document. As I have mentioned, in your document there is a non-breaking hyphen. So Aspose.Words returns a non-breaking hyphen in the text.

Crane · March 25, 2025, 1:43pm

That is, in MS Word that point is not a decimal point, but a ControlChar.NON_BREAKING_HYPHEN_CHAR?

alexey.noskov · March 25, 2025, 1:45pm

@Crane A decimal point in MS Word is a decimal point, and a non-breaking hyphen is a non-breaking hyphen. In your document there is a non-breaking hyphen between the STYLEREF and SEQ fields, not a decimal point.

Crane · March 25, 2025, 1:48pm

OK, thank you for your patience.