How to remove references from a paragraph and get the plain text

Hi,

I need to find unpaired punctuation marks in each paragraph of a document. But when I get a paragraph and its text, the text includes the reference field code as well as the reference text. So the same text appears twice: once inside the reference and once as the displayed text.

Example: TC “ REF _Ref435444913 \r 2 DEMISE, RENTS AND OTHER PAYMENTS” \l1 Demise, Rents and Other Payments

The displayed text doesn't contain quotes, but the field code does, so it gets picked up.

Is there any way to get the paragraph as plain text, or to remove the reference?

@jainh3 You can use the Document.UnlinkFields method to replace all fields in the document with plain text.
You can also unlink only particular types of fields. For example, the following code unlinks all fields except PAGE fields:

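using System.Linq;
using Aspose.Words;
using Aspose.Words.Fields;

Document doc = new Document(@"C:\Temp\in.docx");

// Unlink every field except PAGE fields, replacing each with its displayed result.
doc.Range.Fields.Where(f => f.Type != FieldType.FieldPage).ToList()
    .ForEach(f => f.Unlink());

doc.Save(@"C:\Temp\out.docx");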

If your goal is to get the plain text of a particular paragraph, simply use the ToString method:

string paraText = para.ToString(SaveFormat.Text).Trim();

In this case the returned string contains no field codes, only the text the fields display.
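For context: if the text is taken from the raw node text instead (for example via Paragraph.GetText() in .NET or getText() in Java), the field codes are included and each field is wrapped in control characters. Below is a minimal Java sketch of filtering them out by hand. It assumes the field start, separator and end characters are U+0013, U+0014 and U+0015, which is what I understand the corresponding ControlChar constants to be. The FieldCodeStripper class and its method are hypothetical names, not part of Aspose.Words.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Aspose.Words.
final class FieldCodeStripper {
    // Assumed values of the field control characters in raw node text.
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    // Keeps only text that is outside every field code, including nested fields.
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are still in its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        int codeDepth = 0; // number of open fields whose code part we are inside

        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_START) {
                inCode.push(true);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from its code part to its result part.
                if (!inCode.isEmpty() && inCode.peek()) {
                    inCode.pop();
                    inCode.push(false);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator ends while still in its code part.
                if (!inCode.isEmpty() && inCode.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

In practice ToString(SaveFormat.Text) is the simpler route; the sketch only shows what is being filtered out. A nested REF inside a TC field, as in the example above, is dropped together with the outer field code.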

@alexey.noskov Thanks, it works.


@alexey.noskov

Why do I still get this after calling unlink?

The expected result is 图 1.1

@Crane Could you please attach your input document here for our reference? We will check the issue and provide you more information.

Document doc = new Document("table title.docx");
FieldCollection fields = doc.getRange().getFields();
for (Field field : fields) {
    field.unlink();
}
Paragraph lastParagraph = doc.getFirstSection().getBody().getLastParagraph();
Assertions.assertEquals("表 一.1 表格", lastParagraph.toString(SaveFormat.TEXT));

table title.docx (17.1 KB)

@alexey.noskov

@Crane Thank you for the additional information. As far as I can see, the fields are unlinked properly. You can save the output to DOCX to check.

“表 一.1 表格”

I just want to get this text from the paragraph, not the field code.

@Crane There is no field code in the output. Even if you do not unlink the fields, there is no field code in the string returned by the following code:

Document doc = new Document("C:\\Temp\\in.docx");
        
Paragraph lastParagraph = doc.getFirstSection().getBody().getLastParagraph();
String paraText = lastParagraph.toString(SaveFormat.TEXT).trim(); // Returns "表 一1 表格"


@alexey.noskov

The decimal point is garbled.
I need the text as it appears visually, because I have to run a matching operation on it.

@Crane There is no decimal point in the input document. There is a non-breaking hyphen:

<w:noBreakHyphen/>

Is there a way to get the same content and make this test case pass?

@Test
void test5() throws Exception {
    Document doc = new Document("table title.docx");
    Paragraph lastParagraph = doc.getFirstSection().getBody().getLastParagraph();
    String paraText = lastParagraph.toString(SaveFormat.TEXT).trim();
    Assertions.assertEquals("表 一.1 表格", paraText);
}

@Crane You can replace the non-breaking hyphen with a period:

String paraText = lastParagraph.toString(SaveFormat.TEXT).trim().replace(ControlChar.NON_BREAKING_HYPHEN_CHAR, '.');
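If the same replacement is needed in several places before matching, it could be wrapped in a small helper. The following is only a sketch: NonBreakingHyphenNormalizer is a hypothetical name, and it assumes that ControlChar.NON_BREAKING_HYPHEN_CHAR is the character U+001E. When Aspose.Words is on the classpath, using the library constant directly is the safer choice. The Unicode non-breaking hyphen U+2011 is handled as well, in case the text comes from a different source.

```java
// Hypothetical helper, not part of Aspose.Words.
final class NonBreakingHyphenNormalizer {
    // Assumed value of ControlChar.NON_BREAKING_HYPHEN_CHAR (Word's internal non-breaking hyphen).
    static final char WORD_NON_BREAKING_HYPHEN = '\u001e';
    // Unicode NON-BREAKING HYPHEN, handled too in case the text came from elsewhere.
    static final char UNICODE_NON_BREAKING_HYPHEN = '\u2011';

    // Returns the text with both kinds of non-breaking hyphen replaced by the given character.
    static String normalize(String text, char replacement) {
        return text
                .replace(WORD_NON_BREAKING_HYPHEN, replacement)
                .replace(UNICODE_NON_BREAKING_HYPHEN, replacement);
    }
}
```

A call such as NonBreakingHyphenNormalizer.normalize(paraText, '.') would then be applied to the trimmed paragraph text before it is compared with the expected string.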

So there's no way to get the right text directly, and I'll have to enumerate these cases in order to match them?

@Crane Aspose.Words returns the text that is in the document. As I have mentioned, in your document there is a non-breaking hyphen, so Aspose.Words returns a non-breaking hyphen in the text.

So in MS Word that point is not a decimal point, but a ControlChar.NON_BREAKING_HYPHEN_CHAR?

@Crane A decimal point in MS Word is a decimal point, and a non-breaking hyphen is a non-breaking hyphen. In your document there is a non-breaking hyphen between the STYLEREF and SEQ fields, not a decimal point.
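When two characters look alike on screen, one way to tell them apart is to print the code points of the returned string. The sketch below uses only the Java standard library; CodePointDump is a hypothetical helper name.

```java
import java.util.stream.Collectors;

// Hypothetical diagnostic helper, not part of Aspose.Words.
final class CodePointDump {
    // Renders every code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }
}
```

Calling CodePointDump.describe(paraText) on the paragraph text shows a period as U+002E. Assuming the non-breaking hyphen is returned as U+001E, it would show up as that code point instead.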

OK, thank you for your patience :heart:
