Text Extraction Going Wrong with One File

Extracting text from the attached file shows text that isn’t in the document. It doesn’t happen on any other files, so I’d be interested to know if something is going wrong with Aspose.

I use code along these lines (not showing actual code but alternative code so you can replicate the issue):

// Collect every paragraph in the document, including those nested in tables, etc.
NodeCollection nodeColl = doc.GetChildNodes(NodeType.Paragraph, true);
for (int i = 0; i < nodeColl.Count; i++)
{
    Paragraph para = (Paragraph)nodeColl[i];
    string strParagraphText = para.Range.Text;
}

When that gets to paragraph number 101, the text should be “Conclusion”. Instead, it’s “Consultation Include views of relevant Overview and Scrutiny Committee, regulatory committee(s), Area Forum(s), Ward Member(s). Proposals relating to the budget and policy framework must include details of the nature and extent of consultation with stakeholders and relevant overview and scrutiny committees and outcome thereof.”

There are no comments, tracked changes, footnotes or other items that I can see which would cause this. Where can it be getting all that extra text from?

Thanks,

Daniel

Hi Daniel,

Thanks for your inquiry. I suggest you use the Node.ToString(SaveFormat) method. This method exports the content of the node into a string in the specified format.

Secondly, I have not found any paragraph that contains only the text ‘Conclusion’. Please see the attached image: the paragraph break mark comes after the word ‘thereof’. I hope this answers your query. Please let us know if you have any more queries.

Hi Tahir,
I’ve figured out the reason you are seeing different text to me. It’s because there is hidden text. That’s why Aspose is returning text that does not appear to be there in my document. If I show the hidden text then I get the same result as you.

Is there a way to stop hidden text from appearing in the Node.ToString (SaveFormat) method? I’m using Aspose 13.5 and the Node.ToString (SaveFormat) method works perfectly. However, it includes all hidden text and I would prefer things to appear as they do in the document (i.e. without hidden text).

Thanks,

Daniel

Hi Daniel,

Thanks for your inquiry. Please use the Font.Hidden property to detect hidden text in the document. It returns true if the font is formatted as hidden text.

In your case, I suggest you first remove the hidden text from the document and then get the text of each Paragraph, as shown in the following code snippet. I hope this helps. Please let us know if you have any more queries.

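Document doc = new Document(MyDir + "Hidden.doc");
// Remove every run formatted as hidden text. ToArray() takes a snapshot so that
// removing nodes does not disturb the live collection while it is being enumerated.
foreach (Run run in doc.GetChildNodes(NodeType.Run, true).ToArray())
{
    if (run.Font.Hidden)
        run.Remove();
}
// Read the remaining (visible) text of each paragraph.
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    string text = para.ToString(SaveFormat.Text);
}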
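
If you would rather leave the document untouched, a possible alternative is to build each paragraph's text from only those runs that are not hidden. The following is a minimal sketch, not a definitive implementation: the class and helper names (VisibleText, GetVisibleText) are made up for illustration, and because it only concatenates Run text, its output may differ from para.ToString(SaveFormat.Text) for content such as fields or paragraph break characters.

```csharp
using System.Text;
using Aspose.Words;

static class VisibleText
{
    // Hypothetical helper: returns the text of a paragraph built only from
    // runs whose font is not formatted as hidden. The document is not modified.
    public static string GetVisibleText(Paragraph para)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Run run in para.GetChildNodes(NodeType.Run, true))
        {
            // Skip any run marked as hidden text.
            if (!run.Font.Hidden)
                sb.Append(run.Text);
        }
        return sb.ToString();
    }
}
```

You could then call VisibleText.GetVisibleText(para) inside the paragraph loop in place of para.ToString(SaveFormat.Text).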

Perfect, thanks Tahir.

Cheers,

Daniel

Hi Daniel,

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.