GetWordCount obsolete

In the overview of your documentation you state that "GetWordCount()" is obsolete, but you don't mention what it has been replaced with.

In any case, I tried it out (on your sample invoice, actually) and got 18 words as the result, which is not exactly correct.

Any hints?

Hi,

Thank you for considering Aspose.

I have checked the PdfExtractor.GetWordCount() method and it is working fine. The incorrect word count may be due to the evaluation version, which adds some garbage characters to the extracted text. If you have any further problems, please let us know.

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

It is difficult to get a word count for some languages (such as Chinese), so we gave up on this feature. You can extract the text and calculate the word count yourself, or use third-party tools.

Maybe it would help to write that into your wiki.

Here is my solution; maybe it will save someone some time:

//Requires System and System.IO, plus the Aspose namespace that provides PdfExtractor
private static int CountInPDF(MemoryStream stream){
    //Instantiate PdfExtractor object
    PdfExtractor extractor = new PdfExtractor();

    //Bind the input PDF document to extractor
    extractor.BindPdf(stream);

    //Extract text from the PDF document
    extractor.ExtractText();

    //Alternative: write the extracted text to a file instead
    //extractor.GetText(@"C:\tmp\text.txt");

    //Write the extracted text into an in-memory stream
    MemoryStream mem = new MemoryStream();
    extractor.GetText(mem);

    //Rewind the stream so the reader starts at the beginning
    mem.Seek(0, SeekOrigin.Begin);
    StreamReader reader = new StreamReader(mem);
    string text = reader.ReadToEnd();

    //Count the words ourselves instead of calling the obsolete GetWordCount method
    return CountWordsInString(text);
}

public static int CountWordsInString(string text){
    //Adjust this list to fit your needs
    char[] charsToSplit = new char[]{' ', ':', ';', ',', '.', '-', '\r', '\n', '\t'};

    //Split does pretty much all the work :-)
    string[] tmp = text.Split(charsToSplit, StringSplitOptions.RemoveEmptyEntries);

    return tmp.Length;
}
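
For anyone who wants to see how the splitting approach behaves before wiring it up to a PDF, here is a small stand-alone sketch. It needs no Aspose classes. The class name WordCountDemo and the sample strings are only illustrations (they are not from the posts above), and the counting logic is the same Split call with the same delimiter list.

```csharp
using System;

public static class WordCountDemo
{
    // Same delimiter-splitting approach as CountWordsInString above.
    public static int CountWordsInString(string text)
    {
        char[] charsToSplit = { ' ', ':', ';', ',', '.', '-', '\r', '\n', '\t' };
        return text.Split(charsToSplit, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    public static void Main()
    {
        // Plain English sentence: six words.
        Console.WriteLine(CountWordsInString("Hello, world: this is a test."));

        // '-' is in the delimiter list, so a hyphenated term counts as four words.
        Console.WriteLine(CountWordsInString("state-of-the-art"));

        // Chinese text has no spaces between words, so the whole run counts as one.
        Console.WriteLine(CountWordsInString("这是一个测试"));

        // Only delimiters: RemoveEmptyEntries leaves nothing, so the count is zero.
        Console.WriteLine(CountWordsInString(" \r\n\t"));
    }
}
```

The third case shows why a delimiter-based count is unreliable for languages such as Chinese: counting those properly would need a real word segmenter, which this sketch does not attempt. Whether hyphenated terms should count as one word or several depends on your needs; remove '-' from the list if you want them counted as one.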