How to do word count for a PDF document with Aspose.Pdf?

victor.wangkai · July 25, 2016, 1:40am

Thanks!

codewarior · July 25, 2016, 7:13am

Hi Kai,

Thanks for contacting support.

To accomplish this, please try the following code snippet.

// create a sample document with a single page of text
Document pdfDocument = new Document();
pdfDocument.Pages.Add();
pdfDocument.Pages[1].Paragraphs.Add(new TextFragment("Hello World"));
pdfDocument.ProcessParagraphs();

System.Text.StringBuilder builder = new System.Text.StringBuilder();
// string to hold extracted text
string extractedText = "";

foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // create text device
        TextDevice textDevice = new TextDevice();

        // set text extraction options - set text extraction mode (Raw or Pure)
        Aspose.Pdf.Text.TextOptions.TextExtractionOptions textExtOptions = new
            Aspose.Pdf.Text.TextOptions.TextExtractionOptions(Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;

        // convert a particular page and save text to the stream
        textDevice.Process(pdfPage, textStream);

        // close memory stream
        textStream.Close();

        // get text from memory stream
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}

// get the list of individual words with space as separator
IList<string> words = builder.ToString().Split(' ');

// print the count of words extracted from PDF file
Console.WriteLine(words.Count);

victor.wangkai · July 28, 2016, 11:48pm

Hi Nayyer,

Thanks!

We converted a Word file to PDF (using MS Word's “Save as PDF”) and tested it with your solution. There is still a problem: MS Word reports 2307 words, while the count obtained with Aspose.Pdf is 2496.

We found that your solution may be counting some invalid characters, as shown in the screenshot (see attachment: PDF Question).

I also attached the Word file and the converted PDF file.

Our code is:

public void GetTotalWords(string filepath)
{
    Document pdfdoc = new Document(filepath);
    StringBuilder builder = new StringBuilder();
    string extractedText = "";
    int totalwords = 0;

    foreach (Page pdfPage in pdfdoc.Pages)
    {
        using (MemoryStream textStream = new MemoryStream())
        {
            // create text device
            TextDevice textDevice = new TextDevice();

            // set text extraction options - set text extraction mode (Raw or Pure)
            Aspose.Pdf.Text.TextOptions.TextExtractionOptions textExtOptions = new
                Aspose.Pdf.Text.TextOptions.TextExtractionOptions(Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Pure);
            textDevice.ExtractionOptions = textExtOptions;

            // convert a particular page and save text to the stream
            textDevice.Process(pdfPage, textStream);

            // close memory stream
            textStream.Close();

            // get text from memory stream
            extractedText = Encoding.Unicode.GetString(textStream.ToArray());
        }

        // Method 1
        builder.Append(extractedText);

        // Method 2
        var text = TrimString(extractedText);
        if (text.Length > 0)
            totalwords += base.RuleHandler.GetWordCount(text);
    }

    // get the list of individual words with space as separator
    IList<string> words = builder.ToString().Split(' ');

    // remove empty and whitespace-only entries
    words = words.Where(i => !string.IsNullOrEmpty(i.Replace("\r\n", " ").Trim())).ToList();

    // Method 1 Result
    var totalwords1 = words.Count;

    // Method 2 Result
    var totalwords2 = totalwords;
}

private string TrimString(string text)
{
    return text.Replace("\v", "").Replace("\a", "").Trim();
}

public static int GetWordCount(string text)
{
    Regist();
    Document doc = new Document();
    DocumentBuilder builder = new DocumentBuilder(doc);
    builder.Write(text);
    doc.UpdateWordCount();
    return doc.BuiltInDocumentProperties.Words;
}
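As a minimal sketch of the cleanup idea behind TrimString above: instead of listing individual characters such as "\v" and "\a", one could drop every control character that is not whitespace before counting. The class and method names below are hypothetical and are not part of Aspose.Pdf; whether this closes the gap between the two counts reported above is not known.

```csharp
using System;
using System.Text;

public static class ExtractedTextCleaner
{
    // Hypothetical helper (not an Aspose API): removes control characters
    // such as '\a' while keeping whitespace (spaces, tabs, line breaks)
    // so that word boundaries survive for a later split.
    public static string StripControlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        var cleaned = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Keep the character if it is not a control character,
            // or if it is a whitespace control character like '\t' or '\n'.
            if (!char.IsControl(c) || char.IsWhiteSpace(c))
                cleaned.Append(c);
        }
        return cleaned.ToString();
    }
}
```

The cleaned string can then be passed to whichever counting method is in use, for example in place of the TrimString call.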

codewarior · August 1, 2016, 2:42am

Hi Kai,

Thanks for using our APIs.

I tested the scenario and was able to reproduce the problem. I have logged it as PDFNET-41222 in our issue tracking system. We will look into the details and keep you posted on the status of the fix. Please allow us some time. We are sorry for the inconvenience.

victor.wangkai · October 28, 2016, 1:54am

Hi,

How is it going? We are expecting your solution…

codewarior · October 30, 2016, 3:26pm

Hi Kai,

Thanks for your patience.

I am afraid the reported issue is still pending review and has not been resolved yet. The product team will investigate and fix it according to the development schedule, and we will let you know as soon as we have a definite update on its resolution. Please allow us some time. We are sorry for the delay and inconvenience.

anand2112 · October 29, 2020, 3:27am

Hi All,

I also tried to get the word count of a PDF. It works, but it counts only the words in the first line of the PDF.
Please find below the code I am trying, as suggested earlier in this thread:
Document pdfdoc = new Document(filepath);
StringBuilder builder = new StringBuilder();
string extractedText = "";
int totalwords = 0;

foreach (Page pdfPage in pdfdoc.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        TextDevice textDevice = new TextDevice();
        Aspose.Pdf.Text.TextExtractionOptions textExtOptions = new
            Aspose.Pdf.Text.TextExtractionOptions(Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        textStream.Close();
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}

IList words = builder.ToString().Split(' ');
var totalwords1 = words.Count;

Does Aspose support accurate word counting?
Please advise.

asad.ali · October 29, 2020, 8:13pm

@anand2112

The API does not provide an actual word counting mechanism. In the code snippet above, all text is extracted and then split on spaces via string manipulation. Would you please share your sample PDF document with us? We will test the scenario in our environment and address it accordingly.
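To illustrate the point about string manipulation, here is a small sketch that uses only the .NET base library, with no Aspose calls. It contrasts splitting on a single space with splitting on any whitespace. The class and method names are hypothetical, and neither method should be expected to match the count MS Word reports.

```csharp
using System;

public static class WordCountSketch
{
    // Splits on the space character only. Words separated by a line break
    // stay glued together, and repeated spaces produce empty entries.
    public static int CountBySpace(string text)
    {
        return text.Split(' ').Length;
    }

    // Splits on any whitespace (a null separator array means "whitespace"
    // for string.Split) and drops empty entries.
    public static int CountByWhitespace(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

For the input "one two\r\nthree four", CountBySpace returns 3 because "two\r\nthree" is treated as one entry, while CountByWhitespace returns 4.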

anand2112 · October 30, 2020, 6:52am

PdfProof1.pdf (55.8 KB)
sample.pdf (3.0 KB)

Thanks, Asad. I am sharing the files I used for the word count. These are very basic PDF files containing just some text. Only the first line is being counted.

asad.ali · November 1, 2020, 7:22pm

@anand2112

We tested the scenario in our environment and did not notice any issue. Would you kindly make sure that you are using a valid license with the API? We used the following code snippet for testing and received the correct word count:

Document pdfdoc = new Document(dataDir + "sample.pdf");

StringBuilder builder = new StringBuilder();
string extractedText = "";
int totalwords = 0;

foreach (Page pdfPage in pdfdoc.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        TextDevice textDevice = new TextDevice();
        Aspose.Pdf.Text.TextExtractionOptions textExtOptions = new Aspose.Pdf.Text.TextExtractionOptions(Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        textStream.Close();
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}
string[] words = builder.ToString().Split(new[] { '\r', '\n', ' ' }).Where(x => x != String.Empty).ToArray();
totalwords = words.Count();

anand2112 · November 2, 2020, 4:03am

Hi Asad, we are using a valid license. The current issue is that only the first line is included in the word count.
Below is the extracted text that is being counted:
"Evaluation Only. Created with Aspose.PDF. Copyright 2002-2020 Aspose Pty Ltd.
Simple PDF File 2 " with word count 28.
However, the file contains a large amount of additional text that is not extracted at all, so it is not included in the word count.
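The quoted text begins with an "Evaluation Only" notice, which may indicate that the library ran without an applied license at the time of extraction. As a hedged sketch, a caller could check the extracted text for that notice before trusting a count. The helper below is hypothetical, is not an Aspose API, and simply searches for the marker string copied from the output quoted above.

```csharp
using System;

public static class EvaluationCheck
{
    // Assumption: the marker is the phrase seen in the extracted text
    // quoted in this thread; other versions may word it differently.
    private const string Marker = "Evaluation Only";

    // Returns true when the extracted text contains the evaluation notice.
    public static bool LooksLikeEvaluationOutput(string extractedText)
    {
        return extractedText != null &&
               extractedText.IndexOf(Marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

If this check returns true, it would be worth confirming that the license is applied before the Document is created.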

asad.ali · November 2, 2020, 6:48pm

@anand2112

We were unable to reproduce this behavior on our side. Would you kindly share a sample console application that reproduces the issue? We will test the scenario again in our environment and share our feedback with you.