How to do word count for a PDF document with Aspose.Pdf?
Thanks!
Hi Kai,
Thanks for contacting support.
To accomplish this, please try the following code snippet.
// Requires: using System; using System.Collections.Generic; using System.IO; using System.Text;
//           using Aspose.Pdf; using Aspose.Pdf.Devices; using Aspose.Pdf.Text;
// create a sample document with a single page of text
Document pdfDocument = new Document();
pdfDocument.Pages.Add();
pdfDocument.Pages[1].Paragraphs.Add(new TextFragment("Hello World"));
pdfDocument.ProcessParagraphs();
System.Text.StringBuilder builder = new System.Text.StringBuilder();
// string to hold extracted text
string extractedText = "";
foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // create text device
        TextDevice textDevice = new TextDevice();
        // set text extraction options - set text extraction mode (Raw or Pure)
        Aspose.Pdf.Text.TextOptions.TextExtractionOptions textExtOptions = new
            Aspose.Pdf.Text.TextOptions.TextExtractionOptions(Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        // convert a particular page and save text to the stream
        textDevice.Process(pdfPage, textStream);
        // close memory stream (ToArray still works on a closed MemoryStream)
        textStream.Close();
        // get text from memory stream
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}
// get the list of individual words with space as separator
IList<string> words = builder.ToString().Split(' ');
// print the count of words extracted from the PDF file
Console.WriteLine(words.Count);
Hi Nayyer,
Thanks!
We converted a Word file into PDF (using MS Word's "Save as PDF") and tested it with your solution. There is still a problem: the word count reported by MS Word is 2307, while the word count obtained with Aspose.Pdf is 2496.
We found that your solution appears to count some invalid characters, as shown in the screenshot (see attachment: PDF Question).
I also attached the Word file and the converted PDF file.
Our code is:
public void GetTotalWords(string filepath)
{
    Document pdfdoc = new Document(filepath);
    StringBuilder builder = new StringBuilder();
    string extractedText = "";
    int totalwords = 0;
    foreach (Page pdfPage in pdfdoc.Pages)
    {
        using (MemoryStream textStream = new MemoryStream())
        {
            //create text device
            TextDevice textDevice = new TextDevice();
            //set text extraction options - set text extraction mode (Raw or Pure)
            Aspose.Pdf.Text.TextOptions.TextExtractionOptions textExtOptions = new
                Aspose.Pdf.Text.TextOptions.TextExtractionOptions(Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Pure);
            textDevice.ExtractionOptions = textExtOptions;
            //convert a particular page and save text to the stream
            textDevice.Process(pdfPage, textStream);
            //close memory stream
            textStream.Close();
            //get text from memory stream
            extractedText = Encoding.Unicode.GetString(textStream.ToArray());
        }
        //Method 1
        builder.Append(extractedText);
        // Method 2
        var text = TrimString(extractedText);
        if (text.Length > 0)
            totalwords += base.RuleHandler.GetWordCount(text);
    }
    // get the list of individual word with space as separator
    IList<string> words = builder.ToString().Split(' ');
    // remove the space
    words = words.Where(i => !string.IsNullOrEmpty(i.Replace("\r\n", " ").Trim())).ToList();
    //Method 1 Result
    var totalwords1 = words.Count;
    // Method 2 Result
    var totalwords2 = totalwords;
}
private string TrimString(string text)
{
    return text.Replace("\v", "").Replace("\a", "").Trim();
}
public static int GetWordCount(string text)
{
    Regist();
    Document doc = new Document();
    DocumentBuilder builder = new DocumentBuilder(doc);
    builder.Write(text);
    doc.UpdateWordCount();
    return doc.BuiltInDocumentProperties.Words;
}
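One possible way to keep stray control characters (such as the "\v" and "\a" handled by TrimString above) from inflating the count is to split on any whitespace and count only tokens that contain at least one letter or digit. This is a minimal sketch using plain .NET only; the class name CleanWordCounter is hypothetical, and the letter-or-digit rule is an assumption about what should count as a word, not a confirmed explanation of the 2307 vs. 2496 difference.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of Aspose.Pdf.
public static class CleanWordCounter
{
    // Splits on any Unicode whitespace (this includes '\v') and counts
    // only tokens that contain at least one letter or digit, so tokens
    // made up solely of control characters or punctuation are skipped.
    public static int CountWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return 0;

        return text
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Count(token => token.Any(c => char.IsLetterOrDigit(c)));
    }
}
```

For example, CleanWordCounter.CountWords("Hello \a World \v - 2020") returns 3, because the lone "\a" and "-" tokens are not counted.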
Hi Kai,
Hi,
How is it going? We are expecting your solution…
Hi Kai,
Hi All,
I also tried to get the word count of a PDF. It works for me, but there is an issue: it only calculates the word count of the first line of the PDF.
Please find below the code I am using, as suggested earlier in this thread:
Document pdfdoc = new Document(filepath);
StringBuilder builder = new StringBuilder();
string extractedText = "";
int totalwords = 0;
foreach (Page pdfPage in pdfdoc.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        TextDevice textDevice = new TextDevice();
        Aspose.Pdf.Text.TextExtractionOptions textExtOptions = new
            Aspose.Pdf.Text.TextExtractionOptions(Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        textStream.Close();
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}
IList<string> words = builder.ToString().Split(' ');
var totalwords1 = words.Count;
}
}
Does Aspose support accurate word counting?
Please advise.
The API does not provide a dedicated word-counting mechanism. In the above code snippet, all text is extracted and then split on spaces via string manipulation. However, would you please share your sample PDF document with us? We will test the scenario in our environment and address it accordingly.
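To illustrate the string-manipulation step on its own, here is a minimal sketch that counts words in text that has already been extracted. It uses plain .NET only; the class name WordCounter is hypothetical, and treating every run of non-whitespace characters as one word is an assumption, so the result may differ from the count reported by MS Word.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Pdf.
public static class WordCounter
{
    // Counts maximal runs of non-whitespace characters, so spaces,
    // tabs, and line breaks all act as word separators.
    public static int CountWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return 0;

        return Regex.Matches(text, @"\S+").Count;
    }
}
```

For example, WordCounter.CountWords("Simple PDF\r\nFile  2") returns 4, whereas splitting the same string on a single space would yield "Simple", "PDF\r\nFile", an empty entry, and "2".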
PdfProof1.pdf (55.8 KB)
sample.pdf (3.0 KB)
Thanks, Asad. I am sharing the files I tried for the word count. These are very basic PDF files with just some text. Only the first line is being considered.
We tested the scenario in our environment and were unable to reproduce the issue. Please make sure that you are using a valid license with the API. We used the following code snippet for testing and received the correct word count:
Document pdfdoc = new Document(dataDir + "sample.pdf");
StringBuilder builder = new StringBuilder();
string extractedText = "";
int totalwords = 0;
foreach (Page pdfPage in pdfdoc.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        TextDevice textDevice = new TextDevice();
        Aspose.Pdf.Text.TextExtractionOptions textExtOptions = new Aspose.Pdf.Text.TextExtractionOptions(Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        textStream.Close();
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}
string[] words = builder.ToString().Split(new[] { '\r', '\n', ' ' }).Where(x => x != String.Empty).ToArray();
totalwords = words.Count();
Hi Asad, we are using a valid license. The current issue is that the word count covers only the first line.
Below is the extracted text that it considers:
"Evaluation Only. Created with Aspose.PDF. Copyright 2002-2020 Aspose Pty Ltd.
Simple PDF File 2 " with word count 28.
However, the file contains a large amount of additional text that is not extracted at all, so it is never considered for the word count.
We were unable to reproduce this behavior on our side. Would you kindly share a sample console application that reproduces the issue? We will test the scenario again in our environment and share our feedback with you.