How to get the text on each PDF page

sinfulmonk · July 21, 2007, 2:37pm

How do I obtain the extracted text for each page of a PDF file WITHOUT having to save it to a text file first? Please see the following code:

PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(InputFile);
extractor.ExtractText();

//Save the extracted text to a text file
extractor.GetText(ExtractedTextFile);

At the step where I save the extracted text to a text file, I would like to view the string that stores the extracted text instead of saving it to a file. Please let me know if there's a way of doing it.

GeorgieYuan · July 21, 2007, 7:36pm

Hi,

You can use HasNextPageText() and GetNextPageText() to write each page's text to a file or a stream. The code looks like this:

PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(InputFile);
extractor.ExtractText();

MemoryStream ms = new MemoryStream();
while (extractor.HasNextPageText())
{
    extractor.GetNextPageText(ms);
}

Remember: after each cycle of the loop, copy the contents of ms into the string you want to view.

Best Regards.

sinfulmonk · July 21, 2007, 11:37pm

I can call Convert.ToBase64String(ms.ToArray()); to get a string, but I don't know how to get the actual text that's in the MemoryStream ms. Could you please give me some sample code that returns the actual text instead of the byte string? Thanks.

GeorgieYuan · July 22, 2007, 4:19am

You can try this code:

-----------------------------------
MemoryStream ms = …
ms.Position = 0;

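// Decode the stream's bytes as text using the system default encoding
StreamReader sr = new StreamReader(ms, System.Text.Encoding.Default);
string result = "";
string tempString = sr.ReadLine();

// Append one line at a time until the end of the stream
// (ReadLine() returns each line without its line terminator)
while (tempString != null)
{
    result += tempString;
    tempString = sr.ReadLine();
}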

System.Console.WriteLine(result);
------------------------------------
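
If you want to keep the line breaks, here is a minimal sketch that reads the whole stream in one call with StreamReader.ReadToEnd(). It uses only standard .NET classes. StreamText and ReadAll are hypothetical names made up for this example, not part of PdfExtractor, and the in-memory text in Main is only a stand-in for a stream filled by GetNextPageText(ms). Which encoding the extracted text uses is an assumption here, so pass whichever one matches your output.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamText
{
    // Hypothetical helper: decodes everything in the stream as text.
    // Unlike a ReadLine() loop, ReadToEnd() keeps the line terminators.
    public static string ReadAll(MemoryStream ms, Encoding encoding)
    {
        ms.Position = 0; // rewind, because writing leaves the position at the end

        // Disposing the reader also closes ms, so call this once per stream.
        using (StreamReader reader = new StreamReader(ms, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in data; in real use ms would be the stream passed to GetNextPageText(ms).
        MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes("first line\nsecond line"));

        // UTF-8 matches the stand-in bytes above; the extractor's encoding may differ.
        string text = StreamText.ReadAll(ms, Encoding.UTF8);
        Console.WriteLine(text);
    }
}
```

To get one string per page, you could create a new MemoryStream inside the while loop and call the helper once per pass.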

Best Regards.

sinfulmonk · July 22, 2007, 10:05am

It works, thanks. Now I just need to know how to extract the text from a PDF file without the extra 9s prefix in the previous post (if you can also show me the DOS command to use Adobe to convert a PDF to a text file, that would be nice). Thanks again for the quick reply, you guys rock!!!

GeorgieYuan · July 22, 2007, 8:27pm

Hi,

We have tested it with the PDF you gave us and found that the extra 9s prefix really is part of the content of the PDF file! If you open the PDF file with Adobe Reader, select the 175.00 together with the spacing before it, and copy the selection to a text editor, you will find that there is -9,999,999,999, before 175.00, although it's not visible.

As for the second question: as far as I know, Adobe doesn't provide a command-line tool to extract text from PDF files.

Best Regards.