Free Support Forum - aspose.com

Encodings and extracting text from PDFs


I am using Aspose.Pdf.Kit 2010.03.26 to extract text from PDF documents with the following code:

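string tempFile = Path.GetTempFileName();

Aspose.Pdf.Kit.License lic = new Aspose.Pdf.Kit.License();
Aspose.Pdf.Kit.PdfExtractor extractor = new Aspose.Pdf.Kit.PdfExtractor();
extractor.BindPdf(@"Lucene Query Parser Syntax.pdf");
// Not shown in the original post; presumably the text is extracted
// and written to the temp file at this point:
extractor.ExtractText();
extractor.GetText(tempFile);

// Read the extracted text back using StreamReader's default encoding.
string text;
using (StreamReader reader = new StreamReader(tempFile))
{
    text = reader.ReadToEnd();
}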

The text is coming in with spaces between all characters (e.g. "H e l l o" instead of "Hello"). This was not the case when using Aspose.Pdf.Kit 2009.8.10. Has something changed in the PdfExtractor class in relation to encodings?

When I construct the StreamReader with Encoding.Unicode, the text comes in correctly, as it did before. Attached is a sample PDF (but the behaviour is the same for all PDFs I have tried).
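
The symptom can be reproduced without Aspose at all. This is a minimal sketch, assuming the extractor now writes the text as UTF-16 (Encoding.Unicode) without a byte order mark; that assumption is not confirmed anywhere in this thread. The class and method names (EncodingMismatchDemo, RoundTrip) are made up for illustration. It writes "Hello" as UTF-16 bytes, then reads the file back once with StreamReader's default encoding (UTF-8) and once with Encoding.Unicode.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingMismatchDemo
{
    // Writes `text` as UTF-16LE bytes with no byte order mark, then reads it
    // back with `readEncoding` (or StreamReader's default, UTF-8, when null).
    public static string RoundTrip(string text, Encoding readEncoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            // Encoding.GetBytes does not emit a BOM, so StreamReader
            // has nothing to auto-detect the encoding from.
            File.WriteAllBytes(path, Encoding.Unicode.GetBytes(text));

            using (StreamReader reader = readEncoding == null
                ? new StreamReader(path)
                : new StreamReader(path, readEncoding))
            {
                return reader.ReadToEnd();
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string asDefault = RoundTrip("Hello", null);
        string asUnicode = RoundTrip("Hello", Encoding.Unicode);

        // Each ASCII letter is followed by a NUL character when UTF-16
        // bytes are decoded as UTF-8; show the NULs as spaces.
        Console.WriteLine(asDefault.Replace('\0', ' '));
        Console.WriteLine(asUnicode);
    }
}
```

If the assumption holds, the "spaces" between characters are really NUL characters (the high byte of each UTF-16 code unit), which is why passing Encoding.Unicode to the StreamReader makes them disappear.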


Hi James,

The text extraction process was changed with the release of version 4.2.0. You can find the details and a sample at this link.

I hope this helps. If you still run into any issues or have more questions, please let us know.

Thanks for that.