Best way to get body text from PDF

Hi there,

I am working on an application that needs the most accurate and complete text representation of a PDF file. What is the best way, using Aspose (we have a licence for Aspose.Total), to take a PDF file and convert it to a single string containing all the text in the document? I am using C# and .NET.

Thanks in advance!

Hi Aneumann,

You can use the PdfExtractor class of the Aspose.Pdf.Kit component to extract text from a PDF file. The extracted text can be saved either to a text file or to a stream object, which you can then read back and handle as a string.

The following code sample should give you an idea:

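// Bind the extractor to the source PDF.
PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(Server.MapPath("~/App_Data/test.pdf"));

// Extract the text from the bound document.
extractor.ExtractText();

// Save the extracted text to a text file.
extractor.GetText(Server.MapPath("~/App_Data/test.txt"));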
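
Since the goal is a string rather than a file, here is a minimal sketch of the stream route described in this reply. It assumes that PdfExtractor lives in the Aspose.Pdf.Kit namespace and that GetText accepts a Stream; please check both against your installed version. PdfTextHelper, ExtractAllText and DecodeText are hypothetical names used only for illustration, and the encoding of the extracted bytes is also an assumption, so it is passed in as a fallback and a byte order mark is honoured if one is present.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf.Kit; // assumption: namespace that contains PdfExtractor

// Hypothetical helper for illustration; not part of the Aspose API.
public static class PdfTextHelper
{
    // Extracts the text of a PDF into memory and returns it as a string.
    public static string ExtractAllText(string pdfPath, Encoding fallbackEncoding)
    {
        PdfExtractor extractor = new PdfExtractor();
        extractor.BindPdf(pdfPath);
        extractor.ExtractText();

        using (MemoryStream buffer = new MemoryStream())
        {
            // Assumption: GetText has a Stream overload, as the reply describes.
            extractor.GetText(buffer);

            // ToArray works even if the extractor has already closed the stream.
            return DecodeText(buffer.ToArray(), fallbackEncoding);
        }
    }

    // Decodes raw bytes to a string. A byte order mark, if present, takes
    // precedence over the fallback encoding and is not included in the result.
    public static string DecodeText(byte[] bytes, Encoding fallbackEncoding)
    {
        using (MemoryStream input = new MemoryStream(bytes))
        using (StreamReader reader = new StreamReader(input, fallbackEncoding, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call might look like string text = PdfTextHelper.ExtractAllText(Server.MapPath("~/App_Data/test.pdf"), Encoding.Unicode); if symbols come back garbled, the fallback encoding is the first thing to try changing.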

I hope this helps. If you need any further help, please let us know.

Regards,

Thanks for the reply. PdfExtractor with the GetText method is the best approach I've found so far, but it has some drawbacks:

  1. GetText fails to handle non-ASCII characters (such as copyright symbols, bullet points, and Chinese and Japanese characters).
  2. GetText doesn't return ALL the text in the PDF; some captions and text within drawings are missing.

For my application, I need to get every piece of text from a PDF, in the order it appears in the document.

Thanks again!

Hi,

1) The current version of Aspose.Pdf.Kit for .Net does not support extracting non-English text. Our development team is working hard on this feature, but I am afraid we cannot deliver it in the short term. For more information, please see "Known Issues in Aspose.Pdf.Kit for .Net".

Aspose.Pdf.Kit for Java does support extracting non-English text, although Unicode extraction does not work well. For more information on its limitations, please visit http://www.aspose.com/documentation/file-format-components/aspose.pdf.kit-for-.net-and-java/known-issues-in-aspose-pdf-kit-for-java.html

2) Please share some details about the drawing objects in your PDF file. Aspose.Pdf.Kit does not currently support extracting text from an image. If that is not the case here, please share the PDF file so that we can test the issue at our end.

I can provide some sample files, but they are customer files. Have you got a secure FTP server and email address I can contact you at directly?

Thanks!

Hi,

You can upload the files within this forum thread and mark the thread as private, so that no one other than Aspose staff can access them. If you are still not comfortable uploading the files here, please see "How to send a license?" for information on how to share files with Aspose staff by sending them to their mail accounts.

If you have any further questions, please feel free to contact us.