I need to read existing PDF files, extract the text (English and Traditional Chinese), and load the text data into an array and/or a database using C#.
Can Pdf.Kit do that?
If not, which product can do that?
I wrote the above message too quickly and should have added more information:
I’ve seen this in the Aspose Help pages:
PdfExtractor extractor = new PdfExtractor();
extractor.Password = "";
My text file looks like this:
Warning:This is the evaluation version of Aspose.Pdf.Kit. The extraction will just be executed on parts of the pages. Please purchase your license to extract text correctly.Extracted with Aspose.Pdf.Kit Copyright 2002-2009 Aspose Pty Ltd.
1) I need to make sure UTF-8 characters are read correctly and can be used.
2) Ideally I need the data in an array [but I guess I could then read the text file into an array.]
The evaluation message appears because a valid license file is not being used. Once you purchase a license, it will no longer appear. You can also get a temporary license for testing purposes.
As far as text extraction is concerned, please note that the text is extracted in raw format and can’t retain the formatting present in the PDF file. As for UTF-8, it is only partially supported, so please try the component at your end to make sure that it works for you.
I’m afraid the extracted text can only be saved to a text file or a stream; you can then use your own code to move it into an array or any other object you like.
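As a rough illustration of that last step, here is a minimal sketch of loading the extracted text into an array using only standard .NET classes. It assumes the text was written out as UTF-8, which has not been verified for Pdf.Kit in this thread; the class name `ExtractedTextLoader` and the file name in the usage note are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Aspose.Pdf.Kit): turns already-extracted
// text into a string array, one element per non-blank line.
public static class ExtractedTextLoader
{
    // Reads a text file, assuming it is UTF-8 encoded.
    public static string[] ReadLines(string path)
    {
        var lines = new List<string>();
        foreach (string line in File.ReadAllLines(path, Encoding.UTF8))
        {
            if (line.Trim().Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines.ToArray();
    }

    // Same idea for a stream. The stream must be positioned at the start
    // of the text; it is closed when the reader is disposed.
    public static string[] ReadLines(Stream stream)
    {
        var lines = new List<string>();
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Trim().Length > 0)
                {
                    lines.Add(line);
                }
            }
        }
        return lines.ToArray();
    }
}
```

Usage would look something like `string[] lines = ExtractedTextLoader.ReadLines("extracted.txt");`. If the Chinese characters come out garbled, the file is probably not UTF-8 and a different `Encoding` would have to be passed instead.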
If you have any other questions, please do let me know.
1) Saving to a text file or stream won’t be a problem. Thanks.
2) The evaluation message isn’t a problem either. And I’m very happy to buy Aspose once I know it can extract Chinese characters well.
3) Pdf.kit only offers partial UTF-8 support. What does that mean? What’s the limitation?
4) Can Pdf.Kit open and extract the text from files that are copy-locked, i.e. files where you can’t copy the text when the PDF is opened in Acrobat?
5) Can it extract the text from password-locked PDF files, even if the password isn’t supplied?
6) If the answer to 4 and/or 5 is “no”, would Aspose’s PDF control allow me to create a new PDF with the locks removed, so that I could then extract the text from these new PDFs?
Please find the answers below:
2. Extraction of Chinese and Japanese characters is supported.
3. Some characters might not be extracted well, so you need to make sure that it works in your particular scenario. If you find any issues, do let us know and we’ll try to fix them.
4. No, this is currently not supported. I have logged this requirement as PDFKITNET-10788 in our issue tracking system.
5. No, this is currently not supported. I have logged this requirement as PDFKITNET-10786 in our issue tracking system.
6. I’m afraid this is not possible either. You’ll have to wait for 4 and 5: if they can be supported, you’ll be able to extract the text. Our team will look into both requirements and see whether we can support them. We’ll update you accordingly via this forum.
If you have any further questions, please do let us know.
I guess I’ll have to wait a long time for the new features.
We’ll try to provide the new features as early as possible; however, due to certain limitations, we can’t give you an ETA at the moment. Please allow us some time so that our team can look into these requirements.
We appreciate your patience.