Free Support Forum - aspose.com

Support for Unicode text

Hi,

Does Aspose support text extraction from PDF/DOC/DOCX for Unicode text (e.g. Chinese, Thai, Arabic, Vietnamese, etc.)?

Hi,

Thanks for your inquiry. I'm a representative from the Aspose.Words team. Yes, Aspose.Words supports Unicode text extraction from MS Word documents (e.g. DOC/DOCX files). I would suggest reading the following article:

Moreover, you can save a Word document directly into TXT format by using the following code snippet:

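using System.Text;
using Aspose.Words;
using Aspose.Words.Saving;

// Load the source Word document
Document doc = new Document(@"c:\test\in.docx");

// Save as plain text, encoded as UTF-16 (Encoding.Unicode)
TxtSaveOptions saveOptions = new TxtSaveOptions();
saveOptions.SaveFormat = SaveFormat.Text;
saveOptions.Encoding = Encoding.Unicode;
doc.Save(@"c:\test\out.txt", saveOptions);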
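A side note for anyone reading out.txt back from Java: Encoding.Unicode in .NET is UTF-16 little-endian, so the file has to be decoded as UTF-16 rather than UTF-8. Below is a minimal sketch using only the Java standard library. It assumes the saved file starts with the FF FE byte-order mark, and it builds that byte layout in memory instead of touching a real file. The class and method names are hypothetical and are not part of any Aspose API.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ReadBack {
    // Decodes bytes that start with a UTF-16 byte-order mark (BOM).
    // Java's "UTF-16" charset uses the BOM to pick the byte order
    // and does not include the BOM in the decoded string.
    public static String decodeUtf16(byte[] fileBytes) {
        return new String(fileBytes, StandardCharsets.UTF_16);
    }

    // Builds the byte layout assumed for out.txt:
    // BOM (FF FE) followed by the text in UTF-16 little-endian.
    public static byte[] simulateSavedFile(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] file = new byte[body.length + 2];
        file[0] = (byte) 0xFF;
        file[1] = (byte) 0xFE;
        System.arraycopy(body, 0, file, 2, body.length);
        return file;
    }

    public static void main(String[] args) {
        // Chinese, Thai and Arabic sample words, written as Unicode escapes
        String sample = "\u4E2D\u6587 \u0E44\u0E17\u0E22 \u0639\u0631\u0628\u064A";
        byte[] file = simulateSavedFile(sample);
        // Prints true when the text survives the round trip unchanged
        System.out.println(decodeUtf16(file).equals(sample));
    }
}
```

With a real file, the byte array would come from java.nio.file.Files.readAllBytes on the path of the saved text file.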

Please let us know if you need more information; we are always glad to help.

Best Regards,

Hi!

Thanks for the quick reply! Good to know Unicode text can be extracted from DOC/DOCX files. How about PDF files?

Hello Wei Li,


Thanks for your interest in our products.

I am pleased to share that Aspose.Pdf for .NET can extract Chinese, Arabic and other Unicode text from PDF files. Please visit the following link for more information: Extract Text from all the Pages using Text Device

Please try the latest release, Aspose.Pdf for .NET 6.7.0. If you encounter any issue or have any further query, please feel free to contact us.

Hi!


Thanks for the reply again! Good to know that Unicode extraction from both PDF and Word files is supported by Aspose. Just to confirm, is the same functionality available in the Java libraries as well?


Hi,


Thanks for your inquiry.

Yes, the same can be achieved with the latest version of Aspose.Words for Java. To clarify a bit, the latest version of Aspose.Words for Java is completely auto-ported from .NET, i.e. we do not write the code of Aspose.Words for Java by hand; it is generated automatically from the C# code of Aspose.Words for .NET. So there should not be any significant difference in functionality between the Java and .NET versions, because the code is mostly the same.

If we can help you with anything else, please feel free to ask.

Best Regards,

Hello Wei Li,

We have a product named Aspose.Pdf.Kit for Java, which can manipulate and edit existing PDF documents in a Java environment. It can also extract Unicode text from a PDF file. Please note that when extracting Unicode characters, you need to specify the encoding. The following code snippet shows how to do this. I would also suggest having a look at Extract Text from PDF Document

[Java]

// Instantiate the PdfExtractor object
com.aspose.pdf.kit.PdfExtractor extractor = new com.aspose.pdf.kit.PdfExtractor();
// Bind the input PDF document to the extractor
extractor.bindPdf("D:\\pdftest\\UniCodeText.pdf");
// Extract text from the input PDF document using a specific encoding
extractor.extractText("UniCode");
// Save the extracted text to a text file
extractor.getText("D:\\pdftest\\ChineseText-text.txt");
// Close the PdfExtractor object
extractor.close();
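To illustrate why the encoding has to be specified, here is a small sketch that uses only the Java standard library and no Aspose classes. It encodes a sample string once with UTF-16 and once with ISO-8859-1, then decodes it again. The class name, method name and sample string are hypothetical and chosen only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingChoice {
    // Encodes the text with the given charset and decodes it again
    // with the same charset. Characters the charset cannot represent
    // are replaced during encoding and cannot be recovered.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Chinese, Thai and Arabic sample words, written as Unicode escapes
        String extracted = "\u4E2D\u6587 \u0E44\u0E17\u0E22 \u0639\u0631\u0628\u064A";

        // UTF-16 can represent every character, so this prints true
        System.out.println(roundTrip(extracted, StandardCharsets.UTF_16).equals(extracted));

        // ISO-8859-1 covers only Latin-1 characters, so this prints false
        System.out.println(roundTrip(extracted, StandardCharsets.ISO_8859_1).equals(extracted));
    }
}
```

The same reasoning applies to the output text file: if it is written in an encoding that cannot represent the extracted characters, they are lost at write time.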

If you have any further query, please feel free to contact us. For your reference, I have also attached the sample PDF document and the resulting text file that I generated using Aspose.Pdf.Kit for Java 4.1.0.