Can Aspose correctly extract Chinese, Vietnamese, and Thai text from PDF and MS Word files?
Hi
Thanks for your request. I am a representative of the Aspose.Words team. Yes, you can extract text in different languages from MS Word documents using Aspose.Words. Please follow this link to learn how to extract text: Aspose.Words text extraction.
Best regards,
Hi,
I’m a representative of Aspose.Pdf.Kit, and I would like to share with you that Aspose.Pdf.Kit for Java allows you to extract text in different languages from a PDF file. You may find a sample in this article. Please download the latest version and try it at your end. Please also note that text extraction is quite limited in evaluation mode, so you may want to get a 30-day temporary license to test this feature fully.
I hope this helps. If you run into any issues or have further questions, please do let us know.
Regards,
Does Aspose.Pdf.Kit for .NET also allow us to extract text in different languages from a PDF file?
We’ve tried the sample from the .NET article ([Aspose.Total for .NET|Documentation]), but it doesn’t work with Arabic text.
Sample code:
//create an instance of PdfExtractor class
Aspose.Pdf.Kit.PdfExtractor extractor = new Aspose.Pdf.Kit.PdfExtractor();
//bind PDF file with the extractor object
extractor.BindPdf(@"D:\Text\text.pdf");
//extract all text from the PDF
extractor.ExtractText(Encoding.UTF8);
//save extracted text in a text file
extractor.GetText(@"D:\Text\text.txt");
//end of sample code
Remark: here is the product information.
Product name: Aspose.Pdf.Kit
File version: 5.5.0.0
License: Aspose.Total
Hi Nuch,
First of all, please try the code given in method 2 on this page.
Secondly, we have released a merged Aspose.PDF for .NET, which contains the features of both Aspose.Pdf for .NET and Aspose.Pdf.Kit for .NET. You only need to reference the Aspose.Pdf.Facades namespace in order to use the same features available in Aspose.Pdf.Kit for .NET.
Please try it at your end and see if it helps. If you have any further questions, please do let us know.
Regards,
Thanks for your support.
However, I've already followed both steps, and it still doesn't work.
Sample Code:
// use Aspose.Pdf.dll (v.6.0.0)
using Aspose.Pdf.Facades;
//open input PDF
PdfExtractor pdfExtractor = new PdfExtractor();
pdfExtractor.BindPdf("input.pdf");
//use parameterless ExtractText method
pdfExtractor.ExtractText();
MemoryStream tempMemoryStream = new MemoryStream();
pdfExtractor.GetText(tempMemoryStream);
string text = "";
//specify Unicode encoding type in StreamReader constructor
using (StreamReader streamReader = new StreamReader(tempMemoryStream, Encoding.Unicode))
{
    streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
    text = streamReader.ReadToEnd();
}
File.WriteAllText("output.txt", text, Encoding.UTF8);
Error#1:
Message = "Inflating error. Please check the following message: incorrect data check"
Source = "Aspose.Pdf"
StackTrace = "at Aspose.Pdf.Engine.Filters.Impls.FlateDecode.ThirdParty.ZInflaterInputStream.Read(Byte[] b, Int32 off, Int32 len)
at Aspose.Pdf.Engine.Filters.Impls.FlateDecode.FlateDecode.Decode(Byte[] data, Object[] parameters)
at Aspose.Pdf.Engine.Filters.CompositeDecoder.Decode(Byte[] data, Object[] parameters)
at Aspose.Pdf.Engine.Data.Types.PdfStreamAccessor.get_DecodedData()
at Aspose.Pdf.Engine.CommonData.PageContent.ContentBuilder.InitWriteContent()
at Aspose.Pdf.Engine.CommonData.PageContent.ContentBuilder..ctor(IPage page)
at Aspose.Pdf.Engine.Factory.PdfInternalFactory.CreateContentBuilder(IPage page)
at Aspose.Pdf.Engine.CommonData.PageTreeNode.get_ContentBuilder()
at Aspose.Pdf.Resources..ctor(Object parent)
at Aspose.Pdf.Page..ctor(IPage page)
at Aspose.Pdf.PageCollection.get(Int32 index)
at Aspose.Pdf.PageCollection.get_Item(Int32 index)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()"
Error#2:
Message = "Value was either too large or too small for an Int32."
Source = "mscorlib"
StackTrace = " at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
at System.Int32.Parse(String s, NumberStyles style)
at Aspose.Pdf.Engine.CommonData.Text.Encoding.CMapEncoding.Decode(Char value)
at Aspose.Pdf.Engine.CommonData.Text.Encoding.PdfFontEncoding.CIDFontEncodingBase.PdfBytesToUnicode(String value)
at Aspose.Pdf.Engine.CommonData.Text.Encoding.PdfFontEncoding.PdfFontEncodingBase.Decode(String value)
at Aspose.Pdf.Engine.Data.PdfString.get_ExtractedString()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.PhysicalTextSegment.DecodeString(IPdfString iPdfString)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.ArrayTextSegment.OnParametersInitialized()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.ArrayTextSegment..ctor(OperatorLink opLink, OperatorLink firstBlockOpLink, TextSegmentBuilder segmentBuilder, IResourceDictionary resources, Double xIndent, Double yIndent, PhysicalTextState textState)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.PhysicalTextSegment.CreateTextSegment(OperatorLink opLink, OperatorLink firstBlockOpLink, TextSegmentBuilder segmentBuilder, IResourceDictionary resources, Double xIndent, Double yIndent, PhysicalTextState textState)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.AddPhysicalSegment(Int32 opIndex, Int32 btOpIndex, IPageOperator op, PhysicalTextState textState)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.AddTextSegment(OperatorLink opLink)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.TJ(Int32 opIndex, IPageOperator op)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.Parse()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter.BuildPageSegments(Queue commandQueue, IPdfPrimitive contentStream, IResourceDictionary resources)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter.BuildPageSegments(IPdfPrimitive contents, IResourceDictionary resources)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter.BuildSegments()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter..ctor(IPage page)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()"
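A side note on the sample code above: both stack traces end inside PdfExtractor.ExtractText(), so the exceptions are thrown before GetText and the StreamReader are ever reached. To rule out the decoding step on its own, it can be checked without Aspose at all. The following is only a rough sketch using the .NET base class library. It assumes that the extractor writes UTF-16 LE bytes to the stream, which is what the Encoding.Unicode reader in the sample implies but is not confirmed here. The names DecodeCheck and ReadAsUtf16 are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class DecodeCheck
{
    // Hypothetical helper: rewinds the stream and decodes it as UTF-16 LE,
    // mirroring the StreamReader step in the sample code.
    public static string ReadAsUtf16(Stream stream)
    {
        stream.Seek(0, SeekOrigin.Begin);
        using (StreamReader reader = new StreamReader(stream, Encoding.Unicode))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string original = "مرحبا 你好 สวัสดี";

        // Stand-in for pdfExtractor.GetText(stream).
        // Assumption: the extractor writes UTF-16 LE bytes.
        MemoryStream stream = new MemoryStream(Encoding.Unicode.GetBytes(original));

        string decoded = ReadAsUtf16(stream);

        // Reports whether the text survived the round trip unchanged.
        Console.WriteLine(decoded == original);

        // Same final step as the sample: save the text as UTF-8.
        File.WriteAllText("output.txt", decoded, Encoding.UTF8);
    }
}
```

If this round trip preserves Arabic, Chinese, and Thai text, the reader and writer encodings in the sample are consistent with each other, and the failures would have to come from the PDF parsing stage shown in the traces.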
Hi Nuch,
Please share the input PDF file with us so that we can investigate the issue at our end. We’ll update you with the results accordingly.
We’re sorry for the inconvenience.
Regards,
Hi Shahzad,
We’d be pleased to share a set of our sample files, some that extract correctly and some that produce errors.
Please download the files from the following links:
http://www.languagestudio.com/downloads/pdf/sample-error.zip
http://www.languagestudio.com/downloads/pdf/sample-ok.zip
Thank you in advance for your quick support,
Nuch
Hi Nuch,
I have reproduced both of these issues at my end and logged them in our issue tracking system as follows:
PDFNEWNET-29110 - Sample-Arabic.pdf
PDFNEWNET-29111 - Sample-Arabic-2.pdf
Our team will look into these issues and you’ll be updated via this forum thread once they’re resolved.
We’re sorry for the inconvenience.
Regards,
The issues you found earlier (filed as PDFNEWNET-29110 and PDFNEWNET-29111) have been fixed in this update.
This message was posted using Notification2Forum from Downloads module by aspose.notifier.