Different Language Encoding in PDF and MS Word Files

aaiti · March 31, 2011, 4:23am

Can Aspose correctly extract Chinese, Vietnamese, and Thai text from PDF and MS Word files?

alexey.noskov · March 31, 2011, 11:35am

Hi

Thanks for your request. I am a representative of the Aspose.Words team. Yes, you can extract text in different languages from MS Word documents using Aspose.Words. Please follow the link to learn how to extract text: Aspose.Words text extraction.

Best regards,

shahzadlatif · April 1, 2011, 2:29am

Hi,

I’m a representative of Aspose.Pdf.Kit, and I would like to share that Aspose.Pdf.Kit for Java allows you to extract text in different languages from a PDF file. You may find a sample in this article. Please download the latest version and try it at your end. Please also note that text extraction is quite limited in evaluation mode, so you may want to get a 30-day temporary license to test this feature fully.

I hope this helps. If you run into any issues or have further questions, please let us know.
Regards,

natnapaporn.thin · July 7, 2011, 11:54pm

Does Aspose.Pdf.Kit for .NET also allow us to extract text in different languages from a PDF file?

We’ve tried the sample from the .NET article (Aspose.Total for .NET documentation), but it doesn’t work with Arabic text.

Sample code:

using System.Text; // needed for Encoding

// create an instance of the PdfExtractor class
Aspose.Pdf.Kit.PdfExtractor extractor = new Aspose.Pdf.Kit.PdfExtractor();

// bind the PDF file to the extractor object
extractor.BindPdf(@"D:\Text\text.pdf");

// extract all text from the PDF
extractor.ExtractText(Encoding.UTF8);

// save the extracted text to a text file
extractor.GetText(@"D:\Text\text.txt");
// end of sample code

Remark: here is product info.

Product name: Aspose.Pdf.Kit

File version: 5.5.0.0
License: Aspose.Total

shahzadlatif · July 8, 2011, 8:47am

Hi Nuch,

First, please try the code given in method 2 on this page.

Secondly, we have released a merged Aspose.Pdf for .NET, which contains the features of both Aspose.Pdf for .NET and Aspose.Pdf.Kit for .NET. You only need to reference the Aspose.Pdf.Facades namespace to use the features that were available in Aspose.Pdf.Kit for .NET.

Please try it at your end and see if it helps. If you have any further questions, please let us know.
Regards,

natnapaporn.thin · July 11, 2011, 11:05am

Thanks for your support.

However, I've already followed both steps, and it still doesn't work.

Sample Code:
// use Aspose.Pdf.dll (v.6.0.0)
using System.IO;
using System.Text;
using Aspose.Pdf.Facades;

// open input PDF
PdfExtractor pdfExtractor = new PdfExtractor();
pdfExtractor.BindPdf("input.pdf");
// use parameterless ExtractText method
pdfExtractor.ExtractText();
MemoryStream tempMemoryStream = new MemoryStream();
pdfExtractor.GetText(tempMemoryStream);
string text = "";
// specify Unicode encoding type in StreamReader constructor
using (StreamReader streamReader = new StreamReader(tempMemoryStream, Encoding.Unicode))
{
    // rewind the stream before reading the extracted text back
    streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
    text = streamReader.ReadToEnd();
}
File.WriteAllText("output.txt", text, Encoding.UTF8);

Error#1:
Message = "Inflating error. Please check the following message: incorrect data check"
Source = "Aspose.Pdf"
StackTrace = "at Aspose.Pdf.Engine.Filters.Impls.FlateDecode.ThirdParty.ZInflaterInputStream.Read(Byte[] b, Int32 off, Int32 len)
at Aspose.Pdf.Engine.Filters.Impls.FlateDecode.FlateDecode.Decode(Byte[] data, Object[] parameters)
at Aspose.Pdf.Engine.Filters.CompositeDecoder.Decode(Byte[] data, Object[] parameters)
at Aspose.Pdf.Engine.Data.Types.PdfStreamAccessor.get_DecodedData()
at Aspose.Pdf.Engine.CommonData.PageContent.ContentBuilder.InitWriteContent()
at Aspose.Pdf.Engine.CommonData.PageContent.ContentBuilder..ctor(IPage page)
at Aspose.Pdf.Engine.Factory.PdfInternalFactory.CreateContentBuilder(IPage page)
at Aspose.Pdf.Engine.CommonData.PageTreeNode.get_ContentBuilder()
at Aspose.Pdf.Resources..ctor(Object parent)
at Aspose.Pdf.Page..ctor(IPage page)
at Aspose.Pdf.PageCollection.get(Int32 index)
at Aspose.Pdf.PageCollection.get_Item(Int32 index)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()"

Error#2:
Message = "Value was either too large or too small for an Int32."
Source = "mscorlib"
StackTrace = " at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
at System.Int32.Parse(String s, NumberStyles style)
at Aspose.Pdf.Engine.CommonData.Text.Encoding.CMapEncoding.Decode(Char value)
at Aspose.Pdf.Engine.CommonData.Text.Encoding.PdfFontEncoding.CIDFontEncodingBase.PdfBytesToUnicode(String value)
at Aspose.Pdf.Engine.CommonData.Text.Encoding.PdfFontEncoding.PdfFontEncodingBase.Decode(String value)
at Aspose.Pdf.Engine.Data.PdfString.get_ExtractedString()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.PhysicalTextSegment.DecodeString(IPdfString iPdfString)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.ArrayTextSegment.OnParametersInitialized()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.ArrayTextSegment..ctor(OperatorLink opLink, OperatorLink firstBlockOpLink, TextSegmentBuilder segmentBuilder, IResourceDictionary resources, Double xIndent, Double yIndent, PhysicalTextState textState)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.PhysicalTextSegment.CreateTextSegment(OperatorLink opLink, OperatorLink firstBlockOpLink, TextSegmentBuilder segmentBuilder, IResourceDictionary resources, Double xIndent, Double yIndent, PhysicalTextState textState)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.AddPhysicalSegment(Int32 opIndex, Int32 btOpIndex, IPageOperator op, PhysicalTextState textState)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.AddTextSegment(OperatorLink opLink)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.TJ(Int32 opIndex, IPageOperator op)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmentBuilder.Parse()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter.BuildPageSegments(Queue commandQueue, IPdfPrimitive contentStream, IResourceDictionary resources)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter.BuildPageSegments(IPdfPrimitive contents, IResourceDictionary resources)
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter.BuildSegments()
at Aspose.Pdf.Engine.CommonData.Text.Segmenting.TextSegmenter..ctor(IPage page)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()"
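
A side note on the StreamReader step in the sample code: both stack traces end in PdfExtractor.ExtractText(), so the exceptions are raised before any text is read back. The encoding passed to StreamReader still matters once extraction succeeds. The sketch below uses only System.IO and System.Text, with no Aspose calls. It assumes the extracted bytes are UTF-16 little-endian without a byte order mark, which is a guess based on the sample's use of Encoding.Unicode and not a documented fact. The class name EncodingRoundTrip and the helper ReadAll are hypothetical names used for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingRoundTrip
{
    // Rewinds the stream and decodes all of its bytes with the given encoding.
    // BOM detection is switched off so the chosen encoding is always used.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        stream.Seek(0, SeekOrigin.Begin);
        using (var reader = new StreamReader(stream, encoding, false))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Arabic sample word written with escapes so the source file
        // encoding does not affect the demonstration.
        string original = "\u0645\u0631\u062D\u0628\u0627";

        // Stand-in for the extractor output (assumption: UTF-16 LE, no BOM).
        byte[] bytes = Encoding.Unicode.GetBytes(original);

        // Decoding with the same encoding that produced the bytes.
        string matching = ReadAll(new MemoryStream(bytes), Encoding.Unicode);

        // Decoding the same bytes with a different encoding.
        string mismatched = ReadAll(new MemoryStream(bytes), Encoding.UTF8);

        Console.WriteLine(matching == original);
        Console.WriteLine(mismatched == original);
    }
}
```

Here the first comparison holds and the second does not, because the UTF-16 bytes are misread when decoded as UTF-8. If extracted Arabic text comes out garbled without an exception, checking that the reader's encoding matches the bytes actually written is a cheap first step.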

shahzadlatif · July 12, 2011, 2:52am

Hi Nuch,

Please share the input PDF file with us so that we can investigate the issue at our end. We’ll update you with the results.

We’re sorry for the inconvenience.
Regards,

natnapaporn.thin · July 12, 2011, 4:08am

Hi Shahzad,

We’re pleased to share a set of our sample files, some that extract correctly and some that produce errors.
Please download the files from the following links:

http://www.languagestudio.com/downloads/pdf/sample-error.zip
http://www.languagestudio.com/downloads/pdf/sample-ok.zip

Thank you in advance for your quick support,
Nuch

shahzadlatif · July 13, 2011, 1:48am

Hi Nuch,

I have reproduced both of these issues at my end and logged them in our issue tracking system as follows:

PDFNEWNET-29110 - Sample-Arabic.pdf
PDFNEWNET-29111 - Sample-Arabic-2.pdf

Our team will look into these issues and you’ll be updated via this forum thread once they’re resolved.

We’re sorry for the inconvenience.
Regards,

aspose.notifier · July 17, 2012, 11:25am

The issues you found earlier (filed as PDFNEWNET-29110 and PDFNEWNET-29111) have been fixed in this update.
