PDF text extraction failing for Japanese characters

未コンファー�未コンファー ������� ��年�月�日����� ��������������������
��被検者の������閧�����
��年�月�日���� 心室レート � �� ���������������������������
女性 �間隔 �� � �����������������
�� �� ��� �� � ������
ルーム� ������ ������ � �������������
Test_Action_FormatReportCompare_0927171415_3.1_ECG.pdf (55.3 KB)

@rmcdougall,
We have tested your source PDF with the latest version 17.9 of Aspose.Pdf for .NET API and the output text looks fine. Please try the following code:

[C#]

using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

string dataDir = @"C:\Pdf\test357\";
// Open the document
Document pdfDocument = new Document(dataDir + "Test_Action_FormatReportCompare_0927171415_3.1_ECG.pdf");

// Create a TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);
// Get the extracted text
string extractedText = textAbsorber.Text;
// Write the text to a file; the using block closes the writer even if the write throws
using (TextWriter tw = new StreamWriter(dataDir + "extracted-text.txt"))
{
    tw.WriteLine(extractedText);
}

This is the output text file: extracted-text.zip (846 Bytes)

Wow, thanks for the quick response Rafique.

The C# code you provided is exactly what we are doing.
However, we are on Aspose version 9.9.0.0.

So we will get a trial version of your latest release and test it out.
We will let you know how it goes.

Thanks again

@rmcdougall,
Sure, please let us know if you run into any problems while testing the latest version, 17.9. We encourage our clients to post their issues proactively in the Aspose forums.

@rmcdougall,

Adding to Imran’s comments: when testing the trial (unlicensed) version, there are some limitations on manipulating elements inside the document. If your license is not valid for the latest release, you may consider requesting a 30-day temporary license to test the API without any limitations.

What is the process to upgrade to the latest version? We actually have 2 issues in our current licensed 9.9 version that are fixed in your latest software.
Thanks

@rmcdougall,

Please open your license file in any text editor, e.g. Notepad, and check the license expiry date in the subscription expiry tag:

e.g.
expiry date: 20110218
This means that you can upgrade for free to any Aspose version published before 02/18/2011.

If the license has not expired, you need to replace the old Aspose.PDF DLL with the latest one. We document the release notes for each Aspose.PDF API version. Please refer to this help section: Release Notes of Aspose.PDF API
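
The expiry check described in this reply can also be scripted. The following is only a rough sketch, not an Aspose API: the class and method names are hypothetical, and it assumes the license file is XML with the date stored as yyyyMMdd in a SubscriptionExpiry element. The thread does not say whether a release published on the expiry date itself is covered, so the sketch treats it as not covered.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of Aspose.Pdf.
public static class LicenseExpiryCheck
{
    // Reads the subscription expiry date from the license XML text.
    // Assumption: the date is stored as yyyyMMdd in a <SubscriptionExpiry> element.
    public static DateTime ReadSubscriptionExpiry(string licenseXml)
    {
        XDocument doc = XDocument.Parse(licenseXml);
        XElement expiry = doc.Descendants("SubscriptionExpiry").FirstOrDefault();
        if (expiry == null)
        {
            throw new FormatException("No SubscriptionExpiry element found.");
        }
        return DateTime.ParseExact(expiry.Value.Trim(), "yyyyMMdd",
                                   CultureInfo.InvariantCulture);
    }

    // True when the release was published before the expiry date.
    public static bool CoversRelease(DateTime expiry, DateTime releaseDate)
    {
        return releaseDate.Date < expiry.Date;
    }
}
```

With the example date 20110218, CoversRelease returns false for any release date in 2017, so a license with that expiry would not cover a 2017 release. To check a real file, pass File.ReadAllText(path) to ReadSubscriptionExpiry.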

We have updated the software and Japanese is now extracting correctly. Thanks!
We are now having an issue when trying to extract Simplified Chinese text.

We get the exception:
Cannot find resource ‘Aspose.Pdf.src.CommonData.Text.CMaps.PredefinedCMaps.GB-EUC-H’.

The pdf doc contains:
12 0 obj
<</Type/Font/Subtype/Type0/BaseFont/AdobeSTSongStd-Light/Encoding/GB-EUC-H/DescendantFonts[13 0 R]>>
endobj

I thought you might know right away what the issue is.
Thanks for your help.
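
To see which encoding names a PDF refers to, such as the GB-EUC-H entry in the font object quoted above, the raw file can be scanned for /Encoding entries. This is a minimal sketch with hypothetical class and method names, and it does not use Aspose. It only sees font dictionaries stored uncompressed in the file, so fonts inside compressed object streams will be missed, as will encodings given as indirect references.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical diagnostic helper, not part of Aspose.Pdf.
public static class FontEncodingScan
{
    // Matches "/Encoding/Name" or "/Encoding /Name" and captures the name.
    private static readonly Regex EncodingName =
        new Regex(@"/Encoding\s*/([^\s/<>\[\]()]+)", RegexOptions.Compiled);

    // Returns the distinct encoding names found in raw PDF text.
    public static SortedSet<string> FindEncodingNames(string rawPdfText)
    {
        var names = new SortedSet<string>(StringComparer.Ordinal);
        foreach (Match m in EncodingName.Matches(rawPdfText))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }

    // Reads a PDF file as Latin-1 so that every byte maps to exactly one char,
    // then scans it for encoding names.
    public static SortedSet<string> FindEncodingNamesInFile(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        string raw = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        return FindEncodingNames(raw);
    }
}
```

Run against the font object quoted above, FindEncodingNames returns a set containing only GB-EUC-H, which is the CMap name in the exception message.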

@rmcdougall

Thanks for your feedback.

Would you please share your sample PDF document with us? We will test the scenario in our environment and address it accordingly.