Text extraction produces garbage

Hi everyone on the support team,

The PDF text extraction method produces garbage when used with the attached PDF document.


Platform:
Windows 7 (64 Bit)
.NET 2.0
Aspose.Pdf.Kit 5.2.0.0

Example code:

using Aspose.Pdf.Kit;

class Program
{
    static void Main(string[] args)
    {
        string strSourceFile = @"c:\temp\Test1.pdf";
        string strDestFile = @"c:\temp\Test1.txt";

        // Apply the license before using the extractor.
        Aspose.Pdf.Kit.License license = new Aspose.Pdf.Kit.License();
        license.SetLicense("Aspose.Total.lic");

        PdfExtractor objPdfExtr = new PdfExtractor();

        // Bind the source PDF, extract its text, and write it to the destination file.
        objPdfExtr.BindPdf(strSourceFile);
        objPdfExtr.ExtractTextMode = 0;
        objPdfExtr.ExtractText();
        objPdfExtr.GetText(strDestFile);
    }
}


Thanks in advance, Martin

internal bugid#4014

Hi Martin,

I have reproduced this problem at my end and logged it as PDFKITNET-24064 in our issue tracking system. Our team will look into this issue and you’ll be updated via this forum thread once it is resolved.

We’re sorry for the inconvenience.
Regards,

Hello Shahzad,

The problem still exists with Aspose.Pdf.Kit 5.8.0.0.

Best regards, Martin

Hi Martin,

I’m sorry to share with you that this issue is not yet resolved. However, I will contact our development team regarding the ETA of this issue and you’ll be updated accordingly.

We’re sorry for the inconvenience.
Regards,

Hi,

I just tested the issue with the latest version, Aspose.Pdf 6.4.2.0. It’s still not solved.

I tried both Aspose.Pdf.Facades.PdfExtractor and the new Aspose.Pdf.Text.TextAbsorber. Both extraction methods produce only garbage (see 11-01-27.txt).




Hi Martin,

I’m sorry to share with you that this issue is not yet resolved. However, our team is working on it and you’ll be updated on its status as soon as possible.

We’re sorry for the inconvenience.
Regards,

Hi,

I just tested text extraction with Aspose.Pdf.dll 6.6.0 and it is still not working.

The extracted text looks like this:

8
. 9
9 ,9




( "

aG7
; He*;))A



$&$;
$29





".1B9
%
$2C%;a2I
$$2



"

& a
$11223/&2





C

$27%!++7&=
74



8
. 9


Best regards, Martin

Hi Martin,

Thank you for your patience.

I tested your issue with the latest version of Aspose.Pdf for .NET (v6.6), and the issue still exists in this merged version. I have asked the development team to share an ETA for your reported issue as a high priority. I will update you via this forum thread as soon as I hear back from the development team.

Sorry for the inconvenience,

Hi,

I just tested the new Aspose.Pdf.dll 6.7.0.0 and the problem is still unresolved.

Best regards, Martin

Hi Martin,

I would like to update you that your reported issue (garbage text extraction) has been fixed in the upcoming version of Aspose.Pdf for .NET, v6.8, which will be released after testing in the first half of this month. In our initial testing, the extraction result is much closer to Adobe's, but some differences remain for characters that cannot be mapped to Unicode: Aspose and Adobe simply insert different symbols in place of those characters. Please be patient and wait for the release. You will be notified via this forum thread once the new version is available for download.

Sorry for the inconvenience,

Thanks for your patience. Please note that the issues you have found earlier (filed as PDFNEWNET-32158) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.

Hi,

I just tested the new Aspose.Pdf.dll 6.8.0.0 and the problem is still unresolved.

Best regards, Martin

Hi Martin,

I have tested your template file with the latest version of Aspose.Pdf for .NET (v6.8) and it works fine. I have attached the resulting file for your reference. If you are testing with a different file, please share that file with us and we will check it.

Sorry for the inconvenience,