ExtractText

word · October 29, 2009, 4:14pm

Hello,

We recently upgraded to Aspose.Pdf.Kit 3.6.0.0 and we are getting different results when extracting the text from PDF files. What are the characters (see attached "Extracted text.txt") that appear before every number in the text extracted from the attached "Test 16.pdf"? These characters are making our application behave differently than it did before the upgrade.

Our code snippet:

PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(path);
extractor.ExtractText();
MemoryStream os = new MemoryStream();
extractor.GetText(os);
string s = new UnicodeEncoding().GetString(os.ToArray());
os.Close();
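As a side note on the decoding step, Encoding.GetString in .NET does not strip a leading byte order mark from the bytes it decodes. The sketch below illustrates only that behavior. It builds its own UTF-16 stream instead of calling the extractor, so the class and method names are hypothetical, and it makes no claim about what PdfExtractor.GetText actually writes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; it does not use Aspose.Pdf.Kit.
static class BomDecodeDemo
{
    // Writes a UTF-16LE byte order mark followed by the encoded content,
    // then decodes the bytes with GetString, as the snippet above does.
    public static string DecodeWithGetString(string content)
    {
        UnicodeEncoding encoding = new UnicodeEncoding();
        using (MemoryStream stream = new MemoryStream())
        {
            byte[] preamble = encoding.GetPreamble();   // bytes FF FE
            byte[] body = encoding.GetBytes(content);
            stream.Write(preamble, 0, preamble.Length);
            stream.Write(body, 0, body.Length);

            // GetString decodes the mark as an ordinary character (U+FEFF).
            return encoding.GetString(stream.ToArray());
        }
    }
}
```

Under these assumptions, the decoded string carries one extra character, U+FEFF, in front of the content.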

Thanks,

Juno

shahzadlatif · October 30, 2009, 5:23am

Hi Juno,

I have tested this file with 3.5, 3.6 and 3.8, and the result is the same with all of them. This is not a problem with the component; rather, the file contains CR and LF characters (\r and \n), which represent carriage return and line feed. These characters are part of the PDF content. If you don’t want them in the extracted text, you can remove them with the string Replace method in your code once the text is extracted.
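A minimal sketch of that post-processing step, assuming the extracted text is already held in a string. The class and method names are hypothetical and are not part of Aspose.Pdf.Kit.

```csharp
using System;

// Hypothetical helper; operates on text that has already been extracted.
static class LineBreakCleaner
{
    // Removes every carriage return and line feed from the text.
    public static string StripLineBreaks(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        return text.Replace("\r", string.Empty).Replace("\n", string.Empty);
    }
}
```

Replacing the line breaks with a space instead of an empty string may be preferable when words on adjacent lines should stay separated.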

I hope this helps. If you still have any questions or find some problems, please do let us know.
Regards,

word · October 30, 2009, 7:12am

Hello,

The carriage return and new line are fine. I was talking about the characters before the numbers; please see the attached 'Extracted text.bmp'.

Thanks,

Juno

shahzadlatif · October 30, 2009, 1:08pm

Hi Juno,

I have tested the issue again at my end but couldn’t reproduce it. Can you please share some more details, e.g. OS/machine specs, VS.NET version, etc.? We’ll have to reproduce the issue at our end to resolve it.

We’re sorry for the inconvenience.
Regards,

word · October 30, 2009, 1:45pm

Hello,

Please see below,

System:
Windows XP Professional Version 2002 Service Pack 3
Computer:
Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz
3.00 GB of RAM

Visual Studio 2008 Version 9.0.30729.1 SP
.NET Framework Version 3.5 SP1

We are using Aspose.Pdf.Kit.dll from "C:\Program Files\Aspose\Aspose.Pdf.Kit for .NET\Bin\net11" and we also tried with the one from "C:\Program Files\Aspose\Aspose.Pdf.Kit for .NET\Bin\net35" with the same results.

Please let me know if you need any other information.

Thanks,

Juno

word · October 30, 2009, 2:09pm

The character code is 65279.

Extracting the text to a file and opening it in WordPad and Notepad, this is what I see (attached).
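Code 65279 is 0xFEFF in hexadecimal, the code point of the Unicode byte order mark. The following sketch shows one way to locate every occurrence of that character in an extracted string. The helper names are hypothetical, and the comparison is done character by character because culture-sensitive string searches can ignore U+FEFF.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical diagnostic helper for already-extracted text.
static class BomFinder
{
    // U+FEFF, decimal 65279.
    public const char Bom = '\uFEFF';

    // Returns the zero-based index of every U+FEFF character in the text.
    public static List<int> FindBomPositions(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        List<int> positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == Bom)
            {
                positions.Add(i);
            }
        }
        return positions;
    }
}
```

Printing the returned indexes next to the surrounding characters can show whether the marks sit before numbers, at page starts, or elsewhere.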

shahzadlatif · October 31, 2009, 12:12pm

Hi Juno,

Thank you very much for sharing the details. We’re looking into the matter at our end. Please spare us some time for detailed investigation.

We’re sorry for the inconvenience.
Regards,

shahzadlatif · November 2, 2009, 8:53am

Hi Juno,

I couldn’t reproduce the issue at my end, although I tested with the specifications you mentioned. Can you please share two or three other sample files that have the same problem? In order to understand the issue and then resolve it, we need to reproduce it at our end.

We’re sorry for the inconvenience.
Regards,

word · November 2, 2009, 9:06am

Hello,

I understand that you need to replicate in order to fix the problem. I will request some other example files from our QA team.

Thanks,

Juno

word · November 2, 2009, 9:47am

Hello,

How are you extracting the text from the PDF file? Can you share your code?

Thanks,

Juno

shahzadlatif · November 2, 2009, 9:51am

Hi Juno,

I’m using the same code as shared by you.

Regards,

word · November 2, 2009, 10:26am

Hello,

I cannot see what we are doing differently. Please see attached another file that I have the same problem with.

Thanks,

Juno

word · November 2, 2009, 2:03pm

Hello,

After looking at some other PDF files, I'm basically seeing the character with code 65279 at the beginning of each page for any PDF (see another example attached).

Thanks,

Juno

shahzadlatif · November 3, 2009, 11:27am

Hi Juno,

This issue is logged as PDFKITNET-11494 in our issue tracking system. Our team is looking into the matter and this issue will be resolved in our monthly release due at the end of November.

We appreciate your patience.
Regards,

word · December 1, 2009, 11:09am

Hello,

Do you have an update on this?

Thanks,

Juno

shahzadlatif · December 3, 2009, 4:28am

Hi Juno,

This issue is resolved in our upcoming version. The new release will be published this week.

If you have any other questions, please do let us know.
Regards,

aspose.notifier · December 4, 2009, 12:55pm

The issues you have found earlier (filed as 11494) have been fixed in this update.


word · December 7, 2009, 8:50am

Hello,

After replacing Aspose.Pdf.Kit with the latest version (3.9.0.0), I'm still getting the Unicode byte order mark at the beginning of the text extracted from any PDF page. I opened Windows Process Explorer while running our application and it shows the latest version of Aspose.Pdf.Kit (3.9.0.0). Is there something else I can check to make sure I'm using the correct DLL?
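One further check, sketched here with standard .NET reflection, is to ask the running process which assembly a given type was loaded from. The helper name is hypothetical; only Type.Assembly, Assembly.GetName and Assembly.Location are actual framework members.

```csharp
using System;
using System.Reflection;

// Hypothetical helper built on standard reflection calls.
static class AssemblyInfoHelper
{
    // Describes the assembly that defines the given type:
    // its simple name, its version, and the file it was loaded from.
    public static string Describe(Type type)
    {
        if (type == null) throw new ArgumentNullException("type");
        Assembly assembly = type.Assembly;
        AssemblyName name = assembly.GetName();
        return name.Name + " " + name.Version + " loaded from " + assembly.Location;
    }
}
```

Logging AssemblyInfoHelper.Describe(typeof(PdfExtractor)) from inside the application would show the version and path of the DLL that is actually in use, assuming PdfExtractor is the type from the earlier snippet.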

Thanks,

Juno

shahzadlatif · December 8, 2009, 1:21pm

Hi Juno,

We’re looking into this problem and will update you as soon as possible.

We’re sorry for the inconvenience.
Regards,

word · December 9, 2009, 8:40am

Hello,

Is it safe to assume that the Unicode BOM is going to be at the beginning of the text extracted from each page for **any** PDF? We may have to patch our code to ignore the BOM if the fix for this takes much longer.
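A minimal sketch of such a patch, assuming the mark is always the single character U+FEFF. It removes the character wherever it occurs, so it does not depend on the mark appearing only at the start of a page. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical workaround applied to text after extraction.
static class BomStripper
{
    // Returns a copy of the text with every U+FEFF character removed.
    public static string RemoveBom(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        StringBuilder cleaned = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c != '\uFEFF')
            {
                cleaned.Append(c);
            }
        }
        return cleaned.ToString();
    }
}
```

Because the filter is a no-op on text that contains no mark, it could stay in place after a library fix without changing the output.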

Thanks,

Juno