ExtractText

Hello,

We recently upgraded to Aspose.Pdf.Kit 3.6.0.0 and we are getting different results when extracting text from PDF files. What are the characters (see the attached "Extracted text.txt") that appear before every number in the text extracted from the attached "Test 16.pdf"? These characters are making our application behave differently than it did before the upgrade.

Our code snippet:

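// Requires: using System.IO; using System.Text; using Aspose.Pdf.Kit;
PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(path);   // path: location of the PDF file
extractor.ExtractText();
// Copy the extracted text into memory and decode it as UTF-16.
MemoryStream os = new MemoryStream();
extractor.GetText(os);
string s = new UnicodeEncoding().GetString(os.ToArray());
os.Close();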

Thanks,

Juno

Hi Juno,

I have tested this file with versions 3.5, 3.6 and 3.8, and the result is the same with all of them. The problem is not with the component; rather, the file contains CR and LF characters (\r and \n), which represent carriage return and line feed. These characters are part of the PDF content. If you don’t want them in the extracted text, you can remove them with the string Replace method in your code once the text is extracted.

I hope this helps. If you still have any questions or find some problems, please do let us know.
Regards,

Hello,

The carriage return and new line are fine; I was talking about the characters before the numbers. Please see the attached 'Extracted text.bmp'.

Thanks,

Juno

Hi Juno,

I have tested the issue again at my end but couldn’t reproduce it. Can you please share some more details, i.e. OS/machine specs, VS.NET version, etc.? We’ll have to reproduce the issue at our end to resolve it.

We’re sorry for the inconvenience.
Regards,

Hello,

Please see below:

System:
Windows XP Professional Version 2002 Service Pack 3
Computer:
Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz
3.00 GB of RAM

Visual Studio 2008 Version 9.0.30729.1 SP
.NET Framework Version 3.5 SP1

We are using Aspose.Pdf.Kit.dll from "C:\Program Files\Aspose\Aspose.Pdf.Kit for .NET\Bin\net11" and we also tried with the one from "C:\Program Files\Aspose\Aspose.Pdf.Kit for .NET\Bin\net35" with the same results.

Please let me know if you need any other information.

Thanks,

Juno

The character code is 65279.
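
For reference, a minimal sketch of how a code like this can be read off the extracted string. The names CharCodeDump and FindInvisible are hypothetical and not part of Aspose.Pdf.Kit; the sketch relies only on standard .NET calls and reports characters in the Unicode "Format" category, which includes code 65279 (U+FEFF) but not CR or LF.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class CharCodeDump
{
    // Returns one "index:code" entry for every character whose Unicode
    // category is Format (invisible formatting characters such as U+FEFF).
    public static List<string> FindInvisible(string text)
    {
        List<string> hits = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.GetUnicodeCategory(c) == UnicodeCategory.Format)
            {
                hits.Add(i + ":" + (int)c);
            }
        }
        return hits;
    }
}
```

Calling CharCodeDump.FindInvisible(s) on the string produced by the snippet earlier in this thread lists the position and decimal code of each such character.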

Extracting the text to a file and opening it in WordPad and Notepad, this is what I see (attached).

Hi Juno,

Thank you very much for sharing the details. We’re looking into the matter at our end. Please spare us some time for detailed investigation.

We’re sorry for the inconvenience.
Regards,

Hi Juno,

I couldn’t reproduce the issue at my end, although I tested with the specifications you mentioned. Can you please share two or three other sample files that have the same problem? In order to understand the issue and then resolve it, we need to reproduce it at our end.

We’re sorry for the inconvenience.
Regards,




Hello,

I understand that you need to replicate in order to fix the problem. I will request some other example files from our QA team.

Thanks,

Juno

Hello,

How are you extracting the text from the PDF file? Can you share your code?

Thanks,

Juno

Hi Juno,

I’m using the same code as shared by you.

Regards,

Hello,

I cannot see what we are doing differently. Please see the attached file, which is another one I have the same problem with.

Thanks,

Juno

Hello,

After looking at some other PDF files, I'm basically seeing the character with code 65279 at the beginning of each page for any PDF (see another example attached).
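
Code 65279 is U+FEFF, the Unicode byte order mark. One way such a character can end up in a decoded string is sketched below. This is only an illustration under the assumption that the extracted bytes contain the UTF-16 preamble (FF FE), which the thread does not confirm. BomDecodeDemo and DecodeWithPreamble are hypothetical names; the Encoding calls are standard .NET.

```csharp
using System;
using System.Text;

public static class BomDecodeDemo
{
    // Builds UTF-16 (little endian) bytes that start with the preamble FF FE,
    // then decodes them with Encoding.GetString, as the snippet in this thread does.
    public static string DecodeWithPreamble(string payload)
    {
        Encoding utf16 = new UnicodeEncoding();
        byte[] preamble = utf16.GetPreamble();
        byte[] body = utf16.GetBytes(payload);

        byte[] all = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, all, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, all, preamble.Length, body.Length);

        // GetString decodes every byte pair, so the preamble becomes U+FEFF.
        return utf16.GetString(all);
    }
}
```

In this sketch the returned string is one character longer than the payload, and its first character has code 65279.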

Thanks,

Juno

Hi Juno,

This issue is logged as PDFKITNET-11494 in our issue tracking system. Our team is looking into the matter and this issue will be resolved in our monthly release due at the end of November.

We appreciate your patience.
Regards,

Hello,

Do you have an update on this?

Thanks,

Juno

Hi Juno,

This issue is resolved in our upcoming version. The new release will be published this week.

If you have any other questions, please do let us know.
Regards,

The issues you have found earlier (filed as 11494) have been fixed in this update.

Hello,

After replacing Aspose.Pdf.Kit with the latest version (3.9.0.0), I'm still getting the Unicode byte order mark at the beginning of the text extracted from any PDF page. I opened Windows Process Explorer while running our application, and it shows the latest version of Aspose.Pdf.Kit (3.9.0.0). Is there something else I can check to make sure I'm using the correct DLL?

Thanks,

Juno

Hi Juno,

We’re looking into this problem and will update you as soon as possible.

We’re sorry for the inconvenience.
Regards,

Hello,

Is it safe to assume that the Unicode BOM is going to be at the beginning of the text extracted from each page for **any** PDF? We may have to patch our code to ignore the BOM if the fix for this takes much longer.
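
A minimal sketch of what such a patch could look like, assuming the only unwanted character is U+FEFF (code 65279) and that it can be dropped wherever it occurs rather than only at page starts. BomWorkaround and StripBom are hypothetical names, not part of Aspose.Pdf.Kit.

```csharp
using System.Text;

public static class BomWorkaround
{
    private const char Bom = '\uFEFF';

    // Returns the input with every U+FEFF character removed.
    // Null and empty inputs are returned unchanged.
    public static string StripBom(string extracted)
    {
        if (string.IsNullOrEmpty(extracted))
        {
            return extracted;
        }

        StringBuilder cleaned = new StringBuilder(extracted.Length);
        foreach (char c in extracted)
        {
            if (c != Bom)
            {
                cleaned.Append(c);
            }
        }
        return cleaned.ToString();
    }
}
```

Applied to the string decoded from the extractor output, for example s = BomWorkaround.StripBom(s), this would not depend on where the BOM appears, so it should stay harmless once a fixed version no longer emits it.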

Thanks,

Juno