We recently upgraded to Aspose.Pdf.Kit 3.6.0.0 and we are getting different results when extracting the text from PDF files. What are the characters (see attached "Extracted text.txt") that appear before every number in the text extracted from the attached "Test 16.pdf"? These characters are making our application behave differently than it did before the upgrade.
Our code snippet:
PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(path);
extractor.ExtractText();
MemoryStream os = new MemoryStream();
extractor.GetText(os);
string s = new UnicodeEncoding().GetString(os.ToArray());
I have tested this file with versions 3.5, 3.6 and 3.8, and the result is the same with all of them. In fact, this is not a problem with the component; rather, the file contains CR and LF characters (\r and \n), which represent carriage return and line feed. These characters are part of the PDF content. So, if you don’t want these characters in the extracted text, you can remove them with the string Replace method in your code once the text is extracted.
I hope this helps. If you still have any questions or run into any problems, please do let us know. Regards,
I have tested this again at my end but couldn’t reproduce the issue. Can you please share some more details, such as OS/machine specs, VS.NET version, etc.? We’ll have to reproduce the issue at our end to resolve it.
System: Windows XP Professional Version 2002 Service Pack 3 Computer: Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz 3.00 GB of RAM
Visual Studio 2008 Version 9.0.30729.1 SP .NET Framework Version 3.5 SP1
We are using Aspose.Pdf.Kit.dll from "C:\Program Files\Aspose\Aspose.Pdf.Kit for .NET\Bin\net11" and we also tried with the one from "C:\Program Files\Aspose\Aspose.Pdf.Kit for .NET\Bin\net35" with the same results.
Please let me know if you need any other information.
I couldn’t reproduce the issue at my end, even though I tested with the specifications you mentioned. Can you please share two or three other sample files that show the same problem? In order to understand the issue and then resolve it, we need to reproduce it at our end.
After looking at some other PDF files, I'm basically seeing the character with code 65279 at the beginning of each page for any PDF (see another example attached).
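One way to confirm where the character with code 65279 (U+FEFF) appears is to scan the extracted string for it. The following is only a minimal sketch: `BomDiagnostics` and `FindBomPositions` are hypothetical names made up for illustration and are not part of Aspose.Pdf.Kit. It assumes `s` is the string produced by the snippet earlier in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Pdf.Kit.
public static class BomDiagnostics
{
    // U+FEFF has the decimal code 65279.
    public const char ByteOrderMark = '\uFEFF';

    // Returns the index of every U+FEFF character found in the text.
    public static List<int> FindBomPositions(string text)
    {
        List<int> positions = new List<int>();
        if (text == null)
        {
            return positions;
        }

        for (int i = 0; i < text.Length; i++)
        {
            // Plain char comparison, so no culture rules are involved.
            if (text[i] == ByteOrderMark)
            {
                positions.Add(i);
            }
        }
        return positions;
    }
}
```

Calling `BomDiagnostics.FindBomPositions(s)` and printing the indexes should show whether the character sits only at the start of each page's text or elsewhere as well.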
This issue is logged as PDFKITNET-11494 in our issue tracking system. Our team is looking into the matter and this issue will be resolved in our monthly release due at the end of November.
After replacing Aspose.Pdf.Kit with the latest version (3.9.0.0), I'm still getting the Unicode byte order mark at the beginning of the text extracted from any PDF page. I opened Windows Process Explorer while running our application and it shows the latest version of Aspose.Pdf.Kit (3.9.0.0). Is there something else I can check to make sure I'm using the correct DLL?
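Besides Process Explorer, the running application can report which assembly a type was actually loaded from, using standard .NET reflection. This is a sketch only: `LoadedAssemblyInfo` and `Describe` are hypothetical names introduced here, and passing `typeof(PdfExtractor)` assumes that is the type in use, as in the snippet earlier in this thread.

```csharp
using System;
using System.Reflection;

// Hypothetical helper; it relies only on standard .NET reflection.
public static class LoadedAssemblyInfo
{
    // Returns the version and the on-disk location of the assembly
    // that defines the given type.
    public static string Describe(Type type)
    {
        Assembly assembly = type.Assembly;
        return string.Format(
            "{0} loaded from {1}",
            assembly.GetName().Version,
            assembly.Location);
    }
}
```

For example, `Console.WriteLine(LoadedAssemblyInfo.Describe(typeof(PdfExtractor)));` would print the version number and the path of the DLL that the process resolved, which can then be compared with the expected install folder.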
Is it safe to assume that the Unicode BOM is going to be at the beginning of the text extracted from each page for **any** PDF? We may have to patch our code to ignore the BOM if the fix for this takes much longer.
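A patch of that kind could look like the sketch below. Because it is not confirmed that the BOM appears only at the start of each page, it removes every U+FEFF character wherever it occurs rather than assuming a position. `ExtractedTextCleaner` and `RemoveByteOrderMarks` are hypothetical names, not part of Aspose.Pdf.Kit.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Aspose.Pdf.Kit.
public static class ExtractedTextCleaner
{
    private const char ByteOrderMark = '\uFEFF'; // decimal 65279

    // Returns the text with every U+FEFF character removed.
    // All other characters, including \r and \n, are kept as-is.
    public static string RemoveByteOrderMarks(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        StringBuilder cleaned = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c != ByteOrderMark)
            {
                cleaned.Append(c);
            }
        }
        return cleaned.ToString();
    }
}
```

With the snippet from the first post, this would be applied right after decoding, for example `s = ExtractedTextCleaner.RemoveByteOrderMarks(s);`. If a later release stops emitting the BOM, the call simply becomes a no-op.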