Adobe Reader shows 3 pages, but Aspose.Pdf extracts 6 pages and can't extract text

Hi,

I couldn't extract the text of the attached PDF. I get this exception:

System.NullReferenceException: Object reference not set to an instance of an object.
   at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
   at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)

Also, when I tried to extract the pages using Aspose.Pdf.Facades.PdfExtractor, it extracted 6 pages instead of 3.

Could you help us with this?

Regards,

Yassine

Hi Yassine,


Thanks for your inquiry. While testing the scenario with Aspose.Pdf for .NET 9.1.0, we managed to reproduce the text extraction issue and logged it as PDFNEWNET-36739 in our issue tracking system for further investigation and resolution. We will notify you via this forum thread as soon as it is resolved.

However, we are unable to replicate the issue of 6 pages being extracted. We have tested the scenario with both the DOM and Facades packages. Please share your sample code that reproduces the issue so that we can investigate it further.

We are sorry for the inconvenience caused.

Best Regards,

Hi,

Here is my code for the extraction :

using System;
using System.Text;
using DocumentContentExtractor.OCR;

namespace DocumentOcr.Parsers
{
    public class PdfDocumentOcr : IContentRecognizer
    {
        static PdfDocumentOcr()
        {
            // Apply the Aspose.Pdf license once per process.
            var pdfKitlicense = new Aspose.Pdf.License();
            pdfKitlicense.SetLicense(AsposeLicense.Instance.ToStream());
        }

        public string GetText(string filename)
        {
            var pdfExtractor = new Aspose.Pdf.Facades.PdfExtractor();
            var pdfFile = new System.IO.StreamReader(filename).BaseStream;
            pdfExtractor.BindPdf(pdfFile);

            var ret = new StringBuilder("");

            pdfExtractor.Resolution = 20;
            // Extracts the images embedded in the document.
            pdfExtractor.ExtractImage();
            int pageNumber = 1;
            try
            {
                while (pdfExtractor.HasNextImage())
                {
                    var MS = new System.IO.MemoryStream();
                    pdfExtractor.GetNextImage(MS);
                    var bitmap = new System.Drawing.Bitmap(MS);
                    // One file is written per extracted image.
                    bitmap.Save(string.Format("e:\\CrazyPdf{0}.jpeg", pageNumber++));
                    //ret.Append(" ").Append(BitmapToText.GetText(bitmap));
                    bitmap.Dispose();
                    MS.Close();
                    MemoryManagement.FlushMemory();
                }
            }
            catch (ArgumentException)
            {
            }
            catch (OutOfMemoryException)
            {
            }
            MemoryManagement.FlushMemory();

            return ret.ToString();
        }
    }
}

Regards,

Yassine

Hi Yassine,


Thanks for sharing the sample code. Your code extracts the images embedded in the source document, not its pages, and the document contains 6 images. The pageNumber variable in your code actually counts images, so the image count is being confused with the number of pages in the document. You can verify the number of images in Adobe Acrobat via Tools > Document Processing > Export All Images. We will keep you updated on the progress of the original text extraction issue.
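To illustrate the difference between pages and embedded images, here is a minimal sketch that renders each page to a JPEG through the DOM API instead of pulling out embedded images. The class name, method name, and output file pattern are hypothetical and chosen for illustration only. The image total is summed from each page's resources, so it may differ from what PdfExtractor reports if an image is shared between pages.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

// Hypothetical helper, not part of Aspose.Pdf.
public static class PageRenderingSketch
{
    public static void RenderPages(string pdfPath, string outputDirectory)
    {
        var document = new Document(pdfPath);
        int imageCount = 0;

        // Page indexes in Aspose.Pdf start at 1.
        for (int pageIndex = 1; pageIndex <= document.Pages.Count; pageIndex++)
        {
            Page page = document.Pages[pageIndex];

            // Images referenced by this page's resources (assumption: shared
            // images are counted once per page that references them).
            imageCount += page.Resources.Images.Count;

            // Render the whole page, not just its embedded images.
            var device = new JpegDevice(new Resolution(20));
            string target = Path.Combine(
                outputDirectory, string.Format("Page{0}.jpeg", pageIndex));
            using (var output = new FileStream(target, FileMode.Create))
            {
                device.Process(page, output);
            }
        }

        Console.WriteLine("Pages: {0}, embedded images: {1}",
            document.Pages.Count, imageCount);
    }
}
```

With this approach the number of output files follows document.Pages.Count, so a 3-page document yields 3 JPEG files regardless of how many images it embeds.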

Please feel free to contact us for any further assistance.

Best Regards,

The issues you have found earlier (filed as PDFNEWNET-36739) have been fixed in Aspose.Pdf for .NET 9.3.0.
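After upgrading, the text extraction path shown in the original stack trace can be exercised with a sketch along these lines. The class and method names are hypothetical, and the choice of Encoding.Unicode is an assumption rather than something stated in this thread.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf.Facades;

// Hypothetical helper, not part of Aspose.Pdf.
public static class TextExtractionSketch
{
    public static string ExtractAllText(string pdfPath)
    {
        var extractor = new PdfExtractor();
        extractor.BindPdf(pdfPath);
        try
        {
            // This is the call that threw NullReferenceException in 9.1.0.
            extractor.ExtractText(Encoding.Unicode);

            using (var buffer = new MemoryStream())
            {
                // Copy the extracted text into the stream, then decode it
                // with the same encoding used for extraction.
                extractor.GetText(buffer);
                return Encoding.Unicode.GetString(buffer.ToArray());
            }
        }
        finally
        {
            extractor.Close();
        }
    }
}
```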

Blog post for this release can be viewed over this link


