PDF text extraction shows garbage

m.pfeifer · November 14, 2017, 10:35am

Hello,

I’m using Aspose.Pdf.dll 17.11.0.0 to extract text from the attached PDF document, but the extracted text is garbage.

The code is simple:
// Requires: using System.IO;
Aspose.Pdf.Text.TextAbsorber pdfTxt = null;
Aspose.Pdf.Document pdf = null;
StreamWriter writer = null;

pdf = new Aspose.Pdf.Document(strPdfFile);
pdf.Flatten(); // replace form fields with their values
pdfTxt = new Aspose.Pdf.Text.TextAbsorber();
pdfTxt.ExtractionOptions.FormattingMode = Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure;
pdf.Pages.Accept(pdfTxt); // run the absorber over all pages

//Assert.IsTrue(pdfTxt.Text.Length > 0);
// Write the extracted text to the destination file, replacing any old copy
if (File.Exists(strDestFN)) File.Delete(strDestFN);
writer = new StreamWriter(strDestFN);
writer.Write(pdfTxt.Text);
writer.Close();
pdf.Dispose();

Best regards, Martin Pfeifer

imran.rafique · November 14, 2017, 2:12pm

@m.pfeifer,

There is no PDF document attached to your post. Please send us your source PDF so that we can investigate and share our findings with you.