Extracting text from PDF with columns

Hi,

I’m trying to extract all text from a PDF with 3 columns.
I need the text to be extracted following the columns.

I’ve tried the following code:

TextAbsorber textAbsorber = new TextAbsorber();
pdfDoc.Pages.Accept(textAbsorber);
extractedText = textAbsorber.Text;

AND

PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(pdfDoc);
extractor.ExtractText(System.Text.Encoding.UTF8);
using (MemoryStream ms = new MemoryStream())
{
    extractor.GetText(ms);
    ms.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(ms);
    extractedText = sr.ReadToEnd();
}

The result for both approaches was the same: the extractor ignored the columns and extracted the text as if it were in a single column, so each line contained information from all 3 columns.

Is there an alternative for this scenario?

Attached are a screenshot of the result and the source PDF.

Thanks!

TESTE3.pdf (129.2 KB)
2018-02-27_15h29_44.png (33.1 KB)

@vinicius.carvalho

Thank you for contacting support.

With the code snippet you shared, Aspose.PDF extracts all the text from a PDF file as plain text. It does not extract structural information such as tables, columns, or rows. Please let us know if you want to extract text from a particular page region instead. If your requirements are different, please elaborate with the help of screenshots so that we can guide you accordingly.
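For reference, region-based extraction could look roughly like the sketch below. It is not a confirmed solution for this document: it assumes the page is made of three equal-width columns (the ColumnCount constant and the equal split are assumptions, not values read from the file) and that the file is named TESTE3.pdf in the working directory. Each column rectangle is passed to a separate TextAbsorber through TextSearchOptions.Rectangle, and the results are appended column by column.

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ColumnExtractionSketch
{
    // Assumption: the layout has three columns of equal width.
    const int ColumnCount = 3;

    static void Main()
    {
        // Assumption: the source file sits in the working directory.
        Document pdfDocument = new Document("TESTE3.pdf");
        StringBuilder result = new StringBuilder();

        foreach (Page page in pdfDocument.Pages)
        {
            Aspose.Pdf.Rectangle bounds = page.Rect;
            double columnWidth = bounds.Width / ColumnCount;

            for (int i = 0; i < ColumnCount; i++)
            {
                double left = bounds.LLX + i * columnWidth;

                // Restrict the absorber to the rectangle covering column i.
                TextAbsorber columnAbsorber = new TextAbsorber();
                columnAbsorber.TextSearchOptions.LimitToPageBounds = true;
                columnAbsorber.TextSearchOptions.Rectangle =
                    new Aspose.Pdf.Rectangle(left, bounds.LLY, left + columnWidth, bounds.URY);

                page.Accept(columnAbsorber);
                result.AppendLine(columnAbsorber.Text);
            }
        }

        Console.WriteLine(result.ToString());
    }
}
```

If the real columns are not equal in width, the rectangle coordinates would need to be measured from the page and set by hand.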

Hi @Farhan.Raza,

Thanks for your response.
What I need is the plain text of all the content in the PDF (not a specific region).

To be specific, I need the text to be extracted in the same format I get by pressing Ctrl+A and Ctrl+C in the PDF and then Ctrl+V in Notepad. See the attached file.

2018-02-28_08h35_55.png (15.0 KB)

@vinicius.carvalho

Thank you for elaborating further.

Please try the code snippet below in your environment. It passes TextFormattingMode.MemorySaving to the TextAbsorber and extracts plain text instead of column-wise text. I have attached the generated TXT file for your reference: TESTE3_18.2.zip.

       // Open document
       Document pdfDocument = new Document(dataDir + "TESTE3.pdf");

       // Create TextAbsorber object to extract text
       //TextAbsorber textAbsorber = new TextAbsorber();
       TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
       // Accept the absorber for all the pages
       pdfDocument.Pages.Accept(textAbsorber);
       // Get the extracted text
       string extractedText = textAbsorber.Text;
       // Write the extracted text to a UTF-8 file; the using block closes the stream
       using (TextWriter tw = new StreamWriter(dataDir + "TESTE3_18.2.txt", false, System.Text.Encoding.UTF8))
       {
           tw.WriteLine(extractedText);
       }

Please let us know whether the generated file meets your requirements, or describe any problem you notice so that we can investigate further.