Hi,
I’m trying to extract all text from a PDF with 3 columns.
I need the text to be extracted following the columns.
I’ve tried the following code:
TextAbsorber textAbsorber = new TextAbsorber();
pdfDoc.Pages.Accept(textAbsorber);
extractedText = textAbsorber.Text;
AND
PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(pdfDoc);
extractor.ExtractText(System.Text.Encoding.UTF8);
using (MemoryStream ms = new MemoryStream())
{
    extractor.GetText(ms);
    ms.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(ms);
    extractedText = sr.ReadToEnd();
}
The result was the same for both approaches: the extractor ignored the columns and extracted the text as if it were in a single column, so each line contained text from all 3 columns.
Is there an alternative for this scenario?
Attached are a screenshot of the result and the source PDF.
Thanks!
TESTE3.pdf (129.2 KB)
2018-02-27_15h29_44.png (33.1 KB)
@vinicius.carvalho
Thank you for contacting support.
The code snippets you shared extract all the text from a PDF file as plain text. Aspose.PDF does not return any structural information with it, such as tables, columns, or rows. If you want to extract text from a particular page region, please let us know. If your requirements are different, please describe them with screenshots so that we can guide you accordingly.
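If region-based extraction turns out to be what you need, a minimal sketch could look like the following. It assumes three equal-width columns spanning the full page, which is only a guess at the layout of TESTE3.pdf. The class name, method name, and column arithmetic are illustrative, so the rectangles would have to be adjusted to the real column boundaries.

```csharp
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ColumnTextSketch
{
    // Hypothetical helper: reads each page column by column, left to right.
    // Assumption: columns are equal-width and cover the whole page.
    public static string ExtractByColumns(string pdfPath, int columnCount)
    {
        Document pdfDocument = new Document(pdfPath);
        StringBuilder result = new StringBuilder();

        foreach (Page page in pdfDocument.Pages)
        {
            Aspose.Pdf.Rectangle pageRect = page.Rect;
            double columnWidth = pageRect.Width / columnCount;

            for (int i = 0; i < columnCount; i++)
            {
                // Restrict the absorber to the rectangle of the current column
                TextAbsorber absorber = new TextAbsorber();
                absorber.TextSearchOptions.LimitToPageBounds = true;
                absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(
                    pageRect.LLX + i * columnWidth,
                    pageRect.LLY,
                    pageRect.LLX + (i + 1) * columnWidth,
                    pageRect.URY);

                page.Accept(absorber);
                result.AppendLine(absorber.Text);
            }
        }

        return result.ToString();
    }
}
```

Calling ColumnTextSketch.ExtractByColumns("TESTE3.pdf", 3) would then return the text of column 1, then column 2, then column 3 for each page. Text that straddles a column boundary may be cut or duplicated, so this only suits pages with clear gutters between columns.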
Hi @Farhan.Raza,
Thanks for your response.
What I need is the plain text of the entire PDF, not of a specific region.
To be specific, I need the text to be extracted in the same order as when I press CTRL+A and CTRL+C in the PDF and then CTRL+V in Notepad - see the attached file.
2018-02-28_08h35_55.png (15.0 KB)
@vinicius.carvalho
Thank you for elaborating it further.
Please try the code snippet below in your environment. It extracts plain text instead of column-wise text. I have attached the generated TXT file for your reference: TESTE3_18.2.zip.
// Open document
Document pdfDocument = new Document(dataDir + "TESTE3.pdf");
// Create TextAbsorber object to extract text
//TextAbsorber textAbsorber = new TextAbsorber();
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);
// Get the extracted text
string extractedText = textAbsorber.Text;
// Write the extracted text to a UTF-8 file; the using block closes the writer even if writing fails
using (TextWriter tw = new StreamWriter(dataDir + "TESTE3_18.2.txt", false, System.Text.Encoding.UTF8))
{
    tw.WriteLine(extractedText);
}
Please let us know whether the generated file meets your requirements, or describe any problem you notice so that we can investigate further.