Free Support Forum - aspose.com

TextAbsorber - How to get the rows without the leading spaces (formatting)?

Hello, I am using Aspose.Pdf (9.4.0). After calling "pdfDocument.Pages[page].Accept(textAbsorber);", I get the rows from the page with their original spaces at the beginning of each line.

Is there any parameter to automatically remove this "formatting" of spaces at the beginning of each paragraph?

The code snippet that I use is this (page 1 of attachment):
for (int page = PaginaInicial; page <= PaginaFinal; page++)
{
    textAbsorber = new TextAbsorber();
    textAbsorber.TextSearchOptions.LimitToPageBounds = true;
    textAbsorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(0, 20, 300, 672);
    pdfDocument.Pages[page].Accept(textAbsorber);
    linha = textAbsorber.Text;
}

Following figure for clarity.

Hi Maria,


Thanks for your inquiry. I am afraid TextAbsorber currently does not support extracting text without formatting. However, we have logged an enhancement ticket, PDFNEWNET-37254, in our issue tracking system for this purpose. We will notify you as soon as it is resolved.

In the meantime, please note that the TextDevice class does support extracting text without formatting. Please check the following sample code snippet; hopefully it will help you accomplish the task.

// open document
Document pdfDocument = new Document(myDir + "TRT_22-07-2014.pdf");

// string to hold extracted text
string extractedText;
System.Text.StringBuilder builder = new System.Text.StringBuilder();

using (MemoryStream textStream = new MemoryStream())
{
    // create text device
    TextDevice textDevice = new TextDevice();

    // set text extraction options - set text extraction mode (Raw or Pure)
    TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
    textDevice.ExtractionOptions = textExtOptions;

    // convert a particular page and save text to the stream
    textDevice.Process(pdfDocument.Pages[1], textStream);

    // close memory stream
    textStream.Close();

    // get text from memory stream
    extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    builder.Append(extractedText);
}

File.WriteAllText(myDir + "textdevice_Extracted_raw.txt", builder.ToString());

Please feel free to contact us for any further assistance.


Best Regards,

Hi Maria,


Thanks for your patience. Please note that TextAbsorber also supports the raw text extraction mode. Kindly use the feature as follows; it should help you accomplish the task.

TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.TextSearchOptions.LimitToPageBounds = true;
textAbsorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(0, 20, 300, 672);
pdfDocument.Pages[page].Accept(textAbsorber);
linha = textAbsorber.Text;
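
If some leading whitespace still remains in the extracted text, it can also be stripped after extraction with plain .NET string handling. The following is only a minimal sketch: ExtractedTextCleaner and TrimLineStarts are hypothetical names used for illustration and are not part of Aspose.Pdf, and the sketch assumes that only spaces and tabs at the start of each line need to be removed.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of Aspose.Pdf): post-processes extracted text.
public static class ExtractedTextCleaner
{
    // Removes leading spaces and tabs from every line of the given text.
    public static string TrimLineStarts(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        // Split on both Windows and Unix line endings so no stray '\r' is left behind.
        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        // Trim only the start of each line; spacing inside and after the text is kept.
        return string.Join(Environment.NewLine, lines.Select(line => line.TrimStart(' ', '\t')));
    }
}
```

It could then be applied to the absorber output, for example: linha = ExtractedTextCleaner.TrimLineStarts(textAbsorber.Text);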

Please feel free to contact us for any further assistance.


Best Regards,