TextAbsorber - How to get the rows without the leading spaces (formatting)?

cadoria · July 24, 2014, 1:15pm

Hello, I am using Aspose.Pdf (9.4.0). After calling "pdfDocument.Pages[page].Accept(textAbsorber);", I get the rows from the page with their respective spaces at the beginning of each line.

Is there any parameter to automatically remove this "formatting" of spaces at the beginning of each paragraph?

The code snippet that I use is this (page 1 of attachment):

for (int page = PaginaInicial; page <= PaginaFinal; page++)
{
    textAbsorber = new TextAbsorber();
    textAbsorber.TextSearchOptions.LimitToPageBounds = true;
    textAbsorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(0, 20, 300, 672);
    pdfDocument.Pages[page].Accept(textAbsorber);
    linha = textAbsorber.Text;
}

The following figure is attached for clarity.

tilal.ahmad · July 25, 2014, 11:35am

Hi Maria,

Thanks for your inquiry. I am afraid TextAbsorber does not currently support extracting text without formatting. However, we have logged an enhancement ticket, PDFNEWNET-37254, in our issue tracking system for this purpose. We will notify you as soon as it is resolved.

Moreover, please note that the TextDevice class supports extracting text without formatting. Please check the following sample code snippet. Hopefully, it will help you accomplish the task.

//open document
Document pdfDocument = new Document(myDir + "TRT_22-07-2014.pdf");

//string to hold extracted text
string extractedText;
System.Text.StringBuilder builder = new System.Text.StringBuilder();

using (MemoryStream textStream = new MemoryStream())
{
    //create text device
    TextDevice textDevice = new TextDevice();

    //set text extraction options - set text extraction mode (Raw or Pure)
    TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);

    textDevice.ExtractionOptions = textExtOptions;

    //convert a particular page and save text to the stream
    textDevice.Process(pdfDocument.Pages[1], textStream);

    //close memory stream
    textStream.Close();

    //get text from memory stream
    extractedText = Encoding.Unicode.GetString(textStream.ToArray());

    builder.Append(extractedText);
}

File.WriteAllText(myDir + "textdevice_Extracted_raw.txt", builder.ToString());

Please feel free to contact us for any further assistance.

Best Regards,

tilal.ahmad · August 11, 2014, 10:57pm

Hi Maria,

Thanks for your patience. Please note that TextAbsorber also supports the raw text extraction mode. Kindly use the feature as follows. It should help you accomplish the task.

TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.TextSearchOptions.LimitToPageBounds = true;
textAbsorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(0, 20, 300, 672);
pdfDocument.Pages[page].Accept(textAbsorber);
linha = textAbsorber.Text;
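
If some leading spaces still remain in the extracted text, they can also be trimmed afterwards with ordinary string handling. The following is only a minimal sketch under that assumption: it uses standard .NET string methods rather than any Aspose.Pdf API, and the class name TextCleanup and method name TrimLeadingSpaces are hypothetical names chosen for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of Aspose.Pdf): post-processes extracted text.
static class TextCleanup
{
    // Removes leading spaces and tabs from every line.
    // The order and the remaining content of the lines are left unchanged.
    public static string TrimLeadingSpaces(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Normalize Windows line endings so that the split is uniform.
        string[] lines = text.Replace("\r\n", "\n").Split('\n');

        // Trim only the start of each line and join the lines again.
        return string.Join(Environment.NewLine,
                           lines.Select(line => line.TrimStart(' ', '\t')));
    }
}
```

With the loop above, it could be applied as: linha = TextCleanup.TrimLeadingSpaces(textAbsorber.Text);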

Please feel free to contact us for any further assistance.

Best Regards,