Hi Maria,
Thanks for your inquiry. I am afraid TextAbsorber does not currently support extracting text without formatting. However, we have logged an enhancement ticket, PDFNEWNET-37254, in our issue tracking system for this purpose. We will notify you as soon as it is resolved.
Meanwhile, please note that the TextDevice class does support extracting text without formatting. Please see the sample code snippet below; hopefully it will help you accomplish the task.
// Open the document
Document pdfDocument = new Document(myDir + "TRT_22-07-2014.pdf");
// String to hold the extracted text
string extractedText;
System.Text.StringBuilder builder = new System.Text.StringBuilder();
using (MemoryStream textStream = new MemoryStream())
{
    // Create the text device
    TextDevice textDevice = new TextDevice();
    // Set the text extraction mode (Raw or Pure)
    TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
    textDevice.ExtractionOptions = textExtOptions;
    // Convert a particular page (page numbers are 1-based) and save its text to the stream
    textDevice.Process(pdfDocument.Pages[1], textStream);
    // Get the text from the memory stream; the using block disposes the stream afterwards
    extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    builder.Append(extractedText);
}
File.WriteAllText(myDir + "textdevice_Extracted_raw.txt", builder.ToString());
Please feel free to contact us for any further assistance.
Best Regards,
Hi Maria,
Thanks for your patience. Please note that TextAbsorber also supports the raw text extraction mode. Kindly use the feature as shown below; it should help you accomplish the task.
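// Create a text absorber that uses the raw extraction mode
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
// Restrict extraction to the given rectangle on the page
textAbsorber.TextSearchOptions.LimitToPageBounds = true;
textAbsorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(0, 20, 300, 672);
// pdfDocument, page and linha are the variables from your own code
pdfDocument.Pages[page].Accept(textAbsorber);
linha = textAbsorber.Text;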
Please feel free to contact us for any further assistance.
Best Regards,