Hi Allan,
Thanks for your patience.
Changing the font size of a multitude of text fragments is a time-consuming operation. To improve formatted text extraction for text that contains columns, we have introduced the TextExtractionOptions.ScaleFactor option.
It performs a ‘virtual’ scaling (decrease) of the font size without actually changing the document. Its effect on the text extraction result is almost equivalent to changing the font size, but it works much faster.
Please also note that ‘pdfDocument.Pages.Accept(textAbsorber);’ and ‘textAbsorber.Visit(pdfDocument);’ are two different ways of calling the same function (text extraction). Using both of them is unnecessary and wastes processing time.
Please consider the following code:
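// Time the whole run: loading the PDF, extracting the text and writing the output.
Stopwatch sw = new Stopwatch();
sw.Start();
// myDir is the path prefix of the folder containing the PDF (defined elsewhere).
Document pdfDocument = new Document(myDir + "DOAL-03-07-2015.pdf");
// Pure formatting mode with 'virtual' font scaling; the document itself is not changed.
TextExtractionOptions options = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
options.ScaleFactor = 0.5;
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions = options;
// A single Accept call performs the extraction; do not also call Visit.
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;
System.IO.File.WriteAllText(myDir + "Extracted.txt", extractedText);
sw.Stop();
Console.WriteLine(sw.Elapsed.TotalSeconds);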
On our test system, the above code snippet took ~10 seconds to execute. The output files are attached for your reference.
OutputFiles.zip (529.7 KB)
Please use the above code snippet with Aspose.PDF for .NET 18.12. If you need any further assistance, please feel free to let us know.