Extracting text from a multi-column PDF is very slow

AllanKodak · July 23, 2015, 7:25am

Hello,

I'm using Aspose.PDF version 10.6.0.0 and I am trying to extract the text from every page of a 3 MB PDF that has multiple columns (on a few pages, and not always the same number of columns).

This PDF (attached) is one of the smallest I have; I also need to read others of around 30 MB.

I'm using the code below, but once execution enters the loop, it never comes out.

How can I solve this?

string path = "D:\\Temp\\";
InitLicense();
Document pdfDocument = new Document(path + "net_New-age NED's.pdf");

TextFragmentAbsorber tfa = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(tfa);
TextFragmentCollection tfc = tfa.TextFragments;

foreach (TextFragment tf in tfc)
{
    //need to reduce font size at least for 70%
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7f;
}

Stream st = new MemoryStream();
pdfDocument.Save(st);
pdfDocument = new Document(st);

TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;
textAbsorber.Visit(pdfDocument);

System.IO.File.WriteAllText(path + "Extracted.txt", extractedText);

tilal.ahmad · July 23, 2015, 12:12pm

Hi Allan,

Thanks for your inquiry. In another similar query, we have already noticed the issue and logged a ticket PDFNEWNET-39090 for investigation and rectification. We will notify you as soon as we resolve the issue.

We are sorry for the inconvenience caused.

Best Regards,

asad.ali · December 27, 2018, 8:08pm

@AllanKodak

Thanks for your patience.

Changing the font size of a large number of text fragments is a time-consuming operation. To improve formatted text extraction results for text containing columns, we have introduced the TextExtractionOptions.ScaleFactor option. It performs a ‘virtual’ scaling (reduction) of the font size without actually changing the document. Its effect on the text extraction result is almost equivalent to changing the font size, but it works much faster.

Please also note that ‘pdfDocument.Pages.Accept(textAbsorber);’ and ‘textAbsorber.Visit(pdfDocument);’ are two different ways of invoking the same operation (text extraction). Using both is unnecessary and wastes processing time.

Please consider the following code:

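// Assumes: using System; using System.Diagnostics; using Aspose.Pdf; using Aspose.Pdf.Text;
// myDir is the folder containing the input PDF (not defined in this snippet).
Stopwatch sw = new Stopwatch();
sw.Start();

Document pdfDocument = new Document(myDir + "DOAL-03-07-2015.pdf");

// Scale font sizes virtually during extraction instead of editing each TextFragment.
TextExtractionOptions options = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
options.ScaleFactor = 0.5;

TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions = options;

// A single Accept call performs the extraction; no additional Visit call is needed.
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;

System.IO.File.WriteAllText(myDir + "Extracted.txt", extractedText);

sw.Stop();
Console.WriteLine(sw.Elapsed.TotalSeconds);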

On our test system, the above code snippet took ~10 seconds to execute. Output files are also attached for your reference.

OutputFiles.zip (529.7 KB)

Please use the above code snippet with Aspose.PDF for .NET 18.12, and if you need any further assistance, please feel free to let us know.