Extract text based on columns - very slow

marcoaraujolsys · July 21, 2015, 8:45am

Hello,

I'm using Aspose.PDF version 10.6.0.0 and am trying to extract the text from every page of a 3 MB PDF that has multiple columns (only on some pages, and not always the same number of columns).

This PDF (attached) is one of the smallest I have; I also need to read others of around 30 MB.

I'm using the code below, but once execution enters the loop, it never comes out.

How can I solve this?

string path = "D:\\Temp\\";
InitLicense();

Document pdfDocument = new Document(path + "net_New-age NED's.pdf");
TextFragmentAbsorber tfa = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(tfa);
TextFragmentCollection tfc = tfa.TextFragments;

foreach (TextFragment tf in tfc)
{
    //need to reduce font size at least for 70%
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7f;
}

Stream st = new MemoryStream();
pdfDocument.Save(st);
pdfDocument = new Document(st);

TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;
textAbsorber.Visit(pdfDocument);

System.IO.File.WriteAllText(path + "Extracted.txt", extractedText);

tilal.ahmad · July 23, 2015, 12:02pm

Hi Marco,

Thanks for your inquiry. We have noticed the performance issue with text extraction from the shared PDF document and have logged it as ticket PDFNEWNET-39090 for further investigation and rectification. We will notify you as soon as it is resolved.

We are sorry for the inconvenience.

Best Regards,

asad.ali · December 27, 2018, 8:08pm

@marcoaraujolsys

Thanks for your patience.

Changing the font size of a large number of text fragments is a time-consuming operation. To improve formatted text extraction for text containing columns, we have introduced the TextExtractionOptions.ScaleFactor option. It applies a ‘virtual’ scaling (decrease) of the font size without actually changing the document. Its effect on the extraction result is almost equivalent to changing the font size, but it is much faster.

Please also note that ‘pdfDocument.Pages.Accept(textAbsorber);’ and ‘textAbsorber.Visit(pdfDocument);’ are two different ways of invoking the same operation (text extraction). Calling both is unnecessary and wastes processing time.
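As a minimal sketch of that point, the extraction step needs only one of the two calls. The file names below are placeholders, and the `using` directives assume a project that already references Aspose.PDF for .NET.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Placeholder input path; substitute your own document.
Document pdfDocument = new Document("input.pdf");

TextAbsorber textAbsorber = new TextAbsorber();

// Run the extraction once. Either call does the job; do not use both.
pdfDocument.Pages.Accept(textAbsorber);
// textAbsorber.Visit(pdfDocument);   // alternative way to trigger the same extraction

// Read the result after the single extraction pass.
string extractedText = textAbsorber.Text;
System.IO.File.WriteAllText("Extracted.txt", extractedText);
```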

Please consider the following code:

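// Time the whole extraction.
Stopwatch sw = new Stopwatch();
sw.Start();

// myDir is the folder that holds the input PDF; it must be defined before this snippet.
Document pdfDocument = new Document(myDir + "DOAL-03-07-2015.pdf");

// Virtually scale the font size down during extraction; the document itself is not modified.
TextExtractionOptions options = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
options.ScaleFactor = 0.5;

TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions = options;

// Extract once; a second textAbsorber.Visit(pdfDocument) call is not needed.
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;

System.IO.File.WriteAllText(myDir + "Extracted.txt", extractedText);

sw.Stop();
Console.WriteLine(sw.Elapsed.TotalSeconds);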

On our test system, the above code snippet took about 10 seconds to execute. The output files are attached for your reference.

OutputFiles.zip (529.7 KB)

Please use the above code snippet with Aspose.PDF for .NET 18.12, and if you need any further assistance, please feel free to let us know.