PdfExtractor extractText takes a long time

claytongutierrez · July 18, 2018, 6:23pm

Hi,

In two of our production environments, at two different customers, PDF text extraction is taking a long time (5 - 20 minutes, some maybe longer; in one instance 45 minutes).
The time appears to be consumed by a call to the extractText method of the com.aspose.pdf.facades.PdfExtractor class.

I currently have 8 example PDFs:
CON2148000.pdf: PDF document, version 1.7, 417762 bytes
CON2163292.pdf: PDF document, version 1.6, 329696 bytes, 431 seconds
Credit.pdf: PDF document, version 1.6, 99827 bytes, 2791826ms (46 minutes)
BOR-GC-Checklist-CON2065161.pdf: PDF document, version 1.6, 1827922 bytes
Checklist-2-CON2147021.pdf: PDF document, version 1.7, 653901 bytes
Checklist-CON2147021.pdf: PDF document, version 1.7, 65300 bytes
CON2163292.pdf: PDF document, version 1.6, 329696 bytes
LBK-Contract.pdf: PDF document, version 1.7, 510885 bytes

I’m currently getting permission from one or both customers to send Aspose some of these samples.

One customer’s system is a multi-user, multi-client system and has logging for timings.

The other customer’s system is a multi-user, single-client system, but its logging for timings is a little different (we may need to alter this to ease investigation).

We have been evaluating all sorts of things to try to understand what is occurring on the system at the time these issues arise, but have not found anything suspicious to date.

I believe we are running jdk1.7.0_80 but will need to confirm this. The aspose jar is aspose.pdf-17.9.jar.

A code sample:
PdfExtractor extractor = new PdfExtractor();

// Decrypt password-protected input in memory before binding it
if (!NonNullString.isEmpty(_password)) {
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    PdfFileSecurity security = new PdfFileSecurity(new ByteArrayInputStream(_input), output);
    security.decryptFile(_password);
    _input = output.toByteArray();
    output.close();
}

// Time bindPdf
getStart = System.currentTimeMillis();
extractor.bindPdf(new ByteArrayInputStream(_input));
getCompleted = System.currentTimeMillis();

LOG.info("execute: bindPdf completed. " + "Elapsed time: " + ((getCompleted - getStart) / 1000) + " seconds.");

// Time extractText (this is the slow call)
getStart = System.currentTimeMillis();
extractor.extractText();
getCompleted = System.currentTimeMillis();

LOG.info("execute: extractText completed. " + "Elapsed time: " + ((getCompleted - getStart) / 1000) + " seconds.");

// Time getText
getStart = System.currentTimeMillis();
_outputText = getText(extractor);
getCompleted = System.currentTimeMillis();

LOG.info("execute: getText completed. " + "Elapsed time: " + ((getCompleted - getStart) / 1000) + " seconds.");

extractor.close();

Have you any suggestions as to what I can try to do to further the investigation?

Clayton
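
One way to narrow down where a slow extractText call spends its time is to sample the stack of the thread running it while it is still busy. The following is only a minimal sketch under assumptions: SlowCallSampler and runAndSample are hypothetical names, not part of Aspose.PDF, and the code uses only the JDK (written without lambdas so it should compile on Java 7).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper (not an Aspose API): samples the stack of a slow task.
class SlowCallSampler {

    /**
     * Runs the task on a worker thread. Every time the task is still running
     * after another sampleEveryMillis, the worker's current stack trace is
     * appended to the log. Returns the task's result once it completes.
     */
    static <T> T runAndSample(final Callable<T> task, long sampleEveryMillis, StringBuilder log) {
        final AtomicReference<Thread> worker = new AtomicReference<Thread>();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = pool.submit(new Callable<T>() {
                @Override
                public T call() throws Exception {
                    // Remember which thread is doing the work so it can be sampled
                    worker.set(Thread.currentThread());
                    return task.call();
                }
            });
            long waited = 0;
            while (true) {
                try {
                    return future.get(sampleEveryMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException stillRunning) {
                    waited += sampleEveryMillis;
                    log.append("still running after ~").append(waited).append(" ms\n");
                    Thread t = worker.get();
                    if (t != null) {
                        for (StackTraceElement frame : t.getStackTrace()) {
                            log.append("    at ").append(frame).append('\n');
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new RuntimeException("task failed", e.getCause());
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In the code sample above, the task passed in would wrap the extractor.extractText() call, with a sampling interval of perhaps 30 seconds. If the same frames keep showing up in the log, that points at the code path consuming the time.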

claytongutierrez · July 18, 2018, 7:08pm

Correction - the JDK in use is 1.7.0_25.

Farhan.Raza · July 18, 2018, 8:38pm

@claytongutierrez

Thank you for contacting support.

Please upgrade to Aspose.PDF for Java 18.6 and try increasing the Java heap size with the JVM's -Xmx parameter, then share your observations with the latest version of the API. Please also share your environment details for our reference (OS, memory, etc.). Furthermore, you can also extract text from a PDF document with different DOM approaches, as explained in Extract Text from PDF.

If the problem persists, please create a narrowed-down sample application along with the source files so that we can try to reproduce and investigate it in our environment.
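
To collect the environment details requested above, and to check that a larger -Xmx value actually took effect, something like the following could be run inside the affected JVM. This is a sketch only: EnvironmentReport and its key names are hypothetical, and it relies solely on standard JDK system properties and Runtime calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not an Aspose API): gathers JVM, OS and heap details.
class EnvironmentReport {

    /** Returns environment details in a stable, readable order. */
    static Map<String, String> collect() {
        Map<String, String> info = new LinkedHashMap<String, String>();
        info.put("java.version", System.getProperty("java.version"));
        info.put("java.vendor", System.getProperty("java.vendor"));
        info.put("os.name", System.getProperty("os.name"));
        info.put("os.version", System.getProperty("os.version"));
        info.put("os.arch", System.getProperty("os.arch"));

        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        // maxMemory() is the most the JVM will try to use; it tracks -Xmx
        info.put("heap.max.mb", String.valueOf(rt.maxMemory() / mb));
        // totalMemory() is the heap currently reserved, freeMemory() the unused part of it
        info.put("heap.total.mb", String.valueOf(rt.totalMemory() / mb));
        info.put("heap.free.mb", String.valueOf(rt.freeMemory() / mb));
        info.put("processors", String.valueOf(rt.availableProcessors()));
        return info;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : collect().entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Logging this once at startup in each production environment would also settle which JDK build is really in use.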

claytongutierrez · July 24, 2018, 5:55pm

Here are 5 sample files that were slow when extracting the PDF text:
Dr. A_Redacted.pdf (50.6 KB)
Dr. Am_Redacted.pdf (50.7 KB)
Dr. Co.pdf (56.1 KB)
Dr. Gh.pdf (58.4 KB)
Dr. G.pdf (58.1 KB)

Farhan.Raza · July 24, 2018, 9:49pm

@claytongutierrez

We tried the code snippet below, as described in Extract Text from PDF, and noticed that four of the attached files are converted to text within 2 seconds each, while Dr. A_Redacted.pdf takes 5 seconds.

long start = System.currentTimeMillis();
String file = "Dr. Gh";

// Open document
Document pdfDocument = new Document(dataDir + file + ".pdf");

// Create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();

// Accept the absorber for all the pages
pdfDocument.getPages().accept(textAbsorber);

// Get the extracted text
String extractedText = textAbsorber.getText();

// Create a writer, open the output file, and write the extracted text to it
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File(dataDir + file + ".txt"));
writer.write(extractedText);

// Close the stream
writer.close();

long completed = System.currentTimeMillis();
System.out.println("Elapsed time: " + ((completed - start) / 1000) + " seconds.");

Please make sure you are using Aspose.PDF for Java 18.6 in your environment, and then share your feedback with us.
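
The snippet above times a single file and reports whole seconds, so integer division turns 1.9 seconds into 1. To compare several sample files on both sides, a small harness along these lines may help. It is a sketch under assumptions: ExtractionTimer and its Extractor callback are hypothetical names, and the extraction code itself (TextAbsorber or PdfExtractor) would be supplied by the caller.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical harness (not an Aspose API): times one extraction per file.
class ExtractionTimer {

    /** Hypothetical callback; the TextAbsorber or PdfExtractor code goes here. */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /**
     * Runs the extractor once per path and returns the elapsed time in
     * milliseconds for each path, in input order. A failing file is reported
     * on stderr and still timed, so one bad PDF does not stop the run.
     */
    static Map<String, Long> timeAll(List<String> paths, Extractor extractor) {
        Map<String, Long> elapsedMillis = new LinkedHashMap<String, Long>();
        for (String path : paths) {
            // nanoTime is monotonic, so it is not affected by wall-clock changes
            long start = System.nanoTime();
            try {
                extractor.extract(path);
            } catch (Exception e) {
                System.err.println("extraction failed for " + path + ": " + e);
            }
            elapsedMillis.put(path, (System.nanoTime() - start) / 1000000L);
        }
        return elapsedMillis;
    }
}
```

Running the same list of files through both the TextAbsorber approach and the PdfExtractor approach, on the same machine and JVM settings, would show whether the slowdown follows the files, the API, or the environment.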