Extract text from PDF document in C# with Aspose.PDF - API is taking long time

Simon_Karpen · December 4, 2018, 2:33am

Hello,

This is a 3 MB document, and the PDF library is taking over 120 seconds to process the text tags in it. Could you please check why text tag extraction is so slow on this document?

Thank you,
Anupam

Farhan.Raza · December 4, 2018, 11:28am

@Simon_Karpen

Thank you for contacting support.

Would you please share a narrowed-down code snippet so that we can try to reproduce and investigate the issue in our environment? Please make sure you are using Aspose.PDF for .NET 18.11.

Simon_Karpen · December 4, 2018, 11:22pm

Hello Farhan,

I updated to the latest version of the .NET library and tried to process the document again, but the result is the same: it still takes around 120 seconds. Here is the code snippet. I basically pass a regular expression into Text.TextFragmentAbsorber and get all the matches.

I have shared the document; here is the code:

public IEnumerable<InviteTag> GetFieldsInvites(String filename)
{
    String EMAIL_PATTERN = @"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";

    String MYTEXT = "{?{?t:e;o:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];e:[\x22\u201C\u201D\u201E]" + EMAIL_PATTERN + "[\x22\u201C\u201D\u201E];?(order:[0-9]+;)?}?}";

    //open document
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filename);
    //create TextAbsorber object to find all the phrases matching the regular expression
    Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(MYTEXT);

    //set text search option to specify regular expression usage
    Aspose.Pdf.Text.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    //accept the absorber for all the pages
    pdfDocument.Pages.Accept(textFragmentAbsorber);
    //get the extracted text fragments
    Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    //loop through the fragments
    foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
    {
        //Break it down & form the json data
        InviteTag mInviteTag = new InviteTag(textFragment.Text);
        mInviteTags.Add(mInviteTag);
    }
    //loop through the fragments
    foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
    {
        //update text and other properties
        var tabs = new StringBuilder();
        //tabs.Append(' ', MYTEXT.Length + 20);
        tabs.Append(' ', textFragment.Text.Length + 20);
        textFragment.Text = tabs.ToString();
    }
    pdfDocument.Save(root + "/" + id + ".pdf");
    return mInviteTags;
}

I am using other regular expressions too that I can share if needed. Let me know if you need any other info.

Thank you,
Anupam
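
One way to check whether the regular expression itself contributes to the delay is to run the pattern from the snippet above against plain text, outside the PDF library. The following is only a sketch: the class name `TagPatternCheck`, the helper `FindFirstTag`, and the sample text are made up for illustration, and it assumes the absorber evaluates the pattern in much the same way as `System.Text.RegularExpressions` does.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class TagPatternCheck
{
    // Patterns copied from the snippet above.
    private const string EmailPattern = @"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";
    private const string TagPattern = "{?{?t:e;o:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];e:[\x22\u201C\u201D\u201E]" + EmailPattern + "[\x22\u201C\u201D\u201E];?(order:[0-9]+;)?}?}";

    // Returns the first tag found in the plain text, or null when nothing matches.
    public static string FindFirstTag(string text)
    {
        Match match = Regex.Match(text, TagPattern);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        // Hypothetical sample; real input would be text extracted from the PDF.
        string sample = "Please sign here: {{t:e;o:\"Signer One\";e:\"signer@example.com\";}} Thank you.";

        var stopwatch = Stopwatch.StartNew();
        string tag = FindFirstTag(sample);
        stopwatch.Stop();

        Console.WriteLine("Match: " + (tag ?? "<none>"));
        Console.WriteLine("Regex time (ms): " + stopwatch.ElapsedMilliseconds);
    }
}
```

If the pattern matches quickly on representative text, the time is more likely spent in the PDF processing than in the expression.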

Farhan.Raza · December 5, 2018, 10:02am

@Simon_Karpen

Thank you for elaborating further.

We are afraid the shared code snippet cannot be compiled because the InviteTag class is missing. Kindly share SSCCE code or, better, a narrowed-down sample application so that we can reproduce your issue and address your concerns accordingly.

Simon_Karpen · February 19, 2019, 3:28pm

Please find the code below:

using (var pdf = new Document(@"random.pdf"))
{
    var RegexPattern = "{?{?t:[sitdcfcr];r:[yn];o:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];(l:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];)?(c:[0-9]+;)?(i:[0-9]+;)?(dd:[\x22\u201C\u201D\u201E][\\w\\s]+?,?.+?[\x22\u201C\u201D\u201E];)?(w:[0-9]+;)?(h:[0-9]+;)?(v:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];)?(rn:[\x22\u201C\u201D\u201E][\\w\\s-]+[\x22\u201C\u201D\u201E];)?(rv:[\x22\u201C\u201D\u201E][\\w\\s-]+[\x22\u201C\u201D\u201E];)?(checked:[0-9]+;)?}?}";
    var textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(RegexPattern);
    var textSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    pdf.Pages.Accept(textFragmentAbsorber);
    //get the extracted text fragments
    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = "".PadLeft(textFragment.Text.Length);
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
    Console.ReadLine();
}

Farhan.Raza · February 19, 2019, 8:24pm

@Simon_Karpen

Thank you for getting back to us.

We have tested the code snippet you shared. However, it does not reproduce the problem with the PDF document you shared earlier. Kindly create a narrowed-down sample application so that we can help you further. Before sharing the requested data, please make sure you are using Aspose.PDF for .NET 19.2 in your environment.

Simon_Karpen · February 20, 2019, 9:58am

@Farhan.Raza
Thanks for your reply. I’m not sure what exactly you are unable to reproduce, but I guess you get a processing time of less than 120 seconds. I think the problem was not described clearly in the previous post, so I’ll try to clarify.
According to the requirements, I implemented the code you can see in the previous post. It should find text by regular expression and clear it while preserving the empty space. The first idea was to replace the text with spaces. For most documents this works well, but for some PDFs it is very slow (on my laptop it takes about 55 seconds). The slowest part is textFragment.Text = "".PadLeft(textFragment.Text.Length); which takes 98% of the CPU time. So the main question is: how can we improve the speed?
It looks like something in the document causes the slow processing. Maybe there is a way to quickly convert it to a simpler format, or to use another approach to clear the text fragments.

I tried 19.2, but it is even slower: it took 10 seconds longer to complete my processing.
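
Since the stopwatch in the earlier snippet covers only the replacement loop, timing each phase separately may make results easier to compare between environments. The sketch below shows one way to do that; the `PhaseTimer` class and its `Measure` helper are hypothetical names, and the sleeping actions are placeholders for the real search and replace steps.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class PhaseTimer
{
    // Runs the action once and returns the elapsed wall-clock time in milliseconds.
    public static long Measure(string label, Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        Console.WriteLine(label + ": " + stopwatch.ElapsedMilliseconds + " ms");
        return stopwatch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Placeholders: in the real program the first action would wrap
        // pdf.Pages.Accept(textFragmentAbsorber) and the second the loop
        // that assigns textFragment.Text.
        Measure("search", () => Thread.Sleep(20));
        Measure("replace", () => Thread.Sleep(50));
    }
}
```

Reporting both numbers alongside the library version would show whether the time goes into finding the fragments or into rewriting them.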

Farhan.Raza · February 20, 2019, 8:11pm

@Simon_Karpen

Thank you for elaborating further.

We are afraid that the shared code snippet prints zero for stopwatch.ElapsedMilliseconds on the console when run against the PDF document you shared in the very first post of this topic. That is why we requested a narrowed-down sample application along with the respective PDF document, so that we can assist you further.

Moreover, you may try iterating through the pages and performing the operation on each page, and then share your feedback with us.

foreach (Page page in pdf.Pages)
{
    page.Accept(textFragmentAbsorber);
    //get the extracted text fragments
    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = "".PadLeft(textFragment.Text.Length);
    }
    stopwatch.Stop();
    Console.WriteLine("Time Consumed: " + stopwatch.ElapsedMilliseconds);
}

Simon_Karpen · February 21, 2019, 8:55am

Weird… Please find the archived solution attached, along with the document. NOTE: the license file is not included.
I tried the per-page approach, but it is slower than the original one.
I also found a way to make the solution two times faster with this code:

foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment segment in textFragment.Segments)
    {
        segment.Text = "";
    }
}

The code runs in 30 seconds instead of 64 and seems to do what I need. I am still looking for improvements.
By the way, is Aspose still not multithread-friendly?

Farhan.Raza · February 21, 2019, 6:59pm

@Simon_Karpen

Thank you for sharing the requested data.

We have been able to reproduce the slow performance with the shared solution and have also verified the improved performance of the TextSegment approach. We have logged a ticket with ID PDFNET-46046 in our issue management system for further investigation. We will let you know as soon as a significant update is available.

Moreover, Aspose.PDF for .NET is a multi-thread safe API as long as only one thread works on a given document at a time. Different threads can therefore safely work on different documents at the same time.
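
The one-thread-per-document rule described above could be applied along these lines. This is a minimal sketch rather than an official pattern: `PerDocumentParallelism`, `ProcessAll`, and the file names are hypothetical, and the per-file delegate stands in for code that would open, modify, and save its own Aspose.Pdf.Document.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PerDocumentParallelism
{
    // Runs processOne once per path. Each invocation is expected to create and
    // dispose its own document object, so no document is shared between threads.
    public static IDictionary<string, TResult> ProcessAll<TResult>(
        IEnumerable<string> paths, Func<string, TResult> processOne)
    {
        var results = new ConcurrentDictionary<string, TResult>();
        Parallel.ForEach(paths, path =>
        {
            results[path] = processOne(path);
        });
        return results;
    }

    public static void Main()
    {
        // Hypothetical file names.
        var paths = new[] { "a.pdf", "b.pdf", "c.pdf" };

        // Placeholder work: a real delegate would open the document at `path`,
        // run the absorber, save the result, and return something like a match count.
        var results = ProcessAll(paths, path => path.Length);

        foreach (var pair in results)
        {
            Console.WriteLine(pair.Key + " -> " + pair.Value);
        }
    }
}
```

This only helps when there are several documents to process; it does not speed up the work on a single slow document.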

Simon_Karpen · April 1, 2019, 9:03am

Hello @Farhan.Raza!
Is there any update for me?

Farhan.Raza · April 1, 2019, 6:48pm

@Simon_Karpen

Thank you for getting back to us.

Please note that PDFNET-46046 was logged under the free support model, where tickets are scheduled on a first-come, first-served basis. It was logged only recently and may take some months to be resolved.

However, we also offer Paid Support, where issues are investigated with higher priority. If the reported issue is a blocker for you, please consider subscribing to Paid Support. For further information, please visit the Paid Support FAQs.

Simon_Karpen · January 10, 2020, 2:17pm

Hello! Is there any update?
We also got a PDF that behaves differently with segment.Text = "". In that document, the text is shifted to the left. Is there another way to clear the text without shifting the line, so that the empty space is preserved?

We use Aspose 19.0.3

Adnan.Ahmad · January 10, 2020, 4:44pm

@Simon_Karpen,

I regret to inform you that the issue is still unresolved. We are working on it and will share good news with you soon.

Could you please share the files along with sample code so that we can investigate further? Please also try the latest version, Aspose.PDF 20.1, on your end before sharing the requested information.

Simon_Karpen · January 13, 2020, 2:54pm

I’ve tried the 20.1 trial and the result is the same. I’ve created a separate post to discuss this question: "Remove text from PDF document using Aspose.PDF for .NET - The empty space is not preserved".

Adnan.Ahmad · January 13, 2020, 8:58pm

@Simon_Karpen,

Sure, we are available to help and will respond in your other thread.

asad.ali · February 18, 2020, 6:31pm

@Simon_Karpen

We have investigated the issue. We found that performance is at least 20% better with Aspose.PDF 20.1 than with version 19.2.
Alternatively, you can specify the following option:

textFragmentAbsorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);

Adjustment is not required when removing text, and disabling it makes processing faster.

Unfortunately, we have no other method of removing text apart from the scenario mentioned above. Text editing is a time-consuming operation for many reasons, so we have no further advice for improving performance yet.

We have created task PDFNET-47562 to add an improved mechanism for removing text. Regretfully, we cannot promise a timeline for it.

We have also tested the scenario with 20.2 version of the API and results were better using the following code:

using (var pdf = new Document(@"NewPackage83a - Text Tags Recognition Test.pdf"))
{
    var RegexPattern = "{?{?t:[sitdcfcr];r:[yn];o:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];(l:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];)?(c:[0-9]+;)?(i:[0-9]+;)?(dd:[\x22\u201C\u201D\u201E][\\w\\s]+?,?.+?[\x22\u201C\u201D\u201E];)?(w:[0-9]+;)?(h:[0-9]+;)?(v:[\x22\u201C\u201D\u201E][\\w\\s]+[\x22\u201C\u201D\u201E];)?(rn:[\x22\u201C\u201D\u201E][\\w\\s-]+[\x22\u201C\u201D\u201E];)?(rv:[\x22\u201C\u201D\u201E][\\w\\s-]+[\x22\u201C\u201D\u201E];)?(checked:[0-9]+;)?}?}";
    var textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(RegexPattern);
    var textSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;                
    textFragmentAbsorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);

    pdf.Pages.Accept(textFragmentAbsorber);
    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = "".PadLeft(textFragment.Text.Length);
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    pdf.Save(@"46046_out.pdf");