Copying text from one PDF to another is slow

Hello,
we want to copy (recognized OCR) text from one PDF to another.

This example source code works but is very slow (the textBuilder.AppendText call takes about 3 minutes for one page of text).

Why is the method so slow? Are there any faster methods to copy text from one PDF to another?

using System.Collections.Generic;
using Aspose.Pdf.Text;

public void CopyOCRx(string file_ocr, string file_non_ocr)
{
    // Target document (no text layer yet)
    using (Aspose.Pdf.Document asposeDocTarget = new Aspose.Pdf.Document(file_non_ocr))
    {
        Aspose.Pdf.Page page_new = asposeDocTarget.Pages[1];
        // Source document containing the recognized OCR text
        using (Aspose.Pdf.Document asposeDoc = new Aspose.Pdf.Document(file_ocr))
        {
            // Match every run of non-whitespace characters
            System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[\S]+");
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
            Aspose.Pdf.Page page = asposeDoc.Pages[1];
            page.Accept(textFragmentAbsorber);
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            TextBuilder textBuilder = new TextBuilder(page_new);
            List<TextFragment> list = new List<TextFragment>();
            foreach (TextFragment textFragment in textFragmentCollection)
            {
                list.Add(textFragment);
            }
            // This call is the slow step (about 3 minutes for one page)
            textBuilder.AppendText(list);
        }
        asposeDocTarget.Save("file_ocr_new");
    }
}

@BSchwab

Could you please share a sample PDF document for our reference, along with details of your environment and how long the API takes on your end? Please also make sure that you have tested with version 24.6 of the API. We will then assist you accordingly.

non_ocr.pdf (1.5 MB)
ocr.pdf (1.8 MB)

The aim is to copy the text from ocr.pdf into non_ocr.pdf.
The runtime is almost exactly 3 minutes.
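
For reference, a figure like this can be measured by wrapping only the suspect call in a stopwatch. The following is a minimal sketch using only the .NET base class library; the `StepTimer` class and its `Measure` method are hypothetical helper names introduced here for illustration and are not part of Aspose.PDF.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of Aspose.PDF): times a single step.
public static class StepTimer
{
    // Runs the given action once and returns the wall-clock time it took.
    public static TimeSpan Measure(Action step)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        step();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Assuming the variables from the method posted above, a call such as `TimeSpan elapsed = StepTimer.Measure(() => textBuilder.AppendText(list));` would time the append step separately from loading, absorbing, and saving.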

The environment should not play a role; the expectation is that a “simple” text copy finishes within seconds, not in 3 minutes. A large document takes hours (and crashes after about 2 hours with a memory exception).

Anyway, my system:
Windows 10
Intel(R) Core™ i7-1185G7 @ 3.00GHz
32 GB RAM

Another point: if I try the following instead:

  foreach (TextFragment textFragment in textFragmentCollection)
  {
      // textBuilder.AppendText(textFragment);
      page_new.Paragraphs.Add(textFragment);
  }

an exception is thrown at

asposeDocTarget.Save

→ System.NullReferenceException

However, adding the text this way is fast.

@BSchwab

We were able to replicate both issues in our environment. We have logged them in our issue tracking system as follows:

  • PDFNET-57613 - Time taken by the API while copying text
  • PDFNET-57614 - Exception while saving PDF after adding Text Fragment

We will look into the details and let you know as soon as the tickets are resolved. We appreciate your patience in the meantime.

We are sorry for the inconvenience.