Copying text from one PDF to another is slow

BSchwab · July 9, 2024, 8:16am

Hello,
We want to copy OCR-recognized text from one PDF to another.

This example source code works but is very slow: the textBuilder.AppendText call takes about 3 minutes for one page of text.

Why is the method so slow? Are there any faster methods to copy text from one PDF to another?

// Requires: using System.Collections.Generic; using Aspose.Pdf.Text;
public void CopyOCRx(string file_ocr, string file_non_ocr)
{
    using (Aspose.Pdf.Document asposeDocTarget = new Aspose.Pdf.Document(file_non_ocr))
    {
        Aspose.Pdf.Page page_new = asposeDocTarget.Pages[1];
        using (Aspose.Pdf.Document asposeDoc = new Aspose.Pdf.Document(file_ocr))
        {
            // Match every run of non-whitespace characters on the first OCR page.
            System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[\S]+");
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
            Aspose.Pdf.Page page = asposeDoc.Pages[1];
            page.Accept(textFragmentAbsorber);
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            TextBuilder textBuilder = new TextBuilder(page_new);
            List<TextFragment> list = new List<TextFragment>();
            foreach (TextFragment textFragment in textFragmentCollection)
            {
                list.Add(textFragment);
            }
            // This call is the slow step (about 3 minutes for one page).
            textBuilder.AppendText(list);
        }
        asposeDocTarget.Save("file_ocr_new");
    }
}

asad.ali · July 9, 2024, 5:59pm

@BSchwab

Could you please share a sample PDF document for our reference, along with details of your environment and how long the API takes on your end? Please also make sure that you have tested with version 24.6 of the API. We will then proceed to assist you accordingly.
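
A minimal sketch of one way to capture such a timing, assuming the standard .NET System.Diagnostics.Stopwatch class. The helper name TimingHelper.Measure is hypothetical and not part of Aspose.PDF; the commented usage lines assume the textBuilder and list variables from the sample code above.

```csharp
using System;
using System.Diagnostics;

public static class TimingHelper
{
    // Hypothetical helper: runs the given action once and returns
    // the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}

// Usage, assuming the textBuilder and list variables from the sample above:
// TimeSpan elapsed = TimingHelper.Measure(() => textBuilder.AppendText(list));
// Console.WriteLine($"AppendText took {elapsed.TotalSeconds:F1} s");
```

Timing only the AppendText call, rather than the whole method, separates the cost of appending text from the cost of loading and saving the documents.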

BSchwab · July 10, 2024, 6:54am

non_ocr.pdf (1.5 MB)
ocr.pdf (1.8 MB)

The aim is to copy the text from ocr.pdf into non_ocr.pdf.
The runtime is almost exactly 3 minutes.

The environment should not matter; the expectation is that a “simple” text copy finishes within seconds, not 3 minutes. A large document takes hours (and crashes after about 2 hours with a memory exception).

Anyway, my system:
Win 10
Intel(R) Core™ i7-1185G7 @ 3.00GHz
32 GB RAM

BSchwab · July 10, 2024, 7:57am

Another point: if I try the following:

  foreach (TextFragment textFragment in textFragmentCollection)
  {
      // textBuilder.AppendText(textFragment);
      page_new.Paragraphs.Add(textFragment);
  }

an exception is thrown at

asposeDocTarget.Save

→ System.NullReferenceException

But the runtime for adding the text is quite good.

asad.ali · July 10, 2024, 2:19pm

@BSchwab

We were able to replicate both issues in our environment. We have logged them as below in our issue tracking system.

PDFNET-57613 - Time taken by the API while copying text
PDFNET-57614 - Exception while saving PDF after adding Text Fragment

We will look into these details and let you know as soon as the tickets are resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

BSchwabVal · November 12, 2024, 10:54am

Hello, what is the status here?
From my point of view, these are standard features of the API that simply should work.

asad.ali · November 12, 2024, 7:56pm

@BSchwabVal

We agree with you, and we will certainly investigate the feasibility of supporting and adding these features to the API. However, tickets are prioritized on a first-come, first-served basis under the Free Support model. As soon as we make progress towards resolving these ticket(s), we will inform you via this forum thread. Please be patient and spare us some time.

We are sorry for the inconvenience.

BSchwabVal · November 14, 2024, 10:44am

Unfortunately, I am very dissatisfied.
This is a basic feature of the API; if you buy it, you expect it to work, and if it does not, it should be fixed as a priority. “First come, first served” sounds very unsatisfactory when no specific time frame is given.
Apart from that, we have several “paid tickets” that have not been resolved for a year, so this doesn’t seem to help either. And even then, we have to buy another license to be able to use the version with the bug fixes. Unfortunately, this is all very disappointing.

asad.ali · November 14, 2024, 6:26pm

@BSchwabVal

Please accept our humble apology for the inconvenience you have been facing. Please note that paid support does not guarantee an immediate resolution. It only expedites the investigation, because the ticket receives attention on a priority basis. The resolution time of a ticket depends on many factors, such as the complexity of the issue, the structure and complexity of the document, and the modules and components of the API involved in the scenario.

Aspose.PDF is a massive API with thousands of modules, and it takes a certain amount of time to carry out a proper in-depth investigation for cases like yours, which relate to the performance of the API and are complex in nature.

Nevertheless, we have recorded your comments and used them to raise the ticket(s) priority to the next level. We will consider your concerns and let you know once we have some updates.

aspose.notifier · December 20, 2024, 11:33pm

The issues you have found earlier (filed as PDFNET-57613, PDFNET-57614) have been fixed in Aspose.PDF for .NET 24.12.

BSchwabVal · January 6, 2025, 11:34am

Thanks for the bugfix, the problem seems to be fixed. Now I can test the actual code and have one more question: copying the text does not seem to work properly with rotated documents; the text is copied incorrectly.
With non-rotated “normal” PDFs, everything seems to work correctly.

–

Here is the original (rotated) PDF: 12.pdf (46,3 KB)

The PDF with OCR layer: 12_ocr.pdf (157,2 KB)

New created Aspose PDF: 12_aspose.pdf (79,3 KB) (the text from 12_ocr.pdf was copied into the 12.pdf and saved as 12_aspose.pdf)

Here is the code:

      string file_ocr = "C:\\Temp\\12_ocr.pdf";
      string file_org = "C:\\Temp\\12.pdf";            
      string file_new = "C:\\Temp\\12_aspose.pdf";

      CopyAspose(file_ocr, file_org, file_new);

  public static void CopyAspose(string file_ocr, string file_non_ocr, string file_new)
  {
      using (Aspose.Pdf.Document asposeDocTarget = new Aspose.Pdf.Document(file_non_ocr))
      {
          Aspose.Pdf.Page page_new = asposeDocTarget.Pages[1];
          using (Aspose.Pdf.Document asposeDoc = new Aspose.Pdf.Document(file_ocr))
          {
              System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[\S]+");
              TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
              Aspose.Pdf.Page page = asposeDoc.Pages[1];
              page.Accept(textFragmentAbsorber);
              TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
              TextBuilder textBuilder = new TextBuilder(page_new);
              List<TextFragment> list = new List<TextFragment>();
              foreach (TextFragment textFragment in textFragmentCollection)
              {
                  list.Add(textFragment);
              }
              textBuilder.AppendText(list);
          }
          asposeDocTarget.Save(file_new);
      }
  }

asad.ali · January 6, 2025, 2:29pm

@BSchwab

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58960

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

aspose.notifier · February 14, 2025, 5:38pm

The issues you have found earlier (filed as PDFNET-58960) have been fixed in Aspose.PDF for .NET 25.2.

BSchwab · March 3, 2025, 1:53pm

Thank you for the bugfix.

However, I have another issue. It might just be an edge case, but with very large documents, a System.OutOfMemoryException occurs. Even with fewer pages, the memory usage increases significantly.


 using (Aspose.Pdf.Document asposeDocOCR = new Aspose.Pdf.Document(@"C:\Temp\pflanze_ocr.pdf"))
 {
     int pageCount = asposeDocOCR.Pages.Count;
     using (Aspose.Pdf.Document asposeDocTarget = new Aspose.Pdf.Document(@"C:\Temp\pflanze.pdf"))
     {
         // Copy the OCR text page by page (Aspose page indices start at 1).
         for (int i = 1; i <= pageCount; i++)
         {
             Aspose.Pdf.Page pageTarget = asposeDocTarget.Pages[i];
             Aspose.Pdf.Page pageSource = asposeDocOCR.Pages[i];

             System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[\S]+");
             TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
             pageSource.Accept(textFragmentAbsorber);
             TextBuilder textBuilder = new TextBuilder(pageTarget);
             List<TextFragment> list = new List<TextFragment>();
             foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
             {
                 list.Add(textFragment);
             }
             textBuilder.AppendText(list);
         }
         // Save to a file path; the original path ended in a stray backslash.
         asposeDocTarget.Save(@"C:\Temp\pflanze_new.pdf");
     }
 }

Exception: exc.png (82.0 KB)
File1: pflanze.pdf (2.6 MB)
File2 (7z): pflanze_ocr.7z (133.3 KB)

asad.ali · March 3, 2025, 6:18pm

@BSchwab

For the previous fix to take effect, the original code needs a small change: set TextState.Invisible to false on each text fragment before adding it to the list:

 foreach (TextFragment textFragment in textFragmentCollection)
 {
     textFragment.TextState.Invisible = false;
     list.Add(textFragment);
 }

For the issue that you have now mentioned, we have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-59454

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.