Hello,
We want to copy recognized (OCR) text from one PDF to another.
This example source code works but is very slow: the textBuilder.AppendText call takes about 3 minutes for one page of text.
Why is the method so slow? Is there a faster way to copy text from one PDF to another?
public void CopyOCRx(string file_ocr, string file_non_ocr)
{
    using (Aspose.Pdf.Document asposeDocTarget = new Aspose.Pdf.Document(file_non_ocr))
    {
        Aspose.Pdf.Page page_new = asposeDocTarget.Pages[1];
        using (Aspose.Pdf.Document asposeDoc = new Aspose.Pdf.Document(file_ocr))
        {
            // Match every run of non-whitespace characters on the OCR page.
            System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[\S]+");
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
            Aspose.Pdf.Page page = asposeDoc.Pages[1];
            page.Accept(textFragmentAbsorber);
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            TextBuilder textBuilder = new TextBuilder(page_new);
            List<TextFragment> list = new List<TextFragment>();
            foreach (TextFragment textFragment in textFragmentCollection)
            {
                list.Add(textFragment);
            }
            // This call takes about 3 minutes for one page.
            textBuilder.AppendText(list);
        }
        asposeDocTarget.Save("file_ocr_new");
    }
}
@BSchwab
Can you please share the sample PDF document for our reference, along with details of your environment and how long the API takes at your end? Please also make sure that you have tested with version 24.6 of the API. We will then proceed to assist you accordingly.
non_ocr.pdf (1.5 MB)
ocr.pdf (1.8 MB)
The aim is to copy the text from the document ocr.pdf into non_ocr.pdf.
The runtime is almost exactly 3 minutes.
The environment should not play a role; the expectation is that a "simple" text copy finishes within seconds, not in 3 minutes. A large document takes hours (and crashes after about 2 hours with a memory exception).
Anyway, my system:
win 10
Intel(R) Core™ i7-1185G7 @ 3.00GHz
32GB Ram
Another point: if I try the following instead,
foreach (TextFragment textFragment in textFragmentCollection)
{
    // textBuilder.AppendText(textFragment);
    page_new.Paragraphs.Add(textFragment);
}
asposeDocTarget.Save throws a System.NullReferenceException.
The runtime for adding the text is quite good, though.
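One variation that may be worth trying is to rebuild each fragment in the target document instead of re-adding objects that still belong to the source document. The sketch below is untested against the attached files and rests on an assumption not confirmed in this thread: that the slowdown and the exception are related to reusing the source document's TextFragment objects. The class name OcrTextCopy, the method CopyFirstPageText, and the outputPath parameter are hypothetical; it requires the Aspose.PDF for .NET package.

```csharp
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class OcrTextCopy
{
    // Hypothetical helper (not from the thread): copies the text of page 1
    // by creating new TextFragment objects rather than reusing the absorbed ones.
    public static void CopyFirstPageText(string ocrPath, string nonOcrPath, string outputPath)
    {
        using (Document target = new Document(nonOcrPath))
        using (Document source = new Document(ocrPath))
        {
            // Same pattern as in the original post: one fragment per word.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(new Regex(@"\S+"));
            source.Pages[1].Accept(absorber);

            TextBuilder builder = new TextBuilder(target.Pages[1]);
            foreach (TextFragment original in absorber.TextFragments)
            {
                // Copy plain values only, so the new fragment holds no
                // reference to the source document.
                TextFragment copy = new TextFragment(original.Text);
                copy.Position = new Position(original.Position.XIndent, original.Position.YIndent);
                copy.TextState.FontSize = original.TextState.FontSize;
                builder.AppendText(copy);
            }

            // Save while the source document is still open.
            target.Save(outputPath);
        }
    }
}
```

The sketch copies only the text, position, and font size, so the copied words use the default font and their widths may differ slightly from the OCR layer. Whether this avoids PDFNET-57613 or PDFNET-57614 is not known.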
@BSchwab
We were able to replicate both issues in our environment and have logged them in our issue tracking system as follows:
- PDFNET-57613 - Time taken by the API while copying text
- PDFNET-57614 - Exception while saving PDF after adding Text Fragment
We will look into these details and let you know as soon as the tickets are resolved. We appreciate your patience in the meantime.
We are sorry for the inconvenience.
Hello, what is the status here?
From my point of view, these are standard features of the API that should simply work.
@BSchwabVal
We agree with you, and we will certainly investigate the feasibility of supporting these features in the API. However, under the Free Support model tickets are handled on a first-come, first-served basis. As soon as we make progress towards resolving these tickets, we will inform you via this forum thread. We appreciate your patience in the meantime.
We are sorry for the inconvenience.
Unfortunately, I am very dissatisfied.
This is a basic feature of the API. If you buy it, you expect it to work, and if it does not, it should be fixed as a priority. "First come, first served" without any specific time frame is very unsatisfactory.
Apart from that, we have several "paid tickets" that have not been resolved for a year, so that does not seem to help either. And even then, we have to buy another license to be able to use the version with the bug fixes. Unfortunately, this is all very disappointing.
@BSchwabVal
Please accept our apologies for the inconvenience you have been facing. Please note that paid support does not guarantee an immediate resolution. It only expedites the investigation, because the ticket receives attention on a priority basis. The resolution time of a ticket depends on many factors, such as the complexity of the issue, the structure and complexity of the document, and the modules and components of the API involved in the scenario.
Aspose.PDF is a large API with thousands of modules, and a proper in-depth investigation takes time for cases like yours, which concern the performance of the API and are complex in nature.
Nevertheless, we have recorded your comments and used them to raise the priority of the tickets to the next level. We will consider your concerns and let you know once we have an update.