Mass change of TextState.Invisible in TextFragments

Hello, is there any chance of a mass update of TextState.Invisible in TextFragments?

The reason is that we want to modify a searchable PDF so that the text layer becomes visible, remove the images, and convert the document to Word. Please find my code below:

Document pdfDocument = new Document(@"C:\temp\original.pdf");

foreach (var page in pdfDocument.Pages)
{
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);

    foreach (TextFragment fragment in absorber.TextFragments)
    {
        fragment.TextState.Invisible = false;
        fragment.TextState.ForegroundColor = Color.FromRgb(System.Drawing.Color.Black);
    }
    page.Resources.Images.Clear();
}

DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;

pdfDocument.Save(@"C:\temp\outvisible.docx", saveOptions);

Regards, Bronislav

@Bronislav.sopik

Please try using the RenderingMode property of the fragment's TextState to make the OCR text visible, as in the following code snippet:

foreach (TextFragment textFragment in textFragments)
{
  // Change RenderingMode property to make text visible
  textFragment.TextState.RenderingMode = TextRenderingMode.FillText;
}

If you still face any issues, please share your sample PDF document with us. We will test the scenario in our environment and address it accordingly.

Hi,

Thank you so much. It actually works, but it takes a really long time for large documents. Do you have any advice on how to make it faster? :)

The main purpose is to convert a searchable PDF to Word or Excel.

Our approach is to take a searchable PDF, remove the images, and make the text layer visible.

Document pdfDocument = new Document(@"C:\temp\original sPDF.pdf");

foreach (var page in pdfDocument.Pages)
{
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);

    foreach (TextFragment fragment in absorber.TextFragments)
    {
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }
    page.Resources.Images.Clear();
}

DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;

pdfDocument.Save(@"C:\temp\or2.docx", saveOptions);
pdfDocument.Save(@"C:\temp\outvisible.xlsx", SaveFormat.Excel);
pdfDocument.Save(@"C:\temp\or2.pdf", SaveFormat.Pdf);

We are also facing an issue with the font size: it does not correspond to the original.

Thank you, Bronislav

@Bronislav.sopik

You are already absorbing text at page level, which is the recommended approach for faster processing. However, we need to investigate the long processing time further. For that purpose, we may need a sample PDF document from you so that we can replicate the issue in our environment and address it accordingly.
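In the meantime, a rough way to narrow down where the time goes is to time the page loop and the save step separately. The following is only a minimal sketch under assumptions: the file paths and the class name PageTimingSketch are hypothetical, and it reuses the same absorber and RenderingMode calls shown earlier in this thread rather than any dedicated profiling API.

```csharp
using System;
using System.Diagnostics;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class PageTimingSketch
{
    static void Main()
    {
        // Hypothetical input path; replace with your own searchable PDF.
        using (Document pdfDocument = new Document(@"C:\temp\original.pdf"))
        {
            Stopwatch loopTimer = Stopwatch.StartNew();

            foreach (Page page in pdfDocument.Pages)
            {
                Stopwatch pageTimer = Stopwatch.StartNew();

                // Same page-level absorption as in the code above.
                TextFragmentAbsorber absorber = new TextFragmentAbsorber();
                absorber.Visit(page);

                foreach (TextFragment fragment in absorber.TextFragments)
                {
                    fragment.TextState.RenderingMode = TextRenderingMode.FillText;
                }

                pageTimer.Stop();
                Console.WriteLine("Page {0}: {1} fragments, {2} ms",
                    page.Number, absorber.TextFragments.Count, pageTimer.ElapsedMilliseconds);
            }

            loopTimer.Stop();
            Console.WriteLine("Text update loop: {0} ms", loopTimer.ElapsedMilliseconds);

            // Time the conversion on its own so it is not confused with the loop.
            DocSaveOptions saveOptions = new DocSaveOptions();
            saveOptions.Format = DocSaveOptions.DocFormat.DocX;

            Stopwatch saveTimer = Stopwatch.StartNew();
            pdfDocument.Save(@"C:\temp\timing-test.docx", saveOptions); // hypothetical output path
            saveTimer.Stop();
            Console.WriteLine("Save to DOCX: {0} ms", saveTimer.ElapsedMilliseconds);
        }
    }
}
```

If most of the time is reported for particular pages, their fragment counts are worth including when sharing the sample document; if the save step dominates, the slowdown is more likely in the conversion than in the text update.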

Hello, here is an example file.
temp2.pdf (5.6 MB)

@Bronislav.sopik

We were able to notice the issue in our environment while testing the scenario with Aspose.PDF for .NET 20.6. We have logged it as PDFNET-48445 in our issue tracking system. We will look further into its details and keep you posted on the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

Hi, any update so far?

@Bronislav.sopik

We are afraid that the earlier logged ticket is not yet resolved. Please note that it was logged recently in our issue management system and will be investigated and resolved on a first-come, first-served basis. We will surely inform you as soon as we have definite updates regarding its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

Hello, do you have an expected date for this to be solved?

Regards, Bronislav

@Bronislav.sopik

Regretfully, the issue is not yet resolved. We are afraid that we cannot share an ETA at the moment, as the investigation is not yet complete. As soon as the analysis is done, we will be in a position to share an ETA or updates about its resolution. We will let you know as soon as we make definite progress in this regard. Please give us some time.

We are sorry for the inconvenience.