Convert Scanned PDF with OCR to DOCX using Aspose.PDF for .NET - Text is not readable in output

BHuggins-LRS · September 16, 2020, 4:16pm

We are trying to convert a PDF document that was scanned and processed with OCR. The PDF has searchable text when viewed in Adobe. When we use Aspose to convert the PDF to DOCX, the output DOCX contains only images of the PDF pages. We based our testing on the article here (Convert PDF to Microsoft Word Documents in .NET|Aspose.PDF for .NET).

Does the PDF need to be a certain version?
Does the PDF need to have certain pieces when it is created to allow document conversion?
What can we do to prevent the output from being an image and allow the text to be editable?
Is there another way Aspose can handle OCR PDFs?

asad.ali · September 17, 2020, 6:20pm

@BHuggins-LRS

The text layer over scanned PDF documents (or OCR'd PDF files) is usually invisible. You can try the following code snippet to make it visible and keep it intact during conversion to DOCX format.

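using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that holds source.pdf (defined elsewhere).
Document pdfDocument = new Document(dataDir + @"source.pdf");

foreach (var page in pdfDocument.Pages)
{
    // Collect every text fragment on the page, including the hidden OCR layer.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        // Switch the fragment from its current rendering mode to ordinary filled text.
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }
    // Remove the scanned page images so only the text layer is converted.
    page.Resources.Images.Clear();
}

// Save as DOCX in Flow mode, which aims for editable, reflowable text.
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;
pdfDocument.Save(dataDir + @"output.docx", saveOptions);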

Regarding your other questions: the API has no limitations on the PDF version, and the PDF does not need to contain particular pieces to be converted. However, please share one of your sample PDF files with us so that we can try to replicate the issue in our environment and address it accordingly.
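Before sharing a file, it may help to confirm that the text layer really is stored as invisible text. The following is a minimal diagnostic sketch, not a fix: it reports, per page, how many text fragments use the invisible rendering mode, which fonts they reference, and how many images the page holds. The class name TextLayerInspector and the default path source.pdf are placeholders chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class TextLayerInspector
{
    static void Main(string[] args)
    {
        // Placeholder path: pass your own PDF as the first argument.
        string path = args.Length > 0 ? args[0] : "source.pdf";
        Document pdfDocument = new Document(path);

        foreach (Page page in pdfDocument.Pages)
        {
            // Gather all text fragments on this page.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            page.Accept(absorber);

            int invisibleCount = 0;
            var fontNames = new HashSet<string>();

            foreach (TextFragment fragment in absorber.TextFragments)
            {
                // OCR text layers are typically stored with the Invisible mode.
                if (fragment.TextState.RenderingMode == TextRenderingMode.Invisible)
                {
                    invisibleCount++;
                }

                // Record the font each fragment refers to, if any.
                string fontName = fragment.TextState.Font?.FontName;
                if (!string.IsNullOrEmpty(fontName))
                {
                    fontNames.Add(fontName);
                }
            }

            Console.WriteLine(
                $"Page {page.Number}: {absorber.TextFragments.Count} fragments, " +
                $"{invisibleCount} invisible, {page.Resources.Images.Count} images, " +
                $"fonts: {string.Join(", ", fontNames)}");
        }
    }
}
```

If most fragments are reported as invisible and each page holds a single large image, the file is most likely a scan with a hidden OCR layer, which is the case the snippet above targets.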

BHuggins-LRS · September 17, 2020, 7:38pm

Thank you for your reply. We have added the code you provided, but the results are not quite what we expected. The text does not display after converting the document to DOCX. It seems as though the text is there but is invisible.

If I set a font on the ‘invisible’ text within Word, the text appears, and the words found when searching the document are in the correct locations.

Removing the line “page.Resources.Images.Clear();” does allow the text to appear, and the text can be searched, but the words are not located in the correct places.
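Since assigning a font in Word makes the text appear, one thing that might be worth trying is assigning a font to each fragment before conversion. This is an untested sketch under that assumption, not a confirmed workaround: it assumes a font named "Arial" is installed on the machine, and the class name FontFallbackSketch and the file names are placeholders.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

class FontFallbackSketch
{
    static void Main()
    {
        // Placeholder input path.
        Document pdfDocument = new Document("source.pdf");

        // Assumption: Arial is available on this machine.
        Aspose.Pdf.Text.Font fallbackFont = FontRepository.FindFont("Arial");

        foreach (Page page in pdfDocument.Pages)
        {
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            page.Accept(absorber);

            foreach (TextFragment fragment in absorber.TextFragments)
            {
                // Make the hidden OCR text visible, as in the earlier snippet.
                fragment.TextState.RenderingMode = TextRenderingMode.FillText;
                // Replace the OCR layer's font with an installed one.
                fragment.TextState.Font = fallbackFont;
            }

            // Drop the scanned images so only the text layer is converted.
            page.Resources.Images.Clear();
        }

        DocSaveOptions saveOptions = new DocSaveOptions();
        saveOptions.Format = DocSaveOptions.DocFormat.DocX;
        saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
        pdfDocument.Save("output.docx", saveOptions);
    }
}
```

Replacing the font can change glyph widths, so word positions in the output may shift; whether the text then shows up in Word without manual changes would need to be checked against the attached sample.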

word_searchable.pdf (121.8 KB)

asad.ali · September 18, 2020, 7:30pm

@BHuggins-LRS

We have logged an issue as PDFNET-48790 in our issue tracking system for further investigation of this case. We will look into its details and keep you posted on the status of the ticket. Please be patient and allow us some time.

We are sorry for the inconvenience.