Free Support Forum - aspose.com

From PDF OCR to Word

Hello everybody,
I have a question: I created a PDF by running OCR on an image. Then I tried to convert it to a Word document (DOC or DOCX), but the Word document contains an image instead of text.
This is the code I used:

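// Requires: using Aspose.Pdf;
// stream holds the OCR-produced PDF; path is the output .docx file.
var pdf = new Aspose.Pdf.Document(stream);

DocSaveOptions saveOptions = new DocSaveOptions();

// Flow mode produces editable, reflowable text rather than positioned text boxes.
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;

pdf.Save(path, saveOptions);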

Is there an option for this case? Or is there a different way to get a Word document with text from an OCR PDF?
Thank you.

@alessioabb

Would you kindly share the PDF you are trying to convert into DOC/DOCX format? We will test the scenario in our environment and address it accordingly.

Here you are.
The PDF was produced with Tesseract:
abOCR.pdf (1.4 MB)

Thank you.

@alessioabb

We were able to replicate the issue in our environment and logged it as PDFNET-47022 in our issue tracking system. We will look further into the details of the issue and keep you posted on the status of its correction. Please be patient and spare us a little time.

We are sorry for the inconvenience.

Is there any news about the resolution of this problem?

@alessioabb

We are afraid that the earlier logged ticket could not be resolved. However, could you please try the code snippet below with version 21.8 of the API and let us know whether it resolves the issue?

// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
// dataDir is the folder that contains source.pdf.
Document pdfDocument = new Document(dataDir + @"source.pdf");

foreach (Page page in pdfDocument.Pages)
{
    // Collect every text fragment on the page.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);

    // Switch each fragment to ordinary filled text.
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }

    // Remove the images from the page resources.
    page.Resources.Images.Clear();
}

DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;

pdfDocument.Save(dataDir + @"output.docx", saveOptions);
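
A possible way to check whether this workaround applies to a given file: the snippet above forces every text fragment to FillText, which suggests the OCR text layer was stored in a non-visible rendering mode on top of the scanned image (Tesseract is generally understood to write its PDF text layer that way). The sketch below only counts text fragments per rendering mode and changes nothing. It reuses the classes from the snippet above; the class name RenderingModeReport and the default path "source.pdf" are made up for illustration, and it has not been run against abOCR.pdf.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class RenderingModeReport
{
    static void Main(string[] args)
    {
        // Hypothetical input path; pass your own OCR-produced PDF as the first argument.
        string inputPath = args.Length > 0 ? args[0] : "source.pdf";

        Document pdfDocument = new Document(inputPath);

        // Number of text fragments seen for each rendering mode.
        var counts = new Dictionary<TextRenderingMode, int>();

        foreach (Page page in pdfDocument.Pages)
        {
            // Collect the text fragments of this page, as in the snippet above.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            absorber.Visit(page);

            foreach (TextFragment fragment in absorber.TextFragments)
            {
                TextRenderingMode mode = fragment.TextState.RenderingMode;
                counts[mode] = counts.TryGetValue(mode, out int seen) ? seen + 1 : 1;
            }
        }

        // Print one line per rendering mode that occurs in the document.
        foreach (KeyValuePair<TextRenderingMode, int> entry in counts)
        {
            Console.WriteLine($"{entry.Key}: {entry.Value} fragment(s)");
        }
    }
}
```

If most fragments are reported in a mode other than FillText, that would be consistent with the converter treating the page as an image only, and the workaround above is worth trying. If no fragments are reported at all, the PDF probably has no text layer, and changing the rendering mode would not help.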