Input.pdf (972.0 KB)
For any scanned PDF, how can we edit the content without degrading the quality of the PDF?
We are using OCR: first we convert the scanned PDF to a searchable PDF.
Then we convert the searchable PDF to Word using Aspose.PDF. But the output Word document is not editable; it is still an image.
Once you have a searchable PDF document, you can use the code snippet below to create a Word file that contains editable text:
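// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
Document pdfDocument = new Document(dataDir + @"135a.pdf");
foreach (var page in pdfDocument.Pages)
{
    // Collect every text fragment on the page, including the OCR text layer.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        // Render the text as normal filled glyphs so it is visible in the output.
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }
    // Remove the page images (the scan itself) so only the text remains.
    page.Resources.Images.Clear();
}

// Convert to DOCX in Flow mode, which produces editable, reflowable text.
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
//saveOptions.RelativeHorizontalProximity = 2.5f;
//saveOptions.RecognizeBullets = true;
pdfDocument.Save(dataDir + @"output_flow.docx", saveOptions);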
This solution clears all the images from the scanned PDF file.
If the PDF contains images as well as text, it also removes the images that are needed, so the Word file is not very useful.
Also, for a PDF made from a camera photo of a slip, bill, or similar document, it clears the image and the output document becomes empty.
Is there any option that gives us an editable Word file without many changes compared to the PDF file?
Would you please share the searchable scanned PDF with us that you obtained after OCR? We will test the scenario in our environment and address it accordingly.