Input.pdf (972.0 KB)
For a scanned PDF, how can we edit the content without degrading the quality of the PDF?
We first use OCR to convert the scanned PDF to a searchable PDF.
We then convert the searchable PDF to Word using Aspose.PDF, but the output Word document is not editable; it is still an image.
Once you have a searchable PDF document, you can use the code snippet below to create a Word file that contains editable text:
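using Aspose.Pdf;
using Aspose.Pdf.Text;

// Load the searchable PDF produced by OCR (dataDir is your working folder).
Document pdfDocument = new Document(dataDir + @"135a.pdf");
foreach (var page in pdfDocument.Pages)
{
    // Collect every text fragment on the page, including the OCR text layer.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        // Render the text as filled glyphs so it is visible in the output.
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }
    // Remove all images from the page, including the scanned background.
    page.Resources.Images.Clear();
}
// Save as DOCX in Flow mode so the text is editable in Word.
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
//saveOptions.RelativeHorizontalProximity = 2.5f;
//saveOptions.RecognizeBullets = true;
pdfDocument.Save(dataDir + @"output_flow.docx", saveOptions);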
This solution clears all images from the scanned PDF file.
When a PDF contains images as well as text, it also removes the images we need, so the Word file is much less useful.
Also, for a PDF made from a camera photo of a slip, bill, or similar document, it removes the image and the output document becomes empty.
Is there any option to get an editable Word file without significant changes compared to the PDF file?
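One possible direction for the mixed text-and-image case, offered as a rough sketch and not as a confirmed Aspose.PDF feature: instead of clearing every image, remove only images that cover almost the whole page, since those are likely the scanned background, and keep smaller ones such as logos or photos. The `Box` type, the `ScanBackgroundFilter` class, and the 0.9 threshold below are all hypothetical names and values chosen for illustration. The code is plain C# with no Aspose calls; hooking it up would require the page rectangle and each image's placement rectangle from the library (I believe `ImagePlacementAbsorber` can supply these, but please check the API reference).

```csharp
using System;

// Hypothetical helper type, not part of Aspose.PDF: a rectangle in PDF points.
public readonly struct Box
{
    public double Llx { get; }
    public double Lly { get; }
    public double Urx { get; }
    public double Ury { get; }

    public Box(double llx, double lly, double urx, double ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }

    // Area is zero for empty or inverted rectangles.
    public double Area => Math.Max(0, Urx - Llx) * Math.Max(0, Ury - Lly);

    // Overlapping region of two boxes (zero area when they do not overlap).
    public Box Intersect(Box other) => new Box(
        Math.Max(Llx, other.Llx), Math.Max(Lly, other.Lly),
        Math.Min(Urx, other.Urx), Math.Min(Ury, other.Ury));
}

public static class ScanBackgroundFilter
{
    // Returns true when the image covers at least `threshold` of the page area.
    // This sketch assumes such an image is the scanned page itself.
    public static bool LooksLikeScanBackground(Box page, Box image, double threshold = 0.9)
    {
        if (page.Area <= 0) return false;
        double covered = page.Intersect(image).Area / page.Area;
        return covered >= threshold;
    }
}
```

With an A4 page of 595 x 842 points, a full-page scan would be flagged for removal while a 200 x 150 point logo would be kept. This heuristic does not help with the camera-photo case: there the only image is the page itself, so the Word output can only contain whatever text OCR recovered.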
Would you please share the searchable scanned PDF with us that you obtained after OCR? We will test the scenario in our environment and address it accordingly.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-54061
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.