I have an issue where converting an OCR'd PDF to Word produces a .docx in which every page is an image. I’ve tried all the available options, and each page of the Word document is still an image. This is the code I am using:
var destPath = Path.Combine(data.OutPutPath, data.FileName);
File.Copy(data.SourcePath, destPath);
// load PDF with an instance of Document
var document = new Aspose.Pdf.Document(destPath);
// save document in DOCX format
DocSaveOptions saveOptions = new DocSaveOptions
{
    Format = DocSaveOptions.DocFormat.DocX,
    // Set the recognition mode as Flow
    Mode = DocSaveOptions.RecognitionMode.Flow,
    // Set the Horizontal proximity as 2.5
    //RelativeHorizontalProximity = 2.5f,
    // Enable the value to recognize bullets during conversion process
    RecognizeBullets = true,
};
document.Save($"{data.OutPutPath}\\{data.DocumentId}.docx", saveOptions);
Could you please try the code snippet below and let us know whether you still face any issues? If the issue persists, please share your sample PDF file with us.
using Aspose.Pdf;
using Aspose.Pdf.Text;

Document pdfDocument = new Document(dataDir + @"source.pdf");
foreach (Page page in pdfDocument.Pages)
{
    // Collect every text fragment on the page
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        // Render the text as filled (visible) glyphs
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }
    // Remove all images from the page resources
    page.Resources.Images.Clear();
}
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;
pdfDocument.Save(dataDir + @"output.docx", saveOptions);
This looks to be working out perfectly so far. Thank you for this.
We do have another issue related to Aspose. We have functionality to edit a PDF document via our Flagship web client portal. The edits are made through a third-party utility called PDFTron, so you can redact, add edit marks (e.g. text strikeouts), and so on. All these edits are tracked and saved to the PDF source via PDFTron. Even prior to your help, we were having an issue where some of the edits don’t come through. I have attached an example document and the final Word document, and you can see that some edits are not coming through. Do you have any idea how to get these edits saved correctly to the final .docx?
I tried adding this line to your code, but it didn’t help:
saveOptions.TryMergeAdjacentSameBackgroundImages = true;
It looks like the annotations are not being exported during conversion to DOCX. We observed that the black redacted portion was not actually redacted in the output Word file. Could you please confirm that we have understood the issue correctly so that we can assist you accordingly?
We have noticed that the annotations were also deleted by the line page.Resources.Images.Clear(); and were therefore lost in the output Word document. An issue has been logged as PDFNET-51294 in our issue tracking system for further investigation. We will look into the details and keep you posted on the status of its correction. Please be patient and spare us some time.