PDF OCR to Word .docx: each page is an image

nqzw · January 26, 2022, 12:52am

I have an issue where converting an OCR'd PDF to Word produces a .docx in which each page is an image. I’ve tried all the available options, and each Word page is still an image.

            var destPath = Path.Combine(data.OutPutPath, data.FileName);
            File.Copy(data.SourcePath, destPath);

            // Load the PDF into an Aspose.Pdf Document
            var document = new Aspose.Pdf.Document(destPath);

            // Save the document in DOCX format
            DocSaveOptions saveOptions = new DocSaveOptions
            {
                Format = DocSaveOptions.DocFormat.DocX,
                // Set the recognition mode to Flow
                Mode = DocSaveOptions.RecognitionMode.Flow,
                // Horizontal proximity of 2.5 (currently commented out)
                //RelativeHorizontalProximity = 2.5f,
                // Recognize bullets during the conversion process
                RecognizeBullets = true,
            };

            document.Save(Path.Combine(data.OutPutPath, $"{data.DocumentId}.docx"), saveOptions);

asad.ali · January 26, 2022, 4:04am

@nqzw

Could you please try the code snippet below and let us know if you still face any issues? Please share your sample PDF file with us if the issue persists:

// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
Document pdfDocument = new Document(dataDir + @"source.pdf");

foreach (var page in pdfDocument.Pages)
{
    // Collect every text fragment on the page
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);

    // Switch each fragment to the filled-text rendering mode
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }

    // Remove the image resources from the page
    page.Resources.Images.Clear();
}

DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;

pdfDocument.Save(dataDir + @"output.docx", saveOptions);

nqzw · January 26, 2022, 3:56pm

This looks to be working out perfectly so far. Thank you for this.

We do have another issue related to Aspose. We have functionality to edit a PDF document via our Flagship web client portal. The edits are done through a third-party utility called PDFTron, so you can redact, add edit marks (e.g., text strikeouts), and so on. All these edits are tracked and saved to the source PDF via PDFTron. Even before your help, we were having an issue where some of the edits don’t come through. I have attached an example document and the final Word document, and you can see that some edits are missing. Do you have any idea how to get these edits saved correctly to the final .docx?

I tried adding this line to your code, but it didn’t help:
saveOptions.TryMergeAdjacentSameBackgroundImages = true;

nqzw · January 26, 2022, 3:58pm

Here is a Google Drive link to the files:

https://drive.google.com/file/d/1-2i_Ac4XVoX_F7KmnC7NbWwQFpFJT8y3/view?usp=sharing

asad.ali · January 26, 2022, 8:55pm

@nqzw

It looks like the annotations are not being exported when converting to DOCX. We have observed that the black redacted portion was not actually redacted in the output Word file. Could you please confirm whether we have understood the issue correctly, so that we can proceed to assist you accordingly?

nqzw · January 26, 2022, 10:56pm

Yes, you are correct in your analysis.

asad.ali · January 27, 2022, 11:39am

@nqzw

We are checking it and will get back to you shortly.

asad.ali · February 4, 2022, 8:38pm

@nqzw

We have noticed that the annotations were also deleted by the line page.Resources.Images.Clear(); and were therefore lost in the output Word document. An issue, PDFNET-51294, has been logged in our issue tracking system for further investigation. We will look into its details and keep you posted on the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.
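
Until PDFNET-51294 is resolved, one possible interim workaround, offered only as an untested sketch: since the annotations are lost along with the images that page.Resources.Images.Clear() removes, the loop could skip that call on any page that carries annotations. This assumes that leaving the images in place on those pages keeps the edits in the conversion, which has not been verified against the attached file, and those pages may still come through as images in the .docx. The dataDir value and the file names are placeholders.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

class Program
{
    static void Main()
    {
        // Placeholder folder; replace with your own path.
        string dataDir = @"C:\data\";
        Document pdfDocument = new Document(dataDir + "source.pdf");

        foreach (Page page in pdfDocument.Pages)
        {
            // Switch every text fragment to the filled-text rendering mode.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            absorber.Visit(page);
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                fragment.TextState.RenderingMode = TextRenderingMode.FillText;
            }

            // Assumption: only clear images on pages with no annotations,
            // so pages edited in PDFTron keep their resources untouched.
            if (page.Annotations.Count == 0)
            {
                page.Resources.Images.Clear();
            }
        }

        DocSaveOptions saveOptions = new DocSaveOptions();
        saveOptions.Format = DocSaveOptions.DocFormat.DocX;
        saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
        saveOptions.RelativeHorizontalProximity = 2.5f;
        saveOptions.RecognizeBullets = true;

        pdfDocument.Save(dataDir + "output.docx", saveOptions);
    }
}
```

The trade-off is that annotated pages are handled differently from the rest of the document, so it is worth comparing the output for a file that mixes edited and unedited pages.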