[Aspose.PDF.Drawing]Pdf To Docx issue with OCR of a document

IBurzoEvoRWS · May 15, 2025, 9:55am

Hello. Our company uses Aspose.PDF to convert PDFs to DOCX.
The code is as follows:

using (var pdfDocument = new Aspose.Pdf.Document(pdfFilePath))
{
    var saveOptions = new DocSaveOptions
    {
        Format = DocSaveOptions.DocFormat.DocX,
        Mode = DocSaveOptions.RecognitionMode.Flow,
        RecognizeBullets = true,
        AddReturnToLineEnd = false,
        RelativeHorizontalProximity = 2.5f
    };

    try
    {
        pdfDocument.Save(asposePdfSavePath, saveOptions);
        Console.WriteLine("<<aspose pdf to docx>> OK");
    }
    catch (Exception)
    {
        // Rethrow with "throw;" so the original stack trace is preserved
        // ("throw ex;" would reset it).
        throw;
    }
}

Using these options, and the attached document here:
source.pdf (1.6 MB)

The generated DOCX contains only pictures and no selectable text. We want to be able to extract the text from it after the conversion.
The versions I tested were:
[Aspose.PDF.Drawing 2024.10.0]
[Aspose.PDF.Drawing 2025.04.0]
We are using Aspose.PDF.Drawing because our solution runs on Linux (Amazon Linux 2 base image) with .NET 8.

Any idea if we could maybe try other saveOption combinations or if there is a solution for this?
We would like for this conversion to OCR the document and provide the text in text form.
Sautinsoft does something similar. Please check the attached pictures, which show both the image and the text
in the document at the same time.
Both on top of each other:
image.png (249.4 KB)

Side by side, after moving the image away from the text:
image.png (153.7 KB)

In the second pic you can see the selectable text on the left, and the image on the right.
We would like something similar instead of just the image.
Thank you for reading. I hope to hear from you soon.

Professionalize.Discourse · May 15, 2025, 9:56am

@IBurzoEvoRWS

To address the issue of converting a PDF to a DOCX format while ensuring that the text is selectable and not just images, you may need to adjust your approach to OCR (Optical Character Recognition) within the Aspose.PDF library.

The code you provided uses DocSaveOptions with RecognitionMode.Flow, which rebuilds the text already present in the PDF as flowing, editable paragraphs. If the output still contains only images, the source PDF most likely has no text layer (for example, it is a scan), so the converter has no text to lay out and an OCR step is needed before or alongside the conversion.
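
One way to check whether the source PDF has a text layer at all is to extract its text before converting. The following is only a minimal sketch, assuming the Aspose.Pdf TextAbsorber class is available in your Aspose.PDF.Drawing build; the helper names ExtractText and HasTextLayer and the 20-character threshold are hypothetical choices made for illustration.

```csharp
using System;
using System.Linq;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper: treats the PDF as having a text layer when the
// extracted text holds at least minChars non-whitespace characters.
static bool HasTextLayer(string extractedText, int minChars = 20)
{
    if (string.IsNullOrEmpty(extractedText)) return false;
    return extractedText.Count(c => !char.IsWhiteSpace(c)) >= minChars;
}

// Hypothetical helper: collects the text of all pages with TextAbsorber.
static string ExtractText(string pdfFilePath)
{
    using var pdfDocument = new Document(pdfFilePath);
    var absorber = new TextAbsorber();
    pdfDocument.Pages.Accept(absorber); // visits every page
    return absorber.Text;
}

// Example usage (pdfFilePath as in the question):
// Console.WriteLine(HasTextLayer(ExtractText(pdfFilePath))
//     ? "Text layer found: DocSaveOptions can reconstruct it."
//     : "No text layer: the PDF is probably scanned and needs OCR.");
```

If the check reports no text layer, changing DocSaveOptions values alone is unlikely to produce selectable text.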

Here are a few suggestions to improve the OCR results:

1. Check the PDF Content: Ensure that the PDF you are converting contains text that is recognizable. If the text is embedded in images (like scanned documents), you will need to ensure that OCR is applied correctly.
2. Use Different Recognition Modes: Experiment with the other recognition modes available in DocSaveOptions. For instance, you might want to try DocSaveOptions.RecognitionMode.Textbox or DocSaveOptions.RecognitionMode.EnhancedFlow to see if they yield better results.
3. Adjust Save Options: You can also try modifying other properties in DocSaveOptions, such as:
   - AddReturnToLineEnd: Set this to true if you want to ensure that line breaks are added at the end of lines.
   - RelativeHorizontalProximity: Adjust this value to see if it affects the recognition accuracy.
4. Utilize External OCR Tools: If the built-in capabilities are not sufficient, consider integrating an external OCR tool like Tesseract. You can extract images from the PDF and then apply Tesseract to recognize text from those images. This approach may provide better results for complex documents.
5. Review Example Implementations: You can refer to example implementations available in the Aspose documentation or GitHub repositories. These examples often showcase various configurations and might provide insights into achieving better OCR results.
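
The external OCR suggestion can be sketched roughly as follows. This is only an illustration, assuming the page images have already been exported from the PDF to PNG files and that the tesseract command-line tool is installed and on the PATH of the Linux image; the helper names BuildTesseractArgs and RecognizePageImage are hypothetical and not part of Aspose.PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

// Builds the arguments for: tesseract <image> <outputBase> -l <language> txt
// Tesseract appends ".txt" to outputBase itself.
static List<string> BuildTesseractArgs(string imagePath, string outputBase, string language = "eng")
{
    return new List<string> { imagePath, outputBase, "-l", language, "txt" };
}

// Runs the tesseract CLI on one page image and returns the recognized text.
static string RecognizePageImage(string imagePath, string language = "eng")
{
    // Same path as the image, without its extension.
    string outputBase = Path.ChangeExtension(imagePath, null) ?? imagePath;

    var startInfo = new ProcessStartInfo("tesseract")
    {
        RedirectStandardError = true,
        UseShellExecute = false
    };
    foreach (var arg in BuildTesseractArgs(imagePath, outputBase, language))
    {
        startInfo.ArgumentList.Add(arg);
    }

    using var process = Process.Start(startInfo)
        ?? throw new InvalidOperationException("Could not start tesseract.");
    string errors = process.StandardError.ReadToEnd();
    process.WaitForExit();
    if (process.ExitCode != 0)
    {
        throw new InvalidOperationException($"tesseract failed: {errors}");
    }
    return File.ReadAllText(outputBase + ".txt");
}

// Example usage (page1.png is a hypothetical exported page image):
// string text = RecognizePageImage("page1.png");
```

Calling the CLI keeps the sketch free of extra NuGet dependencies. Passing hocr instead of txt as the last argument makes Tesseract write positioned output, which is what you would need to place recognized text over the page image as in the Sautinsoft screenshots.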

Here is a sample code snippet that demonstrates how to set up DocSaveOptions for converting a PDF to DOCX:

using (var pdfDocument = new Aspose.Pdf.Document(pdfFilePath))
{
    var saveOptions = new DocSaveOptions
    {
        Format = DocSaveOptions.DocFormat.DocX,
        Mode = DocSaveOptions.RecognitionMode.Flow, // Try different modes if needed
        RecognizeBullets = true,
        AddReturnToLineEnd = true, // Adjust as necessary
        RelativeHorizontalProximity = 2.5f // Experiment with this value
    };

    try
    {
        pdfDocument.Save(asposePdfSavePath, saveOptions);
        Console.WriteLine("Conversion to DOCX completed successfully.");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error during conversion: {ex.Message}");
    }
}

By experimenting with these options and possibly integrating external OCR tools, you should be able to achieve a DOCX output that includes both selectable text and images, similar to the results you are looking for.

If you continue to face issues, consider reaching out to the Aspose support community for more tailored assistance.